From HPSTM to HPSTM-Gen
Why we’re turning our pose-smoothing Transformer into a generative motion model for VLA research
What HPSTM already does
HPSTM was born as a high-fidelity smoother: it takes jittery 3-D skeletons from a monocular tracker, passes them through a Transformer with a forward-kinematics (FK) layer, and outputs anatomically valid, low-jerk trajectories. For tasks like vision-to-robot imitation, that clean-up step alone is priceless.
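For context, the FK layer is the piece that guarantees validity: the network predicts per-joint rotations (plus bone lengths), and a differentiable forward-kinematics pass converts them into 3-D joint positions, so the output respects the skeleton by construction. Below is a minimal sketch of that idea; the kinematic-tree encoding, tensor shapes, and the name `fk_layer` are illustrative assumptions, not the actual HPSTM code.

```python
import torch

def fk_layer(rotations, bone_lengths, parents):
    """Toy differentiable forward kinematics (illustrative, not the actual HPSTM layer).

    rotations:    (T, J, 3, 3) per-joint rotation matrices, expressed in the parent frame
    bone_lengths: (J,) tensor, length of the bone linking each joint to its parent
    parents:      parent index per joint (parents listed before children), -1 for the root
    """
    T, J = rotations.shape[:2]
    offset = torch.tensor([0.0, 1.0, 0.0])                  # canonical bone direction in the parent frame
    global_rot, positions = [None] * J, [None] * J
    for j in range(J):
        if parents[j] < 0:                                  # root joint: sits at the origin
            global_rot[j], positions[j] = rotations[:, j], torch.zeros(T, 3)
            continue
        p = parents[j]
        global_rot[j] = global_rot[p] @ rotations[:, j]     # accumulate rotations down the tree
        positions[j] = positions[p] + global_rot[p] @ (bone_lengths[j] * offset)
    return torch.stack(positions, dim=1)                    # (T, J, 3) joint positions
```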
Why “just smoothing” is no longer enough
Recent vision-language-action (VLA) models, most notably π0, show that sampling actions from a learned distribution is a game-changer. π0's flow-matching action expert can start from pure Gaussian noise and iteratively "denoise" its way to a fluent 50 Hz robot-control chunk. If HPSTM could do the same for human motion, we would gain:
| Need | Benefit of a generative HPSTM |
|---|---|
| Data augmentation | Cheap, limitless synthetic trajectories for imitation learning |
| Planning diversity | Multiple motion roll-outs instead of one "best guess" |
| Probabilistic control | Physics-aware uncertainty that downstream controllers can exploit |
Design sketch: HPSTM-Gen
We keep 90 % of the original network and swap only the learning objective and the inference loop.
| Component | Original HPSTM | HPSTM-Gen |
|---|---|---|
| Backbone | Encoder-decoder Transformer | unchanged |
| Manifold constraint | Differentiable FK | unchanged |
| Training input | Noisy pose → clean pose | Mixed-noise pose $\mathbf{X}_\tau$ |
| Core loss | $\mathcal L_{\text{pos}}$, $\mathcal L_{\text{bone}}$, $\mathcal L_{\text{smooth}}$ | Flow-matching loss |
| Inference | One forward pass | 10-step iterative denoising from pure noise |
Flow-matching loss
\[\mathcal L_{\text{FM}} \;=\; \mathbb E_{\mathbf{X}_{\text{data}},\,\tau,\,\epsilon}\,\bigl\| f_\theta(\mathbf{X}_\tau) - (\mathbf{X}_{\text{data}} - \epsilon) \bigr\|^2, \qquad \mathbf{X}_\tau = \tau\,\mathbf{X}_{\text{data}} + (1-\tau)\,\epsilon\]

Add-ons: keep the bone-length, velocity, and acceleration terms so every intermediate sample stays on the human kinematic manifold.
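To make the recipe concrete, here is a minimal PyTorch training-step sketch under the formulation above. The names `hpstm_gen`, `fk_project`, and `bone_lengths_of` are placeholders for a hypothetical model interface, and the one-step clean-pose estimate and loss weights are illustrative choices, not a fixed design.

```python
import torch
import torch.nn.functional as F

def training_step(hpstm_gen, fk_project, bone_lengths_of, X_data, context,
                  lambda_bone=1.0, lambda_smooth=0.1):
    """One flow-matching step with kinematic add-ons. X_data: (B, T, J, 3) clean motion."""
    tau = torch.rand(X_data.shape[0], 1, 1, 1)           # per-sequence flow time in [0, 1]
    eps = torch.randn_like(X_data)                        # Gaussian noise endpoint
    X_tau = tau * X_data + (1.0 - tau) * eps              # mixed-noise pose
    v_pred = hpstm_gen(X_tau, context)                    # predicted velocity field
    loss_fm = F.mse_loss(v_pred, X_data - eps)            # flow-matching target

    # Kinematic add-ons, evaluated on the one-step estimate of the clean motion.
    X_hat = fk_project(X_tau + (1.0 - tau) * v_pred)      # X_data ≈ X_tau + (1 - tau) * v
    loss_bone = F.mse_loss(bone_lengths_of(X_hat), bone_lengths_of(X_data))
    vel, acc = X_hat.diff(dim=1), X_hat.diff(dim=1).diff(dim=1)
    loss_smooth = vel.pow(2).mean() + acc.pow(2).mean()
    return loss_fm + lambda_bone * loss_bone + lambda_smooth * loss_smooth
```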
Sampling loop (Euler style)
```python
import torch

def sample_motion(hpstm_gen, fk_project, context, seq_len, n_joints, alphas=(0.1,) * 10):
    X = torch.randn(seq_len, n_joints, 3)   # start from pure Gaussian noise
    for alpha_k in alphas:                   # per-step Euler step sizes (10 uniform steps by default)
        v = hpstm_gen(X, context)            # predicted velocity toward the data manifold
        X = fk_project(X + alpha_k * v)      # Euler update, then FK projection onto valid skeletons
    return X
```
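As a usage sketch (the sequence length and joint count are made-up numbers):

```python
motion = sample_motion(hpstm_gen, fk_project, context, seq_len=120, n_joints=17)  # ≈2 s at 60 fps
```

Projecting through FK after every Euler step is what keeps the intermediate samples anatomically valid, mirroring the add-on terms in the loss.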
What we have not done (yet)
We haven’t run quantitative studies or hardware demos; this post is about the method and the motivation. π0 already proves the recipe works for robot actions—our goal is to bring the same generative power to the human-motion side of VLA so that:
- VLM ↔ HPSTM-Gen ↔ robot adapters form an end-to-end pipeline;
- researchers can sample diverse, anatomically correct motions without motion-capture labour.
Next steps
- Implement the flow-matching loss in the existing HPSTM repo (branch `hpstm-gen`).
- Prototype a 10-step sampler; visual sanity checks first, quantitative metrics later.
- Retarget sampled motions onto a 4-DoF uArm for a live “noise → motion” demo.
If you’re working on VLA, diffusion policy, or robot retargeting and want to join forces, feel free to open an issue or reach me at qifei@seas.upenn.edu.
Turning HPSTM into a sampler is less about chasing novelty and more about giving the VLA community a lightweight, anatomically aware generator that plays nicely with π0-style action experts. Stay tuned for code drops and early videos!