From HPSTM to HPSTM-Gen

Why we’re turning our pose-smoothing Transformer into a generative motion model for VLA research

What HPSTM already does

HPSTM was born as a high-fidelity smoother: it takes jittery 3-D skeletons from a monocular tracker, passes them through a Transformer with a differentiable forward-kinematics (FK) layer, and outputs anatomically valid, low-jerk trajectories. For tasks like vision-to-robot imitation, that single step alone is invaluable.
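For readers who haven't seen the repo, the FK layer is what does the anatomical heavy lifting: the network predicts joint rotations, and a forward-kinematics pass rebuilds joint positions from fixed bone lengths, so limb proportions can never drift. Below is a minimal sketch of such a differentiable FK pass; the parents and bone_offsets inputs are illustrative stand-ins for the repo's actual skeleton definition, not its API.

```python
import torch

def forward_kinematics(rot_mats, bone_offsets, parents):
    """Minimal differentiable FK sketch: local joint rotations -> 3-D positions.

    rot_mats:     (T, J, 3, 3) local rotation matrices per frame
    bone_offsets: (J, 3) fixed bone vectors in each parent's frame
    parents:      parent index per joint, topologically sorted, root first
    """
    T = rot_mats.shape[0]
    glob = [rot_mats[:, 0]]            # global orientation of the root
    pos = [torch.zeros(T, 3)]          # root pinned at the origin
    for j in range(1, len(parents)):
        p = parents[j]
        glob.append(glob[p] @ rot_mats[:, j])           # chain rotations down the tree
        pos.append(pos[p] + glob[p] @ bone_offsets[j])  # rotate the fixed-length bone
    return torch.stack(pos, dim=1)     # (T, J, 3); bone lengths preserved by design
```

Because positions are always reconstructed from rotations plus fixed offsets, gradients flow through the FK pass while bone lengths stay exactly constant.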

Why “just smoothing” is no longer enough

Recent vision-language-action (VLA) models—most notably π0—show that sampling actions from a learned distribution is a game-changer. π0’s flow-matching action expert can start from pure Gaussian noise and iteratively “denoise” its way to a fluent 50 Hz robot control chunk. If HPSTM could do the same for human motion, we would gain:

| Need | Benefit of a generative HPSTM |
| --- | --- |
| Data augmentation | Cheap, limitless synthetic trajectories for imitation learning |
| Planning diversity | Multiple motion roll-outs instead of one “best guess” |
| Probabilistic control | Physics-aware uncertainty that downstream controllers can exploit |

Design sketch: HPSTM-Gen

We keep 90 % of the original network and swap only the learning objective and the inference loop.

| Component | Original HPSTM | HPSTM-Gen |
| --- | --- | --- |
| Backbone | Encoder-decoder Transformer | unchanged |
| Manifold constraint | Differentiable FK | unchanged |
| Training input | Noisy pose → clean pose | Mixed-noise pose $\mathbf{X}_\tau$ |
| Core loss | $\mathcal L_\text{pos}$, $\mathcal L_\text{bone}$, $\mathcal L_\text{smooth}$ | Flow-matching loss |
| Inference | One forward pass | 10-step iterative denoising from pure noise |

Flow-matching loss

\[\mathcal L_\text{FM}\;=\; \mathbb E_{\mathbf{X}_{\text{data}},\;\tau,\;\epsilon} \bigl\| \,f_\theta(\mathbf{X}_\tau, \tau) - (\mathbf{X}_{\text{data}}-\epsilon)\, \bigr\|^{2}, \quad \mathbf{X}_\tau = \tau\,\mathbf{X}_{\text{data}}+(1-\tau)\,\epsilon\]

Here $\tau = 0$ corresponds to pure noise and $\tau = 1$ to clean data, so the network regresses the constant velocity $\mathbf{X}_{\text{data}}-\epsilon$ that carries noise onto the data manifold; the sampler below integrates this field forward with small Euler steps.

Add-ons: we keep the bone-length, velocity, and acceleration terms from the original objective, so every intermediate sample stays on the human kinematic manifold.
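Concretely, a training step could look like the sketch below. It assumes a model signature hpstm_gen(X, context, tau) and a bone_length_loss helper; both are illustrative names under our design sketch, not the repo's current API.

```python
import torch

def fm_training_step(hpstm_gen, bone_length_loss, X_data, context, lambda_bone=0.1):
    """One flow-matching training step on a batch of clean motion clips (B, T, J, 3)."""
    eps = torch.randn_like(X_data)              # pure Gaussian noise (the τ = 0 endpoint)
    tau = torch.rand(X_data.shape[0], 1, 1, 1)  # per-clip flow time τ ∈ (0, 1)
    X_tau = tau * X_data + (1 - tau) * eps      # mixed-noise training input
    v_pred = hpstm_gen(X_tau, context, tau)     # predicted velocity field f_θ(X_τ, τ)
    v_target = X_data - eps                     # dX_τ/dτ for this linear interpolation
    loss_fm = ((v_pred - v_target) ** 2).mean()
    # Add-on: penalize bone-length drift on the one-step estimate of the clean pose
    # (X_τ + (1 - τ)·v equals X_data exactly when v_pred hits the target).
    X_hat = X_tau + (1 - tau) * v_pred
    return loss_fm + lambda_bone * bone_length_loss(X_hat)
```

Velocity and acceleration terms would be added to the return value analogously, computed on X_hat across frames.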

Sampling loop (Euler style)

```python
import torch

def sample(hpstm_gen, fk_project, context, seq_len, n_joints, num_steps=10):
    X = torch.randn(seq_len, n_joints, 3)  # start from pure Gaussian noise (τ = 0)
    dt = 1.0 / num_steps                   # uniform Euler step size
    for k in range(num_steps):
        tau = torch.full((1,), k * dt)     # current flow time τ, conditioning the model
        v = hpstm_gen(X, context, tau)     # predicted velocity field f_θ(X_τ, τ)
        X = fk_project(X + dt * v)         # Euler step, then project back through FK
    return X
```
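Two notes on this sketch: the uniform step size 1/num_steps is the simplest choice, and a tuned schedule α_k is a drop-in replacement; more importantly, projecting through FK after every Euler update means even the earliest, noisiest iterates decode to skeletons with correct bone lengths, so no intermediate sample ever leaves the kinematic manifold.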

What we have not done (yet)

We haven’t run quantitative studies or hardware demos; this post is about the method and the motivation. π0 already proves the recipe works for robot actions—our goal is to bring the same generative power to the human-motion side of VLA so that:

  • VLM ↔ HPSTM-Gen ↔ robot adapters form an end-to-end pipeline;
  • researchers can sample diverse, anatomically correct motions without motion-capture labour.

Next steps

  1. Implement the flow-matching loss in the existing HPSTM repo (branch hpstm-gen).
  2. Prototype a 10-step sampler; visual sanity checks first, quantitative metrics later.
  3. Retarget sampled motions onto a 4-DoF uArm for a live “noise → motion” demo.

If you’re working on VLA, diffusion policy, or robot retargeting and want to join forces, feel free to open an issue or reach me at qifei@seas.upenn.edu.


Turning HPSTM into a sampler is less about chasing novelty and more about giving the VLA community a lightweight, anatomically aware generator that plays nicely with π0-style action experts. Stay tuned for code drops and early videos!