From HPSTM to HPSTM-Gen
Why we’re turning our pose-smoothing Transformer into a generative motion model for VLA research
What HPSTM already does
HPSTM was born as a high-fidelity smoother: it takes jittery 3-D skeletons from a monocular tracker, passes them through a Transformer with a forward-kinematics (FK) layer, and outputs anatomically valid, low-jerk trajectories. For tasks like vision-to-robot imitation, that clean-up step alone is priceless.
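For context, the FK layer is the piece that guarantees validity: the network predicts per-joint rotations (plus bone lengths), and a differentiable forward-kinematics pass converts them into 3-D joint positions, so the output respects the skeleton by construction. Below is a minimal sketch of that idea; the kinematic-tree encoding, tensor shapes, and the name `fk_layer` are illustrative assumptions, not the actual HPSTM code.

```python
import torch

def fk_layer(rotations, bone_lengths, parents):
    """Toy differentiable forward kinematics (illustrative, not the actual HPSTM layer).

    rotations:    (T, J, 3, 3) per-joint rotation matrices, expressed in the parent frame
    bone_lengths: (J,) tensor, length of the bone linking each joint to its parent
    parents:      parent index per joint (parents listed before children), -1 for the root
    """
    T, J = rotations.shape[:2]
    offset = torch.tensor([0.0, 1.0, 0.0])                  # canonical bone direction in the parent frame
    global_rot, positions = [None] * J, [None] * J
    for j in range(J):
        if parents[j] < 0:                                  # root joint: sits at the origin
            global_rot[j], positions[j] = rotations[:, j], torch.zeros(T, 3)
            continue
        p = parents[j]
        global_rot[j] = global_rot[p] @ rotations[:, j]     # accumulate rotations down the tree
        positions[j] = positions[p] + global_rot[p] @ (bone_lengths[j] * offset)
    return torch.stack(positions, dim=1)                    # (T, J, 3) joint positions
```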
Why “just smoothing” is no longer enough
Recent vision-language-action (VLA) models, most notably π0, show that sampling actions from a learned distribution is a game-changer. π0's flow-matching action expert can start from pure Gaussian noise and iteratively "denoise" its way to a fluent 50 Hz robot-control chunk. If HPSTM could do the same for human motion, we would gain:
| Need | Benefit of a generative HPSTM |
|---|---|
| Data augmentation | Cheap, limitless synthetic trajectories for imitation learning |
| Planning diversity | Multiple motion roll-outs instead of one "best guess" |
| Probabilistic control | Physics-aware uncertainty that downstream controllers can exploit |
Design sketch: HPSTM-Gen
We keep 90 % of the original network and swap only the learning objective and the inference loop.
| Component | Original HPSTM | HPSTM-Gen |
|---|---|---|
| Backbone | Encoder-decoder Transformer | unchanged |
| Manifold constraint | Differentiable FK | unchanged |
| Training input | Noisy pose → clean pose | Mixed-noise pose $\mathbf{X}_\tau$ |
| Core loss | $\mathcal L_{\text{pos}}$, $\mathcal L_{\text{bone}}$, $\mathcal L_{\text{smooth}}$ | Flow-matching loss |
| Inference | One forward pass | 10-step iterative denoising from pure noise |
Flow-matching loss
\[\mathcal L_{\text{FM}} \;=\; \mathbb E_{\mathbf{X}_{\text{data}},\,\tau,\,\epsilon}\,\bigl\| f_\theta(\mathbf{X}_\tau) - (\mathbf{X}_{\text{data}} - \epsilon) \bigr\|^2, \qquad \mathbf{X}_\tau = \tau\,\mathbf{X}_{\text{data}} + (1-\tau)\,\epsilon\]

Add-ons: keep the bone-length, velocity, and acceleration terms so every intermediate sample stays on the human kinematic manifold.
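To make the recipe concrete, here is a minimal PyTorch training-step sketch under the formulation above. The names `hpstm_gen`, `fk_project`, and `bone_lengths_of` are placeholders for a hypothetical model interface, and the one-step clean-pose estimate and loss weights are illustrative choices, not a fixed design.

```python
import torch
import torch.nn.functional as F

def training_step(hpstm_gen, fk_project, bone_lengths_of, X_data, context,
                  lambda_bone=1.0, lambda_smooth=0.1):
    """One flow-matching step with kinematic add-ons. X_data: (B, T, J, 3) clean motion."""
    tau = torch.rand(X_data.shape[0], 1, 1, 1)           # per-sequence flow time in [0, 1]
    eps = torch.randn_like(X_data)                        # Gaussian noise endpoint
    X_tau = tau * X_data + (1.0 - tau) * eps              # mixed-noise pose
    v_pred = hpstm_gen(X_tau, context)                    # predicted velocity field
    loss_fm = F.mse_loss(v_pred, X_data - eps)            # flow-matching target

    # Kinematic add-ons, evaluated on the one-step estimate of the clean motion.
    X_hat = fk_project(X_tau + (1.0 - tau) * v_pred)      # X_data ≈ X_tau + (1 - tau) * v
    loss_bone = F.mse_loss(bone_lengths_of(X_hat), bone_lengths_of(X_data))
    vel, acc = X_hat.diff(dim=1), X_hat.diff(dim=1).diff(dim=1)
    loss_smooth = vel.pow(2).mean() + acc.pow(2).mean()
    return loss_fm + lambda_bone * loss_bone + lambda_smooth * loss_smooth
```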
Sampling loop (Euler style)
```python
import torch

def sample_motion(hpstm_gen, fk_project, context, seq_len, n_joints, alphas=(0.1,) * 10):
    X = torch.randn(seq_len, n_joints, 3)   # start from pure Gaussian noise
    for alpha_k in alphas:                   # per-step Euler step sizes (10 uniform steps by default)
        v = hpstm_gen(X, context)            # predicted velocity toward the data manifold
        X = fk_project(X + alpha_k * v)      # Euler update, then FK projection onto valid skeletons
    return X
```
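As a usage sketch (the sequence length and joint count are made-up numbers):

```python
motion = sample_motion(hpstm_gen, fk_project, context, seq_len=120, n_joints=17)  # ≈2 s at 60 fps
```

Projecting through FK after every Euler step is what keeps the intermediate samples anatomically valid, mirroring the add-on terms in the loss.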
What we have not done (yet)
We haven’t run quantitative studies or hardware demos; this post is about the method and the motivation. π0 already proves the recipe works for robot actions—our goal is to bring the same generative power to the human-motion side of VLA so that:
- VLM ↔ HPSTM-Gen ↔ robot adapters form an end-to-end pipeline;
- researchers can sample diverse, anatomically correct motions without motion-capture labour.
Next steps
- Implement the flow-matching loss in the existing HPSTM repo (branch `hpstm-gen`).
- Prototype a 10-step sampler; visual sanity checks first, quantitative metrics later.
- Retarget sampled motions onto a 4-DoF uArm for a live “noise → motion” demo.
If you’re working on VLA, diffusion policy, or robot retargeting and want to join forces, feel free to open an issue or reach me at qifei@seas.upenn.edu.
Turning HPSTM into a sampler is less about chasing novelty and more about giving the VLA community a lightweight, anatomically aware generator that plays nicely with π0-style action experts. Stay tuned for code drops and early videos!