HPSTM-Gen extends the Human Pose Smoothing with Transformer and Manifold Model (HPSTM) into a generative model for human pose trajectories. Unlike traditional discriminative models that only denoise or smooth existing pose sequences, HPSTM-Gen learns the full distribution of physically plausible trajectories. This enables the model to sample new, diverse, and anatomically valid human motions conditioned on visual, language, or context cues.

Key features:


Method

1. Model Architecture

HPSTM-Gen retains the core architecture of the original HPSTM, including its Transformer sequence backbone and the forward-kinematics (FK) manifold layer that enforces skeletal constraints.
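One component worth making concrete is the FK layer used for manifold projection later in this README. The sketch below is illustrative only (the function name, signature, and the assumption that parent indices are topologically ordered are mine, not the repository's): it rebuilds joint positions from a root trajectory, per-bone directions, and fixed bone lengths, so bone-length constraints hold by construction.

```python
import numpy as np

def fk_layer(root, directions, parents, bone_lengths):
    """Rebuild joint positions from a root trajectory, bone directions,
    and fixed bone lengths, so bone lengths are correct by construction.

    root:         (T, 3) root joint trajectory.
    directions:   (T, J-1, 3) per-bone direction vectors (need not be unit).
    parents:      length-J parent indices, topologically ordered; parents[0] == -1.
    bone_lengths: (J-1,) fixed skeleton bone lengths.
    """
    T, J = root.shape[0], len(parents)
    joints = np.zeros((T, J, 3))
    joints[:, 0] = root
    for j in range(1, J):
        d = directions[:, j - 1]
        d = d / (np.linalg.norm(d, axis=-1, keepdims=True) + 1e-8)  # normalize
        joints[:, j] = joints[:, parents[j]] + bone_lengths[j - 1] * d
    return joints
```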

2. Generative Training with Flow Matching

Instead of only learning to denoise trajectories, HPSTM-Gen is trained to model the full data distribution using flow matching (or optionally score-based diffusion).

A. Noise Injection (Data Corruption)

For each training sample, draw a time $\tau \sim \mathcal{U}[0, 1]$ and Gaussian noise $\epsilon \sim \mathcal{N}(0, I)$ with the same shape as the clean sequence $\mathbf{X}_\text{data}$, then form the interpolated (corrupted) sequence

\[\mathbf{X}_\tau = (1 - \tau)\,\epsilon + \tau\,\mathbf{X}_\text{data}\]

so that $\tau = 0$ corresponds to pure noise and $\tau = 1$ to clean data.

B. Flow Matching Loss

The model learns to predict the “denoising direction”—the vector field that maps noisy samples back to the data manifold.

\[\mathcal{L}_\text{FM} = \mathbb{E}_{\mathbf{X}_\text{data}, \tau, \epsilon}\left[ \left\| f_\theta(\mathbf{X}_\tau, \text{context}) - (\mathbf{X}_\text{data} - \epsilon) \right\|^2 \right]\]
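In code, one training-loss evaluation might look like the following NumPy sketch. The `model` callable and its `(x_tau, tau, context)` signature are assumptions, not the repository's API; the linear interpolation path used here places pure noise at $\tau = 0$ and clean data at $\tau = 1$, so the regression target is the constant velocity of that path.

```python
import numpy as np

def flow_matching_loss(model, x_data, context, rng):
    """One flow-matching training loss evaluation.

    x_data: (B, T, J, 3) array of clean pose sequences.
    model:  callable (x_tau, tau, context) -> predicted velocity field.
    """
    b = x_data.shape[0]
    tau = rng.uniform(size=(b, 1, 1, 1))      # one time per sequence
    eps = rng.standard_normal(x_data.shape)   # Gaussian corruption noise
    x_tau = (1.0 - tau) * eps + tau * x_data  # linear interpolant
    target = x_data - eps                     # path velocity (noise -> data)
    pred = model(x_tau, tau, context)
    return np.mean((pred - target) ** 2)
```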

C. Auxiliary Losses

In addition to the flow-matching term, the objective below retains HPSTM-style physical-plausibility regularizers: a bone-length consistency loss $\mathcal{L}_\text{bone}$, velocity and acceleration smoothness penalties $\mathcal{L}_\text{vel}$ and $\mathcal{L}_\text{accel}$, and a negative log-likelihood term $\mathcal{L}_\text{NLL}$ when the model outputs per-joint uncertainty.
D. Full Training Objective

\[\mathcal{L} = \mathcal{L}_\text{FM} + \lambda_\text{bone} \mathcal{L}_\text{bone} + \lambda_\text{vel} \mathcal{L}_\text{vel} + \lambda_\text{accel} \mathcal{L}_\text{accel} + \lambda_\text{NLL} \mathcal{L}_\text{NLL}\]

All terms can be weighted based on task priorities.
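A minimal sketch of how the weighted objective could be assembled follows. The auxiliary-loss implementations shown (a bone-length penalty and finite-difference smoothness penalties) are illustrative assumptions, not the repository's code, and the NLL term is omitted.

```python
import numpy as np

def bone_length_loss(joints, parents, ref_lengths):
    """Penalize deviation of bone lengths from the reference skeleton.
    joints: (T, J, 3); parents: length-J parent indices (parents[0] == -1)."""
    bones = joints[:, 1:] - joints[:, [parents[j] for j in range(1, len(parents))]]
    lengths = np.linalg.norm(bones, axis=-1)  # (T, J-1) per-frame bone lengths
    return np.mean((lengths - ref_lengths) ** 2)

def smoothness_losses(joints):
    """First- and second-difference penalties on joint positions over time."""
    vel = np.diff(joints, n=1, axis=0)
    accel = np.diff(joints, n=2, axis=0)
    return np.mean(vel ** 2), np.mean(accel ** 2)

def total_loss(l_fm, joints, parents, ref_lengths,
               lam_bone=1.0, lam_vel=0.1, lam_accel=0.1):
    l_bone = bone_length_loss(joints, parents, ref_lengths)
    l_vel, l_accel = smoothness_losses(joints)
    return l_fm + lam_bone * l_bone + lam_vel * l_vel + lam_accel * l_accel
```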


3. Sampling (Inference) Procedure

To generate a new pose sequence:

  1. Initialize with a pure Gaussian noise sequence: $\mathbf{X}_0 \sim \mathcal{N}(0, I)$
  2. Iteratively Denoise: For a chosen number of steps, repeatedly input the current sequence into HPSTM-Gen (optionally with context), and update using the predicted denoising vector:

    \[\mathbf{X}_{k+1} = \mathbf{X}_k + \alpha_k f_\theta(\mathbf{X}_k, \text{context})\]

    where $\alpha_k$ is the step size at iteration $k$.

  3. Manifold Projection: After each step, pass the output through the FK layer to ensure bone-length and anatomical constraints.
  4. Obtain Final Output: The last sequence is a physically plausible, context-conditioned motion sampled from the learned distribution.

This process is analogous to diffusion/score-based generative models in vision, but tailored to structured, constrained motion data.
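The steps above can be sketched as a plain Euler integrator over the learned velocity field. Here `model` and `fk_project` are hypothetical stand-ins for the HPSTM-Gen network and the FK-layer manifold projection, and a fixed step size $\alpha_k = 1/N$ is assumed.

```python
import numpy as np

def sample_sequence(model, fk_project, shape, context=None, n_steps=50, rng=None):
    """Euler integration of the learned velocity field from noise to data.

    model:      callable (x, tau, context) -> velocity field, same shape as x.
    fk_project: callable x -> x, enforcing bone-length/anatomical constraints.
    shape:      (T, J, 3) output sequence shape.
    """
    if rng is None:
        rng = np.random.default_rng()
    x = rng.standard_normal(shape)  # start from pure Gaussian noise
    for k in range(n_steps):
        tau = k / n_steps            # current time in [0, 1)
        alpha = 1.0 / n_steps        # fixed Euler step size
        x = x + alpha * model(x, tau, context)
        x = fk_project(x)            # project back onto the pose manifold
    return x
```

With a straight-line interpolation path, the ideal velocity field transports noise to data along a straight line, so relatively few Euler steps can already land near the data manifold.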


4. Implementation Notes


Getting Started

  1. Requirements

  2. Dataset

  3. Training

  4. Sampling


Citation

If you use HPSTM-Gen in your research, please cite:

@article{your2025hpstmgen,
  title={HPSTM-Gen: Generative Human Pose Sequence Tracking via Flow Matching},
  author={Your Name and Collaborators},
  journal={arXiv preprint arXiv:XXXXX},
  year={2025}
}

Acknowledgments

This work extends the HPSTM model (link to your repo) and is inspired by π0, Diffusion Policy, and recent advances in generative robotics.


Contact

Questions or issues? Open an issue or email qifei@seas.upenn.edu