HPSTM-Gen extends the Human Pose Smoothing with Transformer and Manifold Model (HPSTM) into a generative model for human pose trajectories.
Unlike traditional discriminative models that only denoise or smooth existing pose sequences, HPSTM-Gen learns the full distribution of physically plausible trajectories. This enables the model to sample new, diverse, and anatomically valid human motions conditioned on vision, language, or other context cues.
Key features:
- Diffusion/Flow-matching based generative training
- Transformer backbone with global temporal attention
- Explicit kinematic constraints via differentiable forward kinematics (FK)
- Optional multi-modal context (vision/language/robot state)
- Physically plausible sampling and uncertainty estimation
Method
1. Model Architecture
HPSTM-Gen retains the core architecture of the original HPSTM:
- Input: A sequence of (possibly noisy) 3D joint positions, optionally with context (e.g., language/vision embeddings)
- Backbone: Encoder-decoder Transformer, capturing long-range temporal dependencies
- Output:
  - Refined/sampled 3D pose sequences
  - (Optional) Per-joint covariance estimates
- All outputs are mapped via a differentiable forward-kinematics layer, enforcing manifold (bone-length) constraints
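A minimal PyTorch sketch of this interface is shown below. The class names (`HPSTMGen`, `BoneLengthFK`), dimensions, and the simple bone-length re-scaling used in place of the full FK layer are illustrative assumptions, not the released implementation; positional encodings and attention masks are omitted for brevity.

```python
import torch
import torch.nn as nn


class BoneLengthFK(nn.Module):
    """Simplified stand-in for the differentiable FK layer: rescales each predicted
    bone vector to its canonical length along the kinematic tree.
    Assumes joints are ordered so that every parent precedes its children."""

    def __init__(self, parents, bone_lengths):
        super().__init__()
        self.parents = parents                                  # list[int], parents[0] == -1 (root)
        self.register_buffer("bone_lengths", bone_lengths)      # (J,) canonical bone lengths

    def forward(self, joints):                                  # joints: (B, T, J, 3)
        out = [joints[..., 0, :]]                               # keep the root joint as-is
        for j in range(1, joints.shape[-2]):
            parent = out[self.parents[j]]
            direction = joints[..., j, :] - parent
            direction = direction / (direction.norm(dim=-1, keepdim=True) + 1e-8)
            out.append(parent + self.bone_lengths[j] * direction)
        return torch.stack(out, dim=-2)


class HPSTMGen(nn.Module):
    """Encoder-decoder Transformer over pose sequences, followed by the
    bone-length projection above."""

    def __init__(self, num_joints, parents, bone_lengths,
                 d_model=256, nhead=8, num_layers=4, context_dim=512):
        super().__init__()
        self.embed = nn.Linear(num_joints * 3, d_model)         # per-frame pose token
        self.context_proj = nn.Linear(context_dim, d_model)     # vision/language/robot-state tokens
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=nhead,
            num_encoder_layers=num_layers, num_decoder_layers=num_layers,
            batch_first=True)
        self.head = nn.Linear(d_model, num_joints * 3)
        self.fk = BoneLengthFK(parents, bone_lengths)

    def forward(self, noisy_joints, context):
        # noisy_joints: (B, T, J, 3); context: (B, L, context_dim)
        B, T, J, _ = noisy_joints.shape
        tokens = self.embed(noisy_joints.reshape(B, T, J * 3))
        memory = self.context_proj(context)
        hidden = self.transformer(src=memory, tgt=tokens)       # global temporal attention
        pred = self.head(hidden).reshape(B, T, J, 3)
        return self.fk(pred)                                    # bone-length-constrained poses
```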
2. Generative Training with Flow Matching
Instead of only learning to denoise trajectories, HPSTM-Gen is trained to model the full data distribution using flow matching (or optionally score-based diffusion).
A. Noise Injection (Data Corruption)
For each training sample:
- Sample a noise level $\tau \sim \mathcal{U}(0, 1)$ and Gaussian noise $\epsilon \sim \mathcal{N}(0, I)$
- Form the noisy trajectory by linear interpolation: $\mathbf{X}_\tau = (1 - \tau)\,\mathbf{X}_\text{data} + \tau\,\epsilon$
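A minimal sketch of this corruption step, assuming the straight-line interpolation implied by the flow-matching loss below (the function name and tensor shapes are illustrative):

```python
import torch

def corrupt_trajectory(x_data):
    """Interpolate a clean pose sequence toward Gaussian noise.
    x_data: (B, T, J, 3) clean joint positions."""
    tau = torch.rand(x_data.shape[0], 1, 1, 1, device=x_data.device)  # per-sample noise level in [0, 1)
    eps = torch.randn_like(x_data)                                    # Gaussian noise, same shape
    x_tau = (1.0 - tau) * x_data + tau * eps                          # straight-line interpolation
    return x_tau, tau, eps
```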
B. Flow Matching Loss
The model learns to predict the “denoising direction”: the vector field that carries noisy samples back toward the data manifold.
\[\mathcal{L}_\text{FM} = \mathbb{E}_{\mathbf{X}_\text{data}, \tau, \epsilon}\left[ \left\| f_\theta(\mathbf{X}_\tau, \text{context}) - (\mathbf{X}_\text{data} - \epsilon) \right\|^2 \right]\]
- $f_\theta$: HPSTM-Gen’s prediction of the denoising vector, given the noisy input
- The output is passed through the FK layer to enforce anatomical plausibility
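A sketch of this loss, building on the `corrupt_trajectory` and `HPSTMGen` sketches above. Deriving the velocity from the FK-constrained pose estimate is one way to honor the FK constraint while matching the loss formula; this parameterization is an assumption here, not necessarily the released design.

```python
import torch.nn.functional as F

def flow_matching_loss(model, x_data, context):
    """Flow-matching term for one batch. The model's FK-constrained pose
    estimate is converted into the denoising velocity it implies."""
    x_tau, tau, eps = corrupt_trajectory(x_data)          # corruption sketch above
    pred_pose = model(x_tau, context)                     # FK-projected clean-pose estimate
    v_pred = (pred_pose - x_tau) / tau.clamp(min=1e-3)    # implied denoising velocity
    v_target = x_data - eps                               # velocity of the straight path toward the data
    return F.mse_loss(v_pred, v_target)
```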
C. Auxiliary Losses (Optional but recommended)
- Bone Length Consistency: Penalize deviation from canonical bone lengths
- Smoothness: Temporal velocity and acceleration losses for physically realistic motion
- Negative Log-Likelihood (NLL): If predicting covariance, add NLL of predicted distributions
D. Full Training Objective
\[\mathcal{L} = \mathcal{L}_\text{FM} + \lambda_\text{bone} \mathcal{L}_\text{bone} + \lambda_\text{vel} \mathcal{L}_\text{vel} + \lambda_\text{accel} \mathcal{L}_\text{accel} + \lambda_\text{NLL} \mathcal{L}_\text{NLL}\]
All terms can be weighted based on task priorities.
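A sketch of how the auxiliary terms and the weighted objective could look; the helper names and weights are placeholders to tune, not values prescribed by the method.

```python
def bone_length_loss(pred_joints, parents, canonical_lengths):
    """Mean squared deviation of predicted bone lengths from canonical lengths.
    pred_joints: (B, T, J, 3); parents: list[int] with parents[0] == -1."""
    child = pred_joints[..., 1:, :]
    parent = pred_joints[..., parents[1:], :]
    lengths = (child - parent).norm(dim=-1)               # (B, T, J-1)
    return ((lengths - canonical_lengths[1:]) ** 2).mean()

def smoothness_losses(pred_joints):
    """Finite-difference velocity and acceleration penalties."""
    vel = pred_joints[:, 1:] - pred_joints[:, :-1]
    accel = vel[:, 1:] - vel[:, :-1]
    return (vel ** 2).mean(), (accel ** 2).mean()

def total_loss(l_fm, l_bone, l_vel, l_accel, l_nll=0.0,
               w_bone=0.1, w_vel=0.01, w_accel=0.01, w_nll=1.0):
    """Weighted sum of all terms; the weights here are illustrative defaults."""
    return l_fm + w_bone * l_bone + w_vel * l_vel + w_accel * l_accel + w_nll * l_nll
```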
3. Sampling (Inference) Procedure
To generate a new pose sequence:
- Initialize with a pure Gaussian noise sequence: $\mathbf{X}_0 \sim \mathcal{N}(0, I)$
- Iteratively Denoise: For a chosen number of steps, repeatedly feed the current sequence into HPSTM-Gen (optionally with context) and update it using the predicted denoising vector:
  \[\mathbf{X}_{k+1} = \mathbf{X}_k + \alpha_k f_\theta(\mathbf{X}_k, \text{context})\]
  where $\alpha_k$ is the step size at iteration $k$
- Manifold Projection: After each step, pass the output through the FK layer to ensure bone-length and anatomical constraints.
- Obtain Final Output: The last sequence is a physically plausible, context-conditioned motion sampled from the learned distribution.
This process is analogous to diffusion/score-based generative models in vision, but tailored to structured, constrained motion data.
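A sketch of this sampling loop under the same assumptions as the training sketches above (the model's FK-constrained pose estimate converted into a velocity, and the bone-length re-scaling standing in for the FK projection); names and the uniform step schedule are illustrative.

```python
import torch

@torch.no_grad()
def sample_trajectory(model, context, seq_len, num_joints, num_steps=50, device="cpu"):
    """Integrate the learned flow from pure Gaussian noise to a pose sequence."""
    x = torch.randn(1, seq_len, num_joints, 3, device=device)      # X_0 ~ N(0, I)
    taus = torch.linspace(1.0, 0.0, num_steps + 1, device=device)  # noise level from 1 down to 0
    for k in range(num_steps):
        alpha_k = (taus[k] - taus[k + 1]).item()                   # step size
        pred_pose = model(x, context)                              # FK-constrained pose estimate
        v = (pred_pose - x) / max(taus[k].item(), 1e-3)            # predicted denoising velocity
        x = x + alpha_k * v                                        # move toward the data manifold
        x = model.fk(x)                                            # manifold (bone-length) projection
    return x
```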
4. Implementation Notes
- You can leverage existing diffusion model frameworks (e.g., DiffusionPolicy, π0, Score-based Models) and plug in the HPSTM backbone.
- The FK layer and bone-length losses are essential for anatomical validity—do not skip them!
- Sampling step sizes, number of iterations, and noise schedule can be tuned based on downstream requirements.
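As one illustration of a tunable schedule, cosine-spaced noise levels yield smaller steps near both the noise and data ends of the flow; this particular choice is an assumption, not part of the method.

```python
import math
import torch

def cosine_step_schedule(num_steps):
    """Cosine-spaced noise levels from 1 to 0; returns positive step sizes
    alpha_k that sum to 1, smallest near the ends of the flow."""
    theta = torch.linspace(0.0, math.pi, num_steps + 1)
    taus = 0.5 * (1.0 + torch.cos(theta))   # 1 -> 0
    return taus[:-1] - taus[1:]
```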
Getting Started
- Requirements
  - PyTorch >= 1.10
  - numpy, tqdm, etc.
  - (Optional) Visualization toolkit (e.g., matplotlib, open3d)
- Dataset
  - Human pose sequences (e.g., AMASS, 3DPW, or your own motion capture)
  - (Optional) Context data: language instructions, video/image embeddings, etc.
- Training
  - Use `train.py` (example to be provided)
  - Specify the model config (Transformer depth/width, FK parameters, loss weights)
  - Train with both clean and noise-corrupted data as described above
- Sampling
  - Use `sample.py` to generate new motions given a context
  - Visualize and/or retarget to a robot arm via your pose-mapping module
Citation
If you use HPSTM-Gen in your research, please cite:
@article{your2025hpstmgen,
title={HPSTM-Gen: Generative Human Pose Sequence Tracking via Flow Matching},
author={Your Name and Collaborators},
journal={arXiv preprint arXiv:XXXXX},
year={2025}
}
Acknowledgments
This work extends the HPSTM model (link to your repo)
Inspired by π0, Diffusion Policy, and recent advances in generative robotics.
Questions or issues?
Open an issue or email qifei@seas.upenn.edu