Camera Pose Expressions for HPSTM-Skeleton-NeRF Integration

The skeleton serves as an explicit warp between the canonical space and every posed frame. The render-and-compare loss feeds back dense gradients that refine both the NeRF and the HPSTM joint angles simultaneously.

1. Coordinate Conventions

| Symbol | Meaning |
| --- | --- |
| $\mathbf{K} \in \mathbb{R}^{3\times3}$ | Intrinsic matrix, $\mathbf{K}=\begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}$ |
| $\mathbf{T}_{w\to c}$ | World→camera homogeneous transform ($4 \times 4$) |
| $\mathbf{T}_{c\to w}=\mathbf{T}_{w\to c}^{-1}$ | Camera→world transform |
| $\mathbf{R}_{w\to c},\; \mathbf{t}_{w\to c}$ | Rotation and translation blocks of the extrinsic matrix |
| $\mathbf{C}$ | Camera center in world coordinates, $\mathbf{C} = -\mathbf{R}_{w\to c}^{\top}\mathbf{t}_{w\to c}$ |
| $W(\boldsymbol{\theta},\mathbf{T})$ | HPSTM linear blend skinning (LBS) warp from the canonical body frame to the posed world frame (inverted in §4) |

All transforms act on column vectors from the left, i.e. $\mathbf{y} = \mathbf{M}\mathbf{x}$.

2. Multi-View Calibrated Setting

2.1 Extrinsic Matrix

\[\mathbf{T}_{w\to c} = \begin{bmatrix} \mathbf{R}_{w\to c} & \mathbf{t}_{w\to c} \\ \mathbf{0}^{\mathsf T} & 1 \end{bmatrix}\]
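
Because the bottom row is $(\mathbf{0}^{\mathsf T}, 1)$, the inverse has a closed form; expanding it is also the quickest way to see where the camera center $\mathbf{C}$ in §1 comes from:

\[\mathbf{T}_{c\to w} = \mathbf{T}_{w\to c}^{-1} = \begin{bmatrix} \mathbf{R}_{w\to c}^{\top} & -\mathbf{R}_{w\to c}^{\top}\mathbf{t}_{w\to c} \\ \mathbf{0}^{\mathsf T} & 1 \end{bmatrix}\]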

2.2 Pixel-to-Ray Formula

Given a pixel $(u,v)$:

\[\mathbf{p}_h = \begin{bmatrix} u \\ v \\ 1 \end{bmatrix}, \quad \mathbf{d}_{\text{cam}} = \mathbf{K}^{-1}\mathbf{p}_h\]

Transform to world coordinates:

\[\mathbf{d}_{\text{world}} = \mathbf{R}_{w\to c}^{\top}\,\mathbf{d}_{\text{cam}},\quad \hat{\mathbf{d}}_{\text{world}} = \frac{\mathbf{d}_{\text{world}}}{\lVert\mathbf{d}_{\text{world}}\rVert_2}\]

Camera origin:

\[\mathbf{o}_{\text{world}} = \mathbf{C} = -\mathbf{R}_{w\to c}^{\top}\,\mathbf{t}_{w\to c}\]

Ray equation:

\[\mathbf{r}(t) = \mathbf{o}_{\text{world}} + t\,\hat{\mathbf{d}}_{\text{world}},\quad t\in[t_{\min},t_{\max}]\]
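
In practice the per-pixel formulas above are vectorized over the whole image. A minimal PyTorch sketch of that batching; the function name `image_rays` and the row-vector layout are my own illustrative choices, not HPSTM code:

```python
import torch

def image_rays(K, R_wc, t_wc, H, W):
    """Build one world-space ray per pixel of an H x W image."""
    v, u = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                          torch.arange(W, dtype=torch.float32), indexing="ij")
    p_h = torch.stack([u, v, torch.ones_like(u)], dim=-1)   # (H, W, 3) homogeneous pixels
    d_cam = p_h @ torch.linalg.inv(K).T                     # row form of K^{-1} p
    d_world = d_cam @ R_wc                                  # row form of R^T d_cam
    d_world = d_world / d_world.norm(dim=-1, keepdim=True)  # unit directions
    o_world = -R_wc.T @ t_wc                                # camera center, shared by all rays
    return o_world.expand_as(d_world), d_world              # origins and directions, (H, W, 3)
```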

3. Monocular SLAM Setting

3.1 Essential Matrix Decomposition

Extract relative rotation and translation between two frames:

\[\mathbf{E} = \mathbf{U}\operatorname{diag}(1,1,0)\mathbf{V}^{\top}, \quad \mathbf{R} = \mathbf{U}\mathbf{W}\mathbf{V}^{\top} \ \text{or} \ \mathbf{U}\mathbf{W}^{\top}\mathbf{V}^{\top}, \quad \mathbf{t} = \pm\,\mathbf{u}_3, \quad \mathbf{W} = \begin{bmatrix} 0 & -1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \end{bmatrix}\]

where $\mathbf{u}_3$ is the last column of $\mathbf{U}$. Of the four candidate pairs $(\mathbf{R},\mathbf{t})$, select the unique one that places the reconstructed points in front of both cameras (the cheirality check).
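
The four candidates follow directly from the SVD. A NumPy sketch (the helper name `decompose_essential` is mine; the depth test itself, e.g. via triangulation or `cv2.recoverPose`, is omitted):

```python
import numpy as np

def decompose_essential(E):
    """Return the four candidate (R, t) pairs of an essential matrix."""
    U, _, Vt = np.linalg.svd(E)
    # Force proper rotations: both orthogonal factors must have det = +1.
    if np.linalg.det(U) < 0:
        U = -U
    if np.linalg.det(Vt) < 0:
        Vt = -Vt
    W = np.array([[0., -1., 0.],
                  [1.,  0., 0.],
                  [0.,  0., 1.]])
    t = U[:, 2]                       # last column of U, defined only up to sign
    return [(U @ W @ Vt, t), (U @ W @ Vt, -t),
            (U @ W.T @ Vt, t), (U @ W.T @ Vt, -t)]
```

Triangulating a single correspondence per candidate and keeping the pair with positive depth in both views resolves the ambiguity.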

3.2 Global Bundle Adjustment

Optimize over all cameras $i$ and 3-D points $j$:

\[\min_{\{\mathbf{R}_i,\mathbf{t}_i,\mathbf{X}_j\}} \sum_{(i,j)\in\mathcal{O}}\rho\bigl(\lVert\pi(\mathbf{R}_i,\mathbf{t}_i,\mathbf{X}_j) - \mathbf{x}_{ij}\rVert_2\bigr),\]

where $\pi(\cdot)$ is the perspective projection, $\rho$ is a robust loss, and $\mathcal{O}$ is the set of observed (camera, point) pairs.
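
A toy version of this objective fits in a few lines with `scipy.optimize.least_squares`, whose `loss="huber"` option plays the role of $\rho$; the axis-angle parameterization and variable layout below are my own illustrative choices (production pipelines use dedicated solvers such as Ceres or COLMAP):

```python
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation

def residuals(params, n_cams, n_pts, K, obs):
    """obs: list of (cam_index, point_index, observed_xy) tuples."""
    cams = params[:6 * n_cams].reshape(n_cams, 6)   # axis-angle (3) + translation (3)
    X = params[6 * n_cams:].reshape(n_pts, 3)
    res = []
    for i, j, xy in obs:
        Xc = Rotation.from_rotvec(cams[i, :3]).apply(X[j]) + cams[i, 3:]
        uv = (K @ Xc)[:2] / Xc[2]                   # perspective projection pi(.)
        res.append(uv - xy)
    return np.concatenate(res)

# result = least_squares(residuals, x0, loss="huber",
#                        args=(n_cams, n_pts, K, obs))
```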

3.3 Scale & Alignment Refinement inside NeRF

Introduce a global rigid aligner $\mathbf{A}\in SE(3)$ and a scalar scale $s>0$; together they act as a $\mathrm{Sim}(3)$ correction:

\[\tilde{\mathbf{T}}_{w\to c,i} = \mathbf{A}\, \begin{bmatrix} \mathbf{R}_i & s\,\mathbf{t}_i \\ \mathbf{0}^{\mathsf T} & 1 \end{bmatrix}\]

Both $\mathbf{A}$ and $s$ are jointly optimized with the NeRF (see the sketch after this list) under additional losses:

  • Bone-length consistency
  • (Optional) Depth consistency between NeRF and an HPSTM mesh Z-buffer
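
A minimal torch sketch of composing the corrected extrinsic $\tilde{\mathbf{T}}_{w\to c,i}$; the function name and the log-scale parameterization are illustrative assumptions, not HPSTM code:

```python
import torch

def corrected_extrinsic(A, R_i, t_i, log_s):
    """Compose A @ [R_i, s t_i; 0 1] with a learnable log-scale."""
    s = torch.exp(log_s)                  # optimize log s so that s stays positive
    T = torch.eye(4, dtype=R_i.dtype)
    T[:3, :3] = R_i
    T[:3, 3] = s * t_i                    # scale only the SLAM translation
    return A @ T

# A (4x4) and log_s are registered as learnable parameters next to the NeRF weights.
```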

4. Canonical Body Transformation

For each sample point on the ray:

\[\mathbf{x}_{\text{canonical}} = \mathbf{W}(\boldsymbol{\theta}_f,\mathbf{T}_f)^{-1} \begin{bmatrix} \mathbf{o}_{\text{world}} + t\,\hat{\mathbf{d}}_{\text{world}} \\ 1 \end{bmatrix}\]

The resulting canonical 3-D coordinates (the first three components of the homogeneous result) are fed into the NeRF (or 3D Gaussian) model.
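
Concretely, the blended LBS transform and its inverse are applied per sample point. A minimal sketch, assuming per-point skinning weights are already available (in practice they are looked up from the nearest canonical correspondence; `lbs_world_to_canonical` is my name, not HPSTM's):

```python
import torch

def lbs_world_to_canonical(x_world, bone_T, weights):
    """Warp one world-space sample into the canonical body frame.

    bone_T:  (J, 4, 4) canonical->posed bone transforms built from (theta_f, T_f)
    weights: (J,) skinning weights of the sample point
    """
    W_pt = torch.einsum("j,jab->ab", weights, bone_T)   # blended 4x4 LBS transform
    x_h = torch.cat([x_world, x_world.new_ones(1)])     # homogeneous coordinates
    return (torch.linalg.inv(W_pt) @ x_h)[:3]           # drop the homogeneous 1
```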

5. Implementation Summary (Pseudo-Code)

import torch

# Inputs: intrinsics K (3,3), extrinsics R_wc (3,3), t_wc (3,), pixel (u, v)
p_h = torch.tensor([u, v, 1.0])                    # homogeneous pixel
d_cam = torch.linalg.solve(K, p_h)                 # K^{-1} p without forming the inverse
d_world = R_wc.T @ d_cam                           # rotate direction into the world frame
d_world = d_world / d_world.norm()                 # unit-length ray direction
o_world = -R_wc.T @ t_wc                           # camera center C in world coordinates

# Optional: warp the sample at ray depth t into the canonical body frame;
# W_inv(theta_f, T_f) returns the inverse 4x4 LBS transform of Section 4.
x_world_h = torch.cat([o_world + t * d_world, torch.tensor([1.0])])
x_canon = W_inv(theta_f, T_f) @ x_world_h

6. Symbol Glossary

| Symbol | Description |
| --- | --- |
| $u,v$ | Pixel coordinates (origin at top-left) |
| $f_x, f_y$ | Focal lengths in pixels |
| $c_x, c_y$ | Principal-point offsets in pixels |
| $t$ | NeRF ray-marching parameter |
| $\boldsymbol{\theta}_f,\mathbf{T}_f$ | HPSTM joint rotations & global root transform for frame $f$ |
| $W(\cdot)$ | Linear blend skinning warp (canonical→posed) |
| $\rho$ | Robust loss (e.g., Huber) |