Camera Pose Expressions for HPSTM-Skeleton-NeRF Integration
Published:
Skeleton serves as an explicit warp between canonical space and every posed frame. The render-and-compare loss feeds back dense gradients that refine both the NeRF and the HPSTM joint angles simultaneously.
1. Coordinate Conventions
Symbol | Meaning |
---|---|
$K \in \mathbb{R}^{3\times3}$ | Intrinsic matrix, $\mathbf{K}=\begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}$ |
$T_{w\to c}$ | World→Camera homogeneous transform $(4 \times 4)$ |
$T_{c\to w}=T_{w\to c}^{-1}$ | Camera→World transform |
$R_{w\to c},\; t_{w\to c}$ | Rotation and translation components of the extrinsic matrix |
$C$ | Camera center in world coordinates, $\mathbf{C} = -\mathbf{R}_{w\to c}^{\top}\mathbf{t}_{w\to c}$ |
$W(\theta,T)$ | HPSTM linear blend skinning (LBS) mapping world→canonical body frame |
All matrices are applied by right-multiplying column vectors.
2. Multi-View Calibrated Setting
2.1 Extrinsic Matrix
\[\mathbf{T}_{w\to c} = \begin{bmatrix} \mathbf{R}_{w\to c} & \mathbf{t}_{w\to c} \\ \mathbf{0}^{\mathsf T} & 1 \end{bmatrix}\]2.2 Pixel-to-Ray Formula
Given a pixel $(u,v)$:
\[\mathbf{p}_h = \begin{bmatrix} u \\ v \\ 1 \end{bmatrix}, \quad \mathbf{d}_{\text{cam}} = \mathbf{K}^{-1}\mathbf{p}_h\]Transform to world coordinates:
\[\mathbf{d}_{\text{world}} = \mathbf{R}_{w\to c}^{\top}\,\mathbf{d}_{\text{cam}},\quad \hat{\mathbf{d}}_{\text{world}} = \frac{\mathbf{d}_{\text{world}}}{\vert\mathbf{d}_{\text{world}}\vert_2}\]Camera origin:
\[\mathbf{o}_{\text{world}} = \mathbf{C} = -\mathbf{R}_{w\to c}^{\top}\,\mathbf{t}_{w\to c}\]Ray equation:
\[\mathbf{r}(t) = \mathbf{o}_{\text{world}} + t\,\hat{\mathbf{d}}_{\text{world}},\quad t\in[t_{\min},t_{\max}]\]3. Monocular SLAM Setting
3.1 Essential Matrix Decomposition
Extract relative rotation and translation between two frames:
\[\mathbf{E} = \mathbf{U}\operatorname{diag}(1,1,0)\mathbf{V}^{\top}, \quad \mathbf{R} = \mathbf{U}\mathbf{W}\mathbf{V}^{\top}, \quad \mathbf{t} = \mathbf{U}_{:,2}\]Select the unique $(\mathbf{R},\mathbf{t})$ that places reconstructed points in front of both cameras.
3.2 Global Bundle Adjustment
Optimize over all cameras $i$ and 3-D points $j$:
\[\min_{\{\mathbf{R}_i,\mathbf{t}_i,\mathbf{X}_j\}} \sum_{i,j}\rho\bigl(\pi(\mathbf{R}_i,\mathbf{t}_i,\mathbf{X}_j) - \mathbf{x}_{ij}\bigr),\]where $\pi(\cdot)$ is the perspective projection and $\rho$ is a robust loss.
3.3 Scale & Alignment Refinement inside NeRF
Introduce a global similarity aligner $\mathbf{A}\in SE(3)$ and a scalar scale $s$:
\[\tilde{\mathbf{T}}_{w\to c,i} = \mathbf{A}\, \begin{bmatrix} \mathbf{R}_i & s\,\mathbf{t}_i \\ \mathbf{0}^{\mathsf T} & 1 \end{bmatrix}\]Both $\mathbf{A}$ and $s$ are jointly optimized with NeRF under additional losses:
- Bone-length consistency
- (Optional) Depth consistency between NeRF and an HPSTM mesh Z-buffer
4. Canonical Body Transformation
For each sample point on the ray:
\[\mathbf{x}_{\text{canonical}} = \mathbf{W}(\boldsymbol{\theta}_f,\mathbf{T}_f)^{-1} \begin{bmatrix} \mathbf{o}_{\text{world}} + t\,\hat{\mathbf{d}}_{\text{world}} \\ 1 \end{bmatrix}\]The resulting 3-D coordinates are fed into the NeRF (or 3D Gaussian) MLP.
5. Implementation Summary (Pseudo-Code)
# Inputs: intrinsics K, extrinsics R_wc, t_wc, pixel (u, v)
p_h = torch.tensor([u, v, 1.0])
d_cam = torch.linalg.solve(K, p_h) # K^{-1} p
d_world = R_wc.T @ d_cam
d_world = d_world / d_world.norm()
o_world = -R_wc.T @ t_wc
# Optional: transform to canonical body frame
x_canon = W_inv(theta_f, T_f) @ torch.cat([o_world + t * d_world, torch.tensor([1.0])])
6. Symbol Glossary
Symbol | Description |
---|---|
$u,v$ | Pixel coordinates (origin at top-left) |
$f_x, f_y$ | Focal lengths in pixels |
$c_x, c_y$ | Principal-point offsets |
$t$ | NeRF ray-marching parameter |
$\boldsymbol{\theta}_f,\mathbf{T}_f$ | HPSTM joint rotations & global root for frame f |
$W(\cdot)$ | Linear blend skinning warp |
$\rho$ | Robust loss (e.g., Huber) |