2026 - Thesis - 4D MapAnything

Feed-forward 4D Reconstruction





1. Metric-Scale 3D Reconstruction

Definition

  • “Metric-scale” means that the reconstructed 3D scene is expressed in real-world physical units (e.g., meters) rather than up-to-scale or normalized units.

  • An up-to-scale reconstruction recovers only the shape of the structure, not its absolute size.
    Example: a room reconstructed as 3 m wide or 30 m wide looks identical up to a uniform scaling.
  • Metric-scale reconstruction estimates a global scale parameter that converts the up-to-scale 3D structure into real-world dimensions.
  • In MapAnything, a global metric scale $m$ is predicted (via a dedicated scale token) such that \(X_i^{\text{metric}} = m \cdot \tilde{X}_i\), where $\tilde{X}_i$ is the up-to-scale reconstruction.
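
A minimal sketch of this factorization (tensor shapes and variable names are illustrative, not MapAnything's actual interface):

```python
import torch

# Illustrative only: the network predicts an up-to-scale point map and a single
# positive scale; multiplying the two recovers metric (real-world) coordinates.
X_tilde = torch.randn(2, 480, 640, 3)   # up-to-scale points, (views, H, W, xyz)
log_m = torch.tensor(0.9)               # scale predicted in log-space to keep m > 0
m = torch.exp(log_m)                    # global metric scale factor
X_metric = m * X_tilde                  # X_i^metric = m * X_i~, now in meters
```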

2. Pipeline

Objective

  • To perform feed-forward, metric-scale 4D reconstruction of dynamic scenes using a Time-Varying Generalized Camera model.

Concept

  • MapAnything models static multi-view geometry with a generalized camera, i.e., a fixed set of light rays.
  • This thesis extends it to a Time-Varying Generalized Camera, whose ray set evolves over time:

\(\mathcal{C}(t) = \{ (p_i(t), d_i(t)) \}_i\), where each pixel $i$ corresponds to a ray with a time-dependent origin $p_i(t)$ and direction $d_i(t)$.
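
A minimal sketch of the data structure this implies (field names and shapes are assumptions for illustration, not the model's actual interface):

```python
from dataclasses import dataclass
import torch

@dataclass
class TimeVaryingGeneralizedCamera:
    """One ray per pixel, with time-dependent origin and direction."""
    origins: torch.Tensor      # (T, N, 3) ray origins p_i(t)
    directions: torch.Tensor   # (T, N, 3) unit ray directions d_i(t)
    times: torch.Tensor        # (T,) timestamps

    def points(self, depth: torch.Tensor) -> torch.Tensor:
        """Back-project per-ray depths (T, N) to points X_i(t) = p_i(t) + depth * d_i(t)."""
        return self.origins + depth.unsqueeze(-1) * self.directions
```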


3. Input Design

| Input Type | Description | Example |
| --- | --- | --- |
| Image sequence $I_t$ | Temporal image frames or asynchronous event accumulations | RGB / event frames |
| Geometric priors | Extrinsics, intrinsics, sparse depth, IMU | VICON, COLMAP, SLAM |
| Time label $t$ | Frame or event timestamp | μs or ms |
| Optional motion prior | Scene flow or optical flow initialization | RAFT3D, DynamicStereo |

4. Model Architecture

Encoder

  • Vision Transformer backbone (e.g., DINOv2 / ViT-L)
  • Temporal Positional Encoding (TPE): \(\text{TPE}_k(t) = \sin(\omega_k t + \phi_k)\), evaluated over a bank of frequencies $\omega_k$ (see the sketch after this list)
  • Token = image patch + geometric features + time embedding
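
A minimal sketch of the time embedding and token assembly described above. The hidden dimension, the 7-dimensional geometric-prior vector, and summation (rather than concatenation) are illustrative assumptions; using both sin and cos absorbs the phase $\phi_k$.

```python
import torch
import torch.nn as nn

def temporal_positional_encoding(t: torch.Tensor, dim: int) -> torch.Tensor:
    """Sinusoidal time embedding over a log-spaced frequency bank (`dim` assumed even)."""
    half = dim // 2
    freqs = torch.exp(torch.linspace(0.0, -8.0, half, device=t.device))  # omega_k
    angles = t[..., None] * freqs                                        # (..., half)
    return torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)     # (..., dim)

class TokenBuilder(nn.Module):
    """Token = ViT patch feature + projected geometric priors + time embedding."""
    def __init__(self, dim: int = 1024, geom_dim: int = 7):
        super().__init__()
        # geom_dim = 7 is illustrative, e.g. ray direction (3) + sparse depth (1) + pose prior (3)
        self.geom_proj = nn.Linear(geom_dim, dim)

    def forward(self, patch_feats: torch.Tensor, geom_feats: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        # patch_feats: (N, dim) backbone features, geom_feats: (N, geom_dim), t: scalar timestamp
        return patch_feats + self.geom_proj(geom_feats) + temporal_positional_encoding(t, patch_feats.shape[-1])
```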

Transformer Core

  • Based on MapAnything’s Alternating-Attention Transformer
  • Extended to cross-attention over (views × time)
  • Introduce a motion-aware attention block that models ray-origin motion $\partial p_i / \partial t$
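
A minimal sketch of one alternating block extended over (views × time), assuming tokens are built as above; the motion-aware attention term is omitted for brevity and all sizes are illustrative.

```python
import torch
import torch.nn as nn

class SpatioTemporalAlternatingBlock(nn.Module):
    """Alternates attention within each frame and globally across (views x time)."""
    def __init__(self, dim: int = 1024, heads: int = 16):
        super().__init__()
        self.frame_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.global_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.n1, self.n2, self.n3 = nn.LayerNorm(dim), nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (V*T, N, D) tokens for V views at T timestamps, N patch tokens each
        VT, N, D = x.shape
        h = self.n1(x)
        x = x + self.frame_attn(h, h, h)[0]                       # attention within each frame
        g = self.n2(x).reshape(1, VT * N, D)                      # every token sees all views and times
        x = x + self.global_attn(g, g, g)[0].reshape(VT, N, D)
        return x + self.mlp(self.n3(x))
```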

Decoder Heads

  • Ray directions: $r_i(t)$
  • Depths along rays: $D_i(t)$
  • Camera poses: $P_i(t) = [R_i(t) \mid T_i(t)]$
  • Global scale: $m(t)$
  • Scene flow: $F_i(t) = X_i(t+\Delta t) - X_i(t)$
  • Temporal clustering: cluster latent features by motion patterns
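
A minimal sketch of these heads as plain linear layers (illustrative only; MapAnything-style DPT heads and a proper pose parameterization would be used in practice):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OutputHeads(nn.Module):
    """Per-token and per-frame prediction heads for the quantities listed above."""
    def __init__(self, dim: int = 1024, n_clusters: int = 16):
        super().__init__()
        self.ray = nn.Linear(dim, 3)               # ray directions r_i(t)
        self.depth = nn.Linear(dim, 1)             # depth along each ray D_i(t)
        self.flow = nn.Linear(dim, 3)              # scene flow F_i(t)
        self.cluster = nn.Linear(dim, n_clusters)  # motion-cluster logits C_i(t)
        self.pose = nn.Linear(dim, 7)              # pose as quaternion (4) + translation (3)
        self.scale = nn.Linear(dim, 1)             # global log-scale m(t)

    def forward(self, tokens: torch.Tensor, frame_token: torch.Tensor) -> dict:
        # tokens: (B, N, D) patch tokens; frame_token: (B, D) pooled per-frame token
        return {
            "rays": F.normalize(self.ray(tokens), dim=-1),
            "depth": F.softplus(self.depth(tokens)),
            "flow": self.flow(tokens),
            "clusters": self.cluster(tokens),
            "pose": self.pose(frame_token),
            "log_scale": self.scale(frame_token),
        }
```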

5. Loss Functions

| Loss | Meaning | Expression |
| --- | --- | --- |
| $L_{geom}$ | Geometric consistency (RDP structure) | As in MapAnything |
| $L_{metric}$ | Metric scale consistency | $\lvert \log m_{\text{pred}} - \log m_{\text{gt}} \rvert$ |
| $L_{flow}$ | Temporal scene-flow consistency | $\lVert X_t + F_t - X_{t+\Delta t} \rVert$ |
| $L_{cluster}$ | Motion clustering | Contrastive or self-distillation |
| $L_{smooth}$ | Temporal smoothness | $\lVert X_{t+1} - 2X_t + X_{t-1} \rVert$ |
| $L_{mask}$ | Dynamic mask supervision | BCE or uncertainty weighting |
  • Adaptive Robust Loss is used to weight all residuals (see Section 8).
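
A minimal sketch of the two temporal terms from the table above (per-point residuals; in this outline each residual would then be passed through the Adaptive Robust Loss of Section 8 before averaging):

```python
import torch

def temporal_losses(X_t, X_next, X_prev, F_t):
    """Scene-flow consistency and temporal-smoothness residuals.

    X_t, X_next, X_prev: (N, 3) metric points at times t, t+dt, t-dt
    F_t:                 (N, 3) predicted scene flow at time t
    """
    l_flow = (X_t + F_t - X_next).norm(dim=-1)            # ||X_t + F_t - X_{t+dt}||
    l_smooth = (X_next - 2 * X_t + X_prev).norm(dim=-1)   # second temporal difference
    return l_flow.mean(), l_smooth.mean()
```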

6. Output

  • The model outputs: \(\mathcal{O} = \{ X_i(t), P_i(t), F_i(t), m(t), C_i(t) \}\)

Where:

  • $X_i(t)$: metric 3D points
  • $P_i(t)$: camera poses
  • $F_i(t)$: scene flow
  • $m(t)$: metric scale
  • $C_i(t)$: motion clusters

7. Training Strategy

| Stage | Goal | Data |
| --- | --- | --- |
| 1. Static pretraining | Learn static geometry and scale | MapAnything datasets |
| 2. Temporal alignment | Temporal consistency learning | Dynamic Replica / TartanAirV2 |
| 3. Spatio-temporal fine-tuning | Train flow and clustering heads | Synthetic dynamic datasets |
| 4. Self-supervised fine-tuning | Real-data adaptation | Photometric + geometric consistency |

8. Adaptive Robust Loss

Core Idea

  • Adaptive Robust Loss is a general parametric loss family that unifies and generalizes $L_2$, $L_1$, Cauchy, Geman–McClure, and other robust losses under a single formulation.

General form

\[L(x; \alpha, c) = \frac{|\alpha - 2|}{\alpha} \left( \left( \frac{(x/c)^2}{|\alpha - 2|} + 1 \right)^{\alpha/2} - 1 \right)\]

where:

  • $\alpha$: shape parameter controlling robustness
  • $c$: scale parameter controlling residual normalization
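
A minimal PyTorch sketch of this general form, with the removable singularities at $\alpha = 0$ and $\alpha = 2$ handled explicitly (the function name and numerical guards are my own, not from a specific library):

```python
import torch

def adaptive_robust_loss(x: torch.Tensor, alpha: torch.Tensor, c: torch.Tensor,
                         eps: float = 1e-6) -> torch.Tensor:
    """General robust loss L(x; alpha, c) from the expression above.

    The alpha -> 2 (L2) and alpha -> 0 (Cauchy) limits are evaluated
    directly, since the closed form divides by alpha and |alpha - 2|.
    """
    alpha = torch.as_tensor(alpha, dtype=x.dtype, device=x.device)
    c = torch.as_tensor(c, dtype=x.dtype, device=x.device)
    x_sq = (x / c) ** 2

    loss_l2 = 0.5 * x_sq                        # limit for alpha == 2
    loss_cauchy = torch.log1p(0.5 * x_sq)       # limit for alpha == 0

    # General case, guarded so the unselected branch never produces NaN/inf.
    safe_alpha = torch.where(alpha.abs() < eps, torch.full_like(alpha, eps), alpha)
    abs_am2 = (safe_alpha - 2).abs().clamp_min(eps)
    loss_gen = (abs_am2 / safe_alpha) * ((x_sq / abs_am2 + 1) ** (safe_alpha / 2) - 1)

    return torch.where(alpha.abs() < eps, loss_cauchy,
                       torch.where((alpha - 2).abs() < eps, loss_l2, loss_gen))
```

For example, `adaptive_robust_loss(r, torch.tensor(0.0), torch.tensor(1.0))` reproduces the Cauchy row of the table below, and `alpha = 2.0` the L2 row.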

Special cases

| $\alpha$ | Equivalent Loss | Behavior |
| --- | --- | --- |
| $2$ | L2 (Gaussian) | Outlier-sensitive, fast convergence |
| $1$ | Charbonnier / pseudo-Huber (smoothed L1) | Moderately robust |
| $0$ | Cauchy / Lorentzian | Heavy-tailed, robust |
| $-2$ | Geman–McClure | Very robust |
| $\to -\infty$ | Welsch / Leclerc | Bounded, ignores outliers |

Adaptive Mechanism

$\alpha$ and $c$ can be optimized jointly with the network, allowing the model to tune its own robustness (in the adaptive formulation they are fitted through the negative log-likelihood of the corresponding probability distribution, since the raw loss is monotonic in $\alpha$):

  • Early in training: smaller $\alpha$ → higher robustness to gross outliers
  • Later: $\alpha \to 2$ → smoother, faster convergence
  • This adaptivity stabilizes training on the long-tailed error distributions common in visual geometry (a sketch of the learnable parameters follows this list).
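
A minimal sketch of exposing $\alpha$ and $c$ as learnable parameters, reusing the `adaptive_robust_loss` sketch above; the bounds, initializations, and single shared $\alpha$ are illustrative choices rather than part of the design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LearnableRobustness(nn.Module):
    """Learnable (alpha, c) wrapper around adaptive_robust_loss (defined above).

    alpha is squashed into (alpha_lo, alpha_hi) with a sigmoid and c is kept
    positive with softplus, so both can be trained by backpropagation.
    Caveat: minimizing the bare loss over alpha pushes it towards its most
    robust end; the full adaptive formulation counteracts this with the
    log-partition term of the matching likelihood.
    """
    def __init__(self, alpha_lo: float = 0.001, alpha_hi: float = 2.0):
        super().__init__()
        self.alpha_lo, self.alpha_hi = alpha_lo, alpha_hi
        self.alpha_latent = nn.Parameter(torch.zeros(()))  # sigmoid(0) = 0.5 -> mid-range alpha
        self.c_latent = nn.Parameter(torch.zeros(()))       # softplus(0) ~ 0.69

    def forward(self, residuals: torch.Tensor) -> torch.Tensor:
        alpha = self.alpha_lo + (self.alpha_hi - self.alpha_lo) * torch.sigmoid(self.alpha_latent)
        c = F.softplus(self.c_latent) + 1e-6
        return adaptive_robust_loss(residuals, alpha, c).mean()
```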

Benefits

  • Unifies the standard robust losses above in a single parametric family
  • Adapts automatically to the noise level of the dataset
  • Removes most manual tuning of robustness hyperparameters
  • Widely used in SLAM, SfM, VO, and 3D reconstruction tasks

9. Evaluation Metrics

| Category | Metric |
| --- | --- |
| Geometry | Depth rel, $\tau$, ATE RMSE |
| Temporal consistency | Flow EPE, temporal Chamfer distance |
| Clustering | Adjusted Rand Index (ARI), mIoU |
| Scale | Relative Scale Error |
| Overall | Reconstruction quality over time |
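
A minimal sketch of two of these metrics, using one common definition each (function names are illustrative):

```python
import torch

def relative_scale_error(m_pred: torch.Tensor, m_gt: torch.Tensor) -> torch.Tensor:
    """Relative Scale Error in log-space, symmetric to over- and under-estimation."""
    return (torch.log(m_pred) - torch.log(m_gt)).abs()

def depth_abs_rel(d_pred: torch.Tensor, d_gt: torch.Tensor) -> torch.Tensor:
    """Absolute relative depth error: mean of |d_pred - d_gt| / d_gt over valid pixels."""
    return ((d_pred - d_gt).abs() / d_gt).mean()
```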



