2025 - Master Thesis and Project

4D Reconstruction, Connectomics, CVG, USZ, SCAI


  • ViT, DINOv3, Semantic-SAM, Diffusion, OpenScene, NExF, DiT,
  • 2025 - 3D Vision
  • 2025 - Seminar in Visual Computing
  • 2025 - Mixed Reality

  • The complexity of explicit methods stems from the quadratic dependency of pairwise geometric optimization, while the complexity of implicit methods stems from the linear dependency of neural inference. Through an Alternating-Attention mechanism aligned with a shared world coordinate frame, VGGT and St4RTrack recast the multi-view geometry problem as a linearly parallelizable feed-forward computation.

  • All deterministic models → cannot generate multiple plausible deformation trajectories. However, the goal of 4D reconstruction is not merely to reconstruct the surface deformation seen in a video, but to capture the underlying deformation space.
- Diffusion-generated high-frequency details (hair, fabric folds) often adhere to existing surfaces
- Deformable fields ensure that these details move consistently with body or surface movement


References


Polynomial vs. Recursive Construction (Essential Differences for ML & Geometry)

Aspect | Polynomial (Analytic / Global Form) | Recursive (de Casteljau / Local Form)
------ | ----------------------------------- | -------------------------------------
Influence of Control Points | Global: one control point affects the entire curve | Local: each segment depends only on nearby control points
Function Complexity | High-complexity global polynomial | Simple repeated linear interpolation
Learning Stability | Unstable (global coupling → noisy gradients) | Stable (local structure → smooth gradients)
Regularization | Weak: no inherent geometric constraints | Strong: recursive structure acts as built-in regularizer
Overfitting Risk | High | Low
Compatibility with ML | Poor for displacement or dynamic motion | Excellent for neural models (diffusion, deformation, 4D trajectories)
Extension to High Dimensions | Difficult (global interactions) | Easy (local updates generalize to 3D/4D motion)
Relation to Other Priors | – | Naturally compatible with B-Splines (local support) and natural parametrization (arc-length consistency)
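
As a concrete illustration of the recursive column above, a minimal de Casteljau sketch in Python (control points and parameter value are arbitrary):

import numpy as np

def de_casteljau(control_points, t):
    """Recursive (local) Bezier evaluation: repeated linear interpolation
    between neighbouring control points, instead of evaluating the global
    Bernstein polynomial directly. control_points: [N, D], t in [0, 1]."""
    pts = np.asarray(control_points, dtype=float)
    while len(pts) > 1:
        pts = (1.0 - t) * pts[:-1] + t * pts[1:]   # one local interpolation level
    return pts[0]

# example: cubic Bezier curve in 2D
ctrl = [[0, 0], [1, 2], [3, 3], [4, 0]]
print(de_casteljau(ctrl, 0.5))

Because every level is only a linear interpolation of neighbouring points, gradients with respect to the control points stay local and well-behaved, which is the stability property the table attributes to the recursive form.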


Experiments Notes 2025


Products

  • moondream



Open Problems in Traditional 3DV

Aspect | Unresolved Pain Point | Why Existing Methods Are Not Enough | Potential Directions to Solve | Existing Efforts (Yes / Partial / No)
------ | --------------------- | ----------------------------------- | ----------------------------- | -------------------------------------
1. Dynamic Scenes (non-static, non-rigid motion) | Traditional SfM assumes static scenes; moving or deformable objects are treated as outliers. | Even with feature-metric refinement, rigid-scene assumptions fail; motion causes inconsistent correspondences and reconstruction errors. | Introduce dynamic NeRF, scene-flow-based SfM, or implicit dynamic field modeling to capture motion and deformation over time. | Partial. Dynamic NeRFs and neural SDF-flow methods partially address non-rigid motion.
2. Illumination / Appearance / Time Variance | CNN features degrade under extreme lighting, weather, or long-term changes. | Feature spaces are static and lack temporal conditioning or appearance adaptation. | Develop time-conditioned encoders, appearance flow fields, or illumination-invariant feature spaces within SfM. | Partial. Some dynamic NeRF variants model lighting or appearance change, but classical SfM lacks such temporal modeling.
3. Extreme Viewpoint / Wide Baseline | Large viewpoint changes break local feature consistency and matching stability. | Descriptors cannot generalize across large baselines, occlusions, or drastic view changes. | Combine semantic, language, or diffusion priors for semantic-aware SfM that matches beyond local appearance. | Partial. Semantic-aware 3D reconstruction and NeRF methods show progress, but integration with classical SfM remains rare.
4. Sparse-Dense Gap | SfM yields sparse geometry; dense methods (MVS) use incompatible representations. | Sparse and dense optimization objectives differ, preventing unified reconstruction. | Employ unified implicit fields (feature fields, SDF, Gaussian splatting) that bridge sparse and dense representations. | Partial / Emerging. Implicit and Gaussian-based fields begin to unify sparse-dense paradigms.
5. Geometry-Semantic Alignment | Traditional SfM reconstructs only geometry, ignoring semantic consistency. | Lacks semantic identity or part-level alignment, limiting high-level scene understanding. | Integrate vision-language or semantic embeddings (e.g., CLIP, DINOv3) and enforce cross-view semantic regularization. | Partial. Semantic-aware 3D reconstruction is growing but still limited for geometry-based SfM.
6. Long-Term Consistency & Memory | Per-scene optimization causes drift; long-term or cross-session consistency is absent. | No temporal memory; reconstructions over time remain inconsistent. | Incorporate state-space models, latent geometry flow, or temporal latent dynamics for consistent long-term modeling. | Partial / Emerging. Some dynamic NeRFs and latent-flow models handle temporal coherence, but not integrated into SfM frameworks.


Projection for the Semantic Prior

Head Type | Alignment Level | Semantic Context | Output | Computation | Recommended Use
--------- | --------------- | ---------------- | ------ | ----------- | ----------------
Linear Head | Patch-wise | Local semantics | 3D patch blocks | ⚡ Fast (O(N×C)) | Gaussian Fur, fast inference
DPT Head | Multi-layer | Global + contextual | Dense 3D map | 🧠 3-5× heavier | Full 4D reconstruction, tracking


In the Latent Space - DINOv3 (ViT-S/14)

Stage | Symbol | Description | Shape
----- | ------ | ----------- | -----
Input image | – | – | [B, 3, H, W]
Patch embedding | – | 14×14 patch size → flatten | [B, N, 384], where N = (H/14)*(W/14)
Output feature dim | C_enc = 384 | semantic token embedding | –
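
A small shape-bookkeeping sketch for a ViT-S/14-style encoder. The torch.hub call loads DINOv2 ViT-S/14 as a stand-in; the exact DINOv3 entry point and output keys may differ, so treat those as assumptions:

import torch

B, H, W = 2, 224, 224
# DINOv2 ViT-S/14 used as a stand-in for a patch-14, 384-d encoder
model = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14").eval()
img = torch.randn(B, 3, H, W)                      # [B, 3, H, W]
with torch.no_grad():
    tokens = model.forward_features(img)["x_norm_patchtokens"]
N = (H // 14) * (W // 14)                          # 16 * 16 = 256 patches
print(tokens.shape, N)                             # [B, 256, 384], 256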


St4RTrack (DUSt3R backbone)

Parameter | Typical value (from dust3r/configs) | Meaning
--------- | ----------------------------------- | -------
enc_embed_dim | 768 or 1024 | dimension of encoder tokens
dec_embed_dim | 512 or 768 | dimension expected by the decoder
patch_size | 16 | one token ↔ 16×16 input pixels
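
Feeding 384-d DINO patch tokens into a DUSt3R-style decoder with the dimensions above requires some adapter; the sketch below is a hypothetical linear projection (the layer choice and placement are assumptions, and the 14-vs-16 patch-size mismatch, which also needs handling, is ignored here):

import torch

class SemanticPriorProjection(torch.nn.Module):
    """Hypothetical adapter: map DINO ViT-S/14 tokens (C_enc = 384) to the
    token width a DUSt3R-style decoder expects (e.g. dec_embed_dim = 768)."""
    def __init__(self, c_enc=384, dec_embed_dim=768):
        super().__init__()
        self.proj = torch.nn.Linear(c_enc, dec_embed_dim)
        self.norm = torch.nn.LayerNorm(dec_embed_dim)

    def forward(self, dino_tokens):                   # [B, N, 384]
        return self.norm(self.proj(dino_tokens))      # [B, N, 768]

tokens = torch.randn(2, 256, 384)
print(SemanticPriorProjection()(tokens).shape)        # torch.Size([2, 256, 768])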




2D Vision SSL Supervision


3D / 4D Reconstruction and Tracking



Conceptual Distinction

"Diffusion completion ≠ Dual-branch interaction modeling." Although diffusion models are powerful at image or feature completion (e.g., inpainting, conditional generation), their fundamental goal differs from that of a dual-branch Transformer: diffusion reconstructs plausible samples from noise (probabilistic generation), while a dual-branch Transformer explicitly learns correspondences and relationships between two inputs (deterministic interaction).

Backbone Type | Example Models | Characteristics | Advantages | Limitations
------------- | -------------- | --------------- | ---------- | -----------
U-Net (CNN) | DDPM, Stable Diffusion v1/v2 | Encoder-decoder with skip connections | Strong local spatial modeling; efficient; easy to train | Limited global context; harder to scale
Vision Transformer (ViT) | DiT (Diffusion Transformer, 2023) | Patch-based transformer | Better global attention, high scalability, simple structure | Expensive to train; requires large data
Swin Transformer / Hierarchical ViT | UViT, UDiT | Combines U-Net hierarchy + ViT attention | Preserves multi-scale info while gaining Transformer power | More complex to tune
ConvNeXt / ResNet | Some lightweight diffusion variants | CNN-based but more modern than U-Net | High efficiency, simple to integrate | Limited generative flexibility
Graph Neural Network (GNN) | Molecular / 3D diffusion models (e.g., EDM, GeoDiff) | Models relations over nodes/atoms | Structured data modeling (e.g., point clouds, molecules) | Not suited for grid data
Implicit MLP (NeRF-style) | Score-based NeRF / Diffusion Fields | Continuous signal representation | Excellent for 3D continuous spaces | Slow, not scalable for large images


Dealing With Continuous 3D Input Datasets

+---------------------------+
|  Discrete Token Sequence  |
|  ["age", "BMI", "asthma"] |
+------------+--------------+
             |
             v
+---------------------------+
|  Token Embedding Matrix   |
+---------------------------+
             |
             v
+---------------------------+
| Multi-Head Self-Attention |
|     O(N^2) complexity     |
+---------------------------+
             |
             v
+---------------------------+
|     Feed-Forward Layer    |
+---------------------------+
             |
             v
+---------------------------+
|   Output: Token-to-Token  |
|     symbolic reasoning    |
+---------------------------+

Continuous Modal Inputs Include:

  • images
  • depth maps
  • 3D point clouds
  • 3D meshes
  • medical waveforms (ECG, PPG, Doppler)
  • neural biosignals (EEG/MEG)

  • Issue: Discrete symbolic models cannot directly represent geometry.

Information Loss

 Continuous Data (image/depth/EEG/point cloud)
           |
           v
+------------------------------+
| Continuous Encoder (CNN/ViT) |
+------------------------------+
           |
           v
   Project to k "fake tokens"
           |
           v
+------------------------------+
| Traditional LLM Transformer  |
|   (expects symbolic tokens)  |
+------------------------------+
           |
           v
   LLM pretends to "understand"
    → But geometry/topology lost

Attended Master Thesis Project Defense at CVG

  • 01 Sep 2025 - 📍 Reconstructing Complete Garments with Foundation Models
    • Pattern Prediction on Fabric Recognition
    • 'Garments are both cultural artifacts and engineered products, but most generative models produce visuals that cannot be manufactured. This thesis introduces a foundation model for pattern-centric garment generation, where outputs are sewing patterns - panels, seams, and annotations - ready for CAD and simulation. A new tokenizer and multimodal dataset enable structured decoding from text or image inputs in a unified framework. In parallel, we investigate fabric recognition from large-scale product data, underscoring the challenge of linking garment shape to material behavior. Experiments show state-of-the-art pattern prediction, strong generalization, and predictable scaling. Together, these contributions move digital fashion toward simulation-ready, fabrication-oriented design.'
    • 2025 - AIpparel: A Multimodal Foundation Model for Digital Garments
    • 2025 - Single View Garment Reconstruction Using Diffusion Mapping Via Pattern Coordinates
  • 10 Sep 2025 - An Interactive, Foundation-Model-Empowered Video Annotation Interface for Constructing a Challenging Video Object Segmentation Dataset
    • SAM 2, DINOv2, GPT-4o, 📍 real-time Annotation
    • demo - nutsh
  • 02 Oct 2025 - VSLAM-LAB: A Comprehensive Framework for Visual SLAM Baselines and Datasets, pixi
    • 2024 - Gaussian Splatting SLAM, demo
  • 09 Oct 2025 - Controllable Visual Generation using 3D prior
    • HIL D 55.2, ETH Hönggerberg, 14:00
  • StreamSplat: A Framework for Self-Supervised, Online Novel View Synthesis
    • Friday Oct 10th, 15:00-15:30 Zurich Time
  • (Uncertainty-Aware 3D Mapping, Monday, October 13th, Zoom)
  • From Sensors to Solutions: Permanent Laser Scanning in 📍 Environmental Monitoring
    • Prof. Dr.-Ing. Daniel Czerwonka-Schröder, 22 Oct 2025, 4:45 p.m., HIL D 53, Hönggerberg
  • (iTwRL: Interactable Digital Twin for Reinforcement Learning, Nov 17th, 2pm, Zoom)
  • Learning 3D Human Foundation Models: A Data Request, Prof. Siyu Tang, Mon, 24-Nov-2025, 13:00-14:00, HG D 16.2
    • Computational methods to model human motion and behavior from visual inputs in real-world environments, 📍 Non-rigid 4D Reconstruction and Tracking
  • Guided Monocular Depth Estimation, Mon, Dec 8th, 2pm, CNB G 110 and Zoom, Sophie Selgrad


During Training

Implementation style | Number of backward calls | Gradient behavior | Characteristics
-------------------- | ------------------------ | ----------------- | ---------------
(loss1 + loss2).backward() | 1 | Computes gradients jointly | Simpler, but combined gradients are less controllable
Separate backward() calls | 2 | Computes gradients individually | Better suited for tasks requiring different weighting or multi-branch networks (e.g., semantic + diffusion loss)
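
A minimal PyTorch sketch of the two styles; the two heads and the 0.1 weighting are placeholders for, e.g., a semantic and a diffusion loss:

import torch
import torch.nn.functional as F

backbone = torch.nn.Linear(16, 16)
head_a, head_b = torch.nn.Linear(16, 4), torch.nn.Linear(16, 4)
x, target_a, target_b = torch.randn(8, 16), torch.randn(8, 4), torch.randn(8, 4)

# Style 1: one backward call on the weighted sum
feat = backbone(x)
loss1 = F.mse_loss(head_a(feat), target_a)            # e.g. semantic loss
loss2 = F.mse_loss(head_b(feat), target_b)            # e.g. diffusion loss
(loss1 + 0.1 * loss2).backward()

# Style 2: separate backward calls. retain_graph keeps the shared backbone
# graph alive for the second call; gradients accumulate in .grad, so each
# branch can be inspected, clipped, or re-weighted individually in between.
for m in (backbone, head_a, head_b):
    m.zero_grad()
feat = backbone(x)
loss1 = F.mse_loss(head_a(feat), target_a)
loss2 = F.mse_loss(head_b(feat), target_b)
loss1.backward(retain_graph=True)
loss2.backward()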

πŸ“ Semantic vs. Photometric Consistency

  • In traditional SfM / MVS / NeRF pipelines, pixel correspondence is established by enforcing photometric consistency across views:
\[L_{\text{photo}} = \| I_t(p) - I_{t'}(w(p)) \|,\]
  • where $I_t(p)$ is the pixel intensity at location $p$ in frame $t$, and $w(p)$ is the projection of that pixel into the target frame $t'$ using the estimated geometry.

  • In contrast, DINOv3, built on the Vision Transformer (ViT), replaces raw pixel comparison with semantic feature consistency:
\[L_{\text{semantic}} = \| f_{\text{DINO}}(I_t(p)) - f_{\text{DINO}}(I_{t'}(w(p))) \|,\]
  • where $f_{\text{DINO}}(\cdot)$ denotes patch-level semantic embeddings extracted by DINOv3.

  • Because these $f_{\text{DINO}}$ features are patch-level and semantically stable, they remain consistent under viewpoint changes, illumination variations, and partial occlusions, enabling robust cross-frame and cross-view alignment beyond raw photometric matching (see the sketch below).
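
A minimal sketch of such a semantic consistency term, assuming patch tokens from a frozen ViT encoder and a precomputed warp that maps each patch of frame t to a patch index in frame t' (a cosine distance is used here instead of the L2 form above):

import torch
import torch.nn.functional as F

def semantic_consistency_loss(feat_t, feat_tp, warp_idx):
    """feat_t, feat_tp: [B, N, C] patch tokens of frames t and t' from a
    frozen encoder (e.g. DINO). warp_idx: [B, N] index of the patch in
    frame t' that the estimated geometry w(p) maps each patch p onto."""
    feat_t = F.normalize(feat_t, dim=-1)
    feat_tp = F.normalize(feat_tp, dim=-1)
    warped = torch.gather(feat_tp, 1, warp_idx.unsqueeze(-1).expand_as(feat_tp))
    return (1.0 - (feat_t * warped).sum(dim=-1)).mean()   # mean cosine distance

# toy usage with random tokens and a random warp
loss = semantic_consistency_loss(torch.randn(2, 256, 384),
                                 torch.randn(2, 256, 384),
                                 torch.randint(0, 256, (2, 256)))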


  • Going from 2D images to 3D/4D reconstruction is a highly ill-posed inverse problem. Projection model:
\[I(x, y) = \Pi(X, Y, Z)\]
  • where the projection operator $\Pi$ maps a 3D point in the world coordinate space to a 2D pixel on the image plane

  • The inverse problem is:

\[\text{Given } I(x, y), \; \text{solve for } (X, Y, Z).\]
  • Two-dimensional pixel observations alone cannot uniquely deduce the true three-dimensional structure, so traditional methods rely heavily on geometric priors and multi-view constraints


Traditional Pairwise Pipeline (O(T²))

      I1 ── I2 ── I3 ── ... ── IT
       ↕    ↕    ↕          ↕
      (I1,I2), (I1,I3), (I2,I3), (I3,I4), ... (IT-1,IT)
       └────────────── Dense Pairwise Matching ──────────────┘

Anchor-based Pipeline (O(T))

       I1 ──────────▶ I2
        │            │
        │            ▼
        │──────────▶ I3
        │            │
        │            ▼
        │──────────▶ I4
        │            │
        │            ▼
        └──────────▶ IT
   (fixed anchor frame)

→ Each frame j only forms one pair (I1, Ij)
→ Each pair predicts (X^1_j, X^j_j) in the same world coordinate frame


Diffusion Models vs. Flow Matching

Problem Type | Diffusion Models | Flow Matching
------------ | ---------------- | -------------
Non-rigid deformation | Implicitly learns temporal consistency through noise-to-structure denoising | Requires explicit motion supervision
Photometric inconsistency | Learns semantic-level feature representations beyond pixel matching | Still relies on local intensity differences in continuous flow fields
Sparse-view limitation | Uses generative priors to "complete" missing geometry | Requires sufficient observation constraints
Ill-posed inverse problem | Models latent structure distributions probabilistically via diffusion | Deterministic ODE mapping, sensitive to noise
Temporal consistency | Implicit diffusion process provides natural temporal smoothness | Lacks explicit temporal regularization
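
For reference, the two training objectives being contrasted, in their standard forms (DDPM-style denoising and linear-interpolation conditional flow matching), not tied to any specific model in the table:

\[L_{\text{diff}} = \mathbb{E}_{x_0,\,\epsilon,\,t}\,\big\| \epsilon - \epsilon_\theta(x_t, t) \big\|^2, \qquad x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\epsilon\]

\[L_{\text{FM}} = \mathbb{E}_{x_0,\,x_1,\,t}\,\big\| v_\theta(x_t, t) - (x_1 - x_0) \big\|^2, \qquad x_t = (1-t)\,x_0 + t\,x_1\]

Diffusion regresses the noise added to a sample, while flow matching regresses a velocity field along a simple interpolation path; the deterministic ODE behaviour noted in the table comes from integrating this velocity field at inference time.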


Multi-View Matching

Problem Definition

  • Given a set of frames
    \(\{ I_1, I_2, \dots, I_T \},\)
  • and feature points extracted from each frame
    \(\{ f_i^t \},\)
  • the goal is to find cross-frame correspondences
    \(\pi: (f_i^t) \mapsto (f_j^{t+k}),\)
  • such that they represent the same real-world 3D point.

Why It Is NP-hard

  • This problem is equivalent to graph matching:

    • Each frame's feature points form a node set.
    • Correspondences between frames are edges.
    • Matches must satisfy both geometric (epipolar) and temporal consistency constraints.
    • The optimal matching minimizes:
    \[\min_{\pi} \sum_{t,k} \| P_t(f_i^t) - P_{t+k}(f_{\pi(i)}^{t+k}) \|^2,\]
  • where $P_t$ is the projection matrix. When the number of views exceeds two, the search space grows exponentially. Multi-view matching can be reduced to the Quadratic Assignment Problem (QAP), which is a classical NP-hard problem.

Simplified and Practical Approaches

Method | Principle | Time Complexity | Integration Module
------ | --------- | --------------- | ------------------
Soft Attention Matching (Transformer) | Replace hard matching with attention weights | O(T·N²) | St4RTrack Encoder
Epipolar Constraint Filtering | Geometric pre-filtering before soft match | O(N log N) | Projection Stage
Hough-Voting / DINO Semantic Alignment | Use semantic token similarity for weakly supervised matching | O(N) per frame | DINO Semantic Prior
Graph Cut / Sinkhorn Normalization | Approximate discrete matching via differentiable assignment | O(N³) | Differentiable Alignment
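
A minimal log-domain Sinkhorn sketch (no dustbin row/column, unlike e.g. SuperGlue), showing how a hard assignment is relaxed into a differentiable, approximately doubly-stochastic matching matrix:

import torch

def sinkhorn(scores, n_iters=20, eps=0.05):
    """scores: [N, M] similarity matrix between two feature sets.
    Returns a soft matching matrix whose rows and columns each sum to ~1."""
    log_p = scores / eps
    for _ in range(n_iters):
        log_p = log_p - torch.logsumexp(log_p, dim=1, keepdim=True)  # row normalize
        log_p = log_p - torch.logsumexp(log_p, dim=0, keepdim=True)  # column normalize
    return log_p.exp()

# toy example: match 4 features of frame t to 4 features of frame t+1
P = sinkhorn(torch.randn(4, 4))
print(P.sum(dim=1), P.sum(dim=0))   # both close to 1 after enough iterations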


Ground Truth for 4D Reconstruction

Dataset | Scene Type | Provided Ground Truth | Used for Supervision
------- | ---------- | --------------------- | --------------------
Point Odyssey (PO) | Dynamic scenes with known camera trajectory and per-vertex motion (long-term tracks) | 3D vertex positions over time (x, y, z, t) and camera extrinsics | ✅ Tracking branch: 4D vertex trajectories; ✅ Reconstruction branch: per-frame mesh depth
Dynamic Replica (DR) | Real scanned scenes with dynamic motion (camera + objects) | Known static/dynamic meshes and GT camera pose | ✅ Reconstruction: per-frame geometry; ✅ Tracking: vertex correspondences
Kubric | Synthetic multi-object scenes | Depth map, instance segmentation, camera pose | ✅ Reconstruction only (depth/geometry); weak tracking

Step-by-Step: Constructing GT Pointmaps

Example: Point Odyssey

Each frame provides the scene mesh vertices in world coordinates:

\[V_t = \{ v_k^t \in \mathbb{R}^3 \mid k = 1, \dots, N \}\]

where each $v_k^t$ is a 3D vertex position at time $t$.

For each image pixel $(u, v)$, find its corresponding mesh vertex (via rasterization or ray casting):

\[X_t(u,v) = \text{mesh2image}(V_t)\]

This is the ground-truth pointmap at time $t$.


1. Tracking Branch Supervision

For the same vertex across time $i \to j$:

\[X^i_j(u,v)^{GT} = V^j_k - V^i_k\]

provided by the dataset's 4D trajectories.


2. Reconstruction Branch Supervision

Per-frame depth maps or meshes provide supervision for:

\[X^j_j\]

which represents geometry reconstruction at time $j$.

All GT pointmaps are expressed in a unified world coordinate frame, transformed using the first frame’s camera extrinsics.


Aligned Results β€” How They Are Computed

  • During training and evaluation, predicted and GT pointmaps may differ in scale, rotation, or translation.
    Alignment ensures they are comparable.

Step 1: Global Median Scale Alignment (default)

  • For each sequence, the predicted and GT pointmaps are scale-normalized:
\[s = \text{median}\left( \frac{ \| GT_i \| }{ \| Pred_i \| } \right), \quad Pred \leftarrow s \cdot Pred\]

Step 2: SIM(3) Alignment (used in evaluation)

  • In Tables 4 and 5, an additional similarity transform alignment (scale + rotation + translation) is applied using the Procrustes algorithm:
\[\min_{R, t, s} \sum_i \| GT_i - (s R Pred_i + t) \|^2\]

The aligned prediction is then used to report metrics such as APD₃ᴰ and EPE.
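
A compact Umeyama/Procrustes sketch of the SIM(3) alignment in Step 2 (NumPy, no per-point weighting or outlier handling); the Step 1 median-scale alignment is the one-line comment at the end:

import numpy as np

def sim3_align(pred, gt):
    """Find s, R, t minimizing || gt - (s * R @ pred + t) ||^2. pred, gt: [N, 3]."""
    mu_p, mu_g = pred.mean(0), gt.mean(0)
    P, G = pred - mu_p, gt - mu_g
    U, S, Vt = np.linalg.svd(G.T @ P / len(pred))      # 3x3 cross-covariance
    D = np.eye(3)
    if np.linalg.det(U @ Vt) < 0:                      # keep a proper rotation
        D[2, 2] = -1
    R = U @ D @ Vt
    s = np.trace(np.diag(S) @ D) / P.var(axis=0).sum()
    t = mu_g - s * R @ mu_p
    return s, R, t

# Step 1 (global median scale) alone would be:
# s = np.median(np.linalg.norm(gt, axis=1) / np.linalg.norm(pred, axis=1))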


In summary:

  • Ground-truth pointmaps are rasterized from dataset-provided 4D meshes, expressed in the world coordinate frame.
    During evaluation, predictions are scale- or SIM(3)-aligned to these GT pointmaps before computing accuracy metrics.


Explicit vs. Implicit 3D Geometry Computation

Aspect | Explicit (Geometric Pipeline) | Implicit (Neural Pipeline)
------ | ----------------------------- | --------------------------
Example Setup | Suppose you have a video with 10 frames. | The same 10 frames are input to a neural model.
Pairwise Matching | You must construct 10 × 10 = 100 image pairs for feature matching. | No explicit pair construction: attention layers automatically learn inter-frame relations.
Feature Matching Cost | Each pair requires thousands of RANSAC iterations to reject outliers. | Correlation across all frames is learned once through self-attention and optimized end-to-end.
Pose Estimation | Each image pair needs PnP / Essential Matrix estimation to recover relative camera poses. | The network implicitly infers all camera poses from global attention and latent camera tokens.
Global Optimization | Requires Bundle Adjustment (BA) over thousands of variables (poses + 3D points). | A single forward pass of the network jointly refines all poses and points.
Computation Dependency | Each stage depends on the previous step (matching → pose → triangulation → BA), making the process sequential and non-parallelizable. | Entire pipeline is feed-forward; all operations are differentiable and GPU-parallelizable tensor computations.
Computational Complexity | Typically O(T²) to O(T³) due to pairwise matching and optimization across frames. | Approximately O(T) with Alternating-Attention (frame-wise + global), scalable to hundreds of frames.
Runtime | Minutes to hours, depending on number of frames and optimization steps. | Milliseconds to seconds for full reconstruction.
Memory Usage | High: needs to store large Jacobians, keypoints, and pairwise constraints. | Moderate: mainly token embeddings and attention maps.
Output | Camera intrinsics/extrinsics, sparse or dense 3D structure after optimization. | Cameras, depth maps, and dense world-space point maps produced directly from the network.
Parallelizability | Low: iterative geometric solvers are inherently serial. | High: all computations are matrix multiplications on GPU.
Interpretability | High (based on explicit geometry equations). | Lower: geometry is implicitly encoded in network weights.
Representative Methods | SfM, COLMAP, MVSNet, NeRF (explicit camera poses). | VGGT, St4RTrack, MapAnything (implicit world-frame prediction).


DL For 3D Reconstruction

Bottleneck Source | Limitation of Classical Geometry Methods | Deep Learning Improvement Strategy
----------------- | ---------------------------------------- | ----------------------------------
Dynamic Scenes | Assume the scene is static | Introduce temporal modeling (RNN / GRU / Transformer) and learn deformation fields (e.g., D-NeRF, HyperNeRF)
Sparse Viewpoints | Insufficient view redundancy | Use pretrained priors, shape priors, or diffusion priors to fill in missing geometric information
Real-time Requirement | Optimization is slow and iterative | Replace optimization with feed-forward neural networks and learned depth estimators
Weak or No Supervision | Depend on accurate labels and calibration | Train via photometric consistency and self-supervised losses (e.g., Monodepth, NeuralRecon)
Complex Illumination and Reflection | Simplified lighting model (Lambertian assumption) | Learn implicit neural representations that model reflection and BRDF properties
Temporal Consistency | Treat each frame independently | Apply ConvGRU, flow matching, or diffusion-based temporal smoothing to maintain cross-frame consistency


Representative Methods

Model | Year | Core Idea
----- | ---- | ---------
NeRF | 2020 | Learns an implicit volumetric rendering function \(f_\theta(x,d)\) to recover high-quality 3D scenes without traditional geometric optimization.
D-NeRF / HyperNeRF | 2021 | Extends NeRF by learning non-rigid deformations along the temporal dimension, enabling 4D dynamic reconstruction.
4D-LRM / 4D-Fly | 2025 | Combines large-scale pretraining and temporal-consistency modules to reconstruct scenes from any view at any time.


Random Matrix

  • Eigenvalues of large random matrices are statistically distributed across different systems

Wigner’s Semicircle Law

  • $\rho(\lambda) = \frac{1}{2\pi\sigma^{2}} \sqrt{4\sigma^{2} - \lambda^{2}}, \quad |\lambda| < 2\sigma$

Marchenko-Pastur Law

  • $\rho(\lambda) = \frac{1}{2\pi\sigma^{2} c \lambda} \sqrt{(\lambda_{+} - \lambda)(\lambda - \lambda_{-})}, \quad \lambda_{\pm} = \sigma^{2}(1 \pm \sqrt{c})^{2}$
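
A quick numerical sanity check of the semicircle law (σ = 1; the matrix size and bin count are arbitrary choices):

import numpy as np

n, sigma = 2000, 1.0
A = np.random.randn(n, n) * sigma
W = (A + A.T) / np.sqrt(2 * n)        # symmetric, entries with variance sigma^2 / n
eig = np.linalg.eigvalsh(W)
print(eig.min(), eig.max())           # close to -2*sigma and +2*sigma

# empirical spectral density vs. the semicircle rho(lambda)
hist, edges = np.histogram(eig, bins=50, range=(-2 * sigma, 2 * sigma), density=True)
lam = 0.5 * (edges[:-1] + edges[1:])
rho = np.sqrt(np.maximum(4 * sigma**2 - lam**2, 0.0)) / (2 * np.pi * sigma**2)
print(np.abs(hist - rho).max())       # rough agreement for large n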


Benchmarks and SOTAs

1. 4D

2. 3D

Key Contributions

[2000s] Classical 2D Stitching
   - Euclidean / Affine / Homography
   - Used in panoramas, satellite mosaics, medical imaging
   - Fast, lightweight, real-time

   Euclidean (3 DOF)
   ▒ → ▒
   Rigid rotation + shift
        |
        v
   Affine (6 DOF)
   ▒ → ⬠
   Parallel preserved (shear, scaling)
        |
        v
   Projective (8 DOF)
   ▒ → ⬳
   Perspective distortion (vanishing point)
-------------------------------------------------
        |
        v
[2010s] Multi-View Geometry
   - SfM (Structure from Motion), SLAM
   - SE(3) rigid motion + Bundle Adjustment
   - Full 3D scene reconstruction (static environments)
        |
        v
[2020s] Neural Implicit Representations
   - NeRF (Neural Radiance Fields)
   - Gaussian Splatting, Dynamic NeRF
   - Rich photorealistic 3D, supports dynamics
        |
        v
[2025 → ] Transformer & Foundation Models
   - VGGT (Geometry → Transformer sequence modeling)
   - DINOv3 (7B SSL backbone, dense visual features)
   - Replaces manual geometry → universal representations
   - Powers Pixel 10 AI (Gemini Nano + Tensor G5)



Neural Differential Equations

1. Core Idea

  • Neural Differential Equations (NDEs) generalize neural networks to continuous depth.
  • Instead of discrete layer updates, the hidden state evolves continuously over time according to an ordinary differential equation (ODE):
\[\frac{d\mathbf{z}(t)}{dt} = f_\theta(\mathbf{z}(t), t), \quad \mathbf{z}(t_0) = \mathbf{z}_0\]
  • The solution is obtained by integrating over time:
\[\mathbf{z}(t_1) = \mathbf{z}(t_0) + \int_{t_0}^{t_1} f_\theta(\mathbf{z}(t), t)\,dt\]
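
A minimal Neural ODE sketch, assuming the torchdiffeq package is available; hidden width, time grid, and solver defaults are arbitrary choices:

import torch
from torchdiffeq import odeint        # assumption: torchdiffeq is installed

class ODEFunc(torch.nn.Module):
    """f_theta(z, t): the learnable vector field dz/dt."""
    def __init__(self, dim=2):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(dim, 64), torch.nn.Tanh(), torch.nn.Linear(64, dim))

    def forward(self, t, z):
        return self.net(z)

func = ODEFunc()
z0 = torch.randn(16, 2)               # batch of initial states z(t0)
t = torch.linspace(0.0, 1.0, 10)
z_traj = odeint(func, z0, t)          # integrates dz/dt = f_theta(z, t)
print(z_traj.shape)                   # [10, 16, 2]: the state at each time point
# A loss on z_traj[-1] can be backpropagated directly; torchdiffeq's
# odeint_adjoint variant gives the memory-efficient gradients of Section 3.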

2. Comparison with Standard Neural Networks

Property | Standard NN | Neural Differential Equation
-------- | ----------- | ----------------------------
Structure | Discrete layers | Continuous dynamics
Forward pass | $h_{k+1} = f_\theta(h_k)$ | $\frac{dh}{dt} = f_\theta(h,t)$
Depth | Fixed | Continuous
Backpropagation | Chain rule | Adjoint sensitivity method
Interpretation | Layer mapping | Continuous-time dynamical system

3. Training via the Adjoint Method

  • Gradients are computed by solving an adjoint ODE backward in time:
\[\frac{da(t)}{dt} = -a(t)^\top \frac{\partial f_\theta(\mathbf{z}(t),t)}{\partial \mathbf{z}}, \quad \frac{dL}{d\theta} = -\int_{t_1}^{t_0} a(t)^\top \frac{\partial f_\theta(\mathbf{z}(t),t)}{\partial \theta}\,dt\]
  • This allows memory-efficient gradient computation since intermediate states do not need to be stored

4. Variants

  • Neural ODE: Deterministic dynamics
  • Neural SDE: Stochastic systems with noise
  • Neural PDE: Parameterized partial differential equations
  • Hamiltonian NN: Conserves physical energy
  • Controlled DE: Handles continuous control inputs

5. Physical Interpretation

  • The function $f_\theta$ acts as a learnable vector field that defines how the system evolves in time
  • This enables learning unknown physical dynamics directly from data:
\[\frac{\partial u}{\partial t} = f_{\text{known}}(u) + f_\theta(u)\]


Speed up Your Training in Multiple Ways

Category | Technique / Concept | Core Idea (One Line) | Example Hardware / Framework
-------- | ------------------- | -------------------- | ----------------------------
Kernel Optimization | AI Kernel Auto-Tuning (Triton / AutoTVM) | Automatically generate and fuse optimal CUDA kernels for specific tensor shapes. | NVIDIA H100, PyTorch 2.5, Triton 2
Memory Efficiency | FlashAttention 2 / Fused Ops | Combine attention and softmax in a single kernel to reduce memory and launch overhead. | A100 / H100 / RTX 4090
Precision Optimization | FP8 / INT8 Quantization-Aware Training | Use ultra-low precision arithmetic with adaptive scaling for faster, energy-efficient training. | NVIDIA Hopper, AMD MI300X
Graph Compilation | TorchInductor / XLA / MetalFX Graph Capture | Compile dynamic graphs into optimized static kernels for faster runtime. | PyTorch 2.x, TPU v6e, Apple M4 Max
Diffusion Optimization | Fused Diffusion Sampling | Merge denoising and upsampling into one fused kernel for diffusion-based training. | DiffusionRefine, Stable Diffusion Turbo
Adaptive Fine-Tuning | LoRA / QLoRA / BitNet Adapters | Parameter-efficient fine-tuning for large models on limited hardware. | RTX 6000 Ada, M2 Ultra, Edge TPU
Sequence Modeling | Mamba SSM / Linear Attention | Replace quadratic attention with state-space or linear-time models for long-sequence efficiency. | Transformer Engine (H100), FlashMamba
Hardware Co-Design | Grace Hopper / TPU v6e / MI300X | Unified CPU-GPU/TPU memory design enabling zero-copy tensor access. | NVIDIA GH200, Google TPU Pods
Distributed Optimization | FSDP + ZeRO + NVLink 4.0 | Fully sharded data parallelism and high-speed interconnect for multi-GPU scaling. | DGX H100 Cluster, NVSwitch
Emerging Paradigm | Neural Compilation & Auto-Scheduling | Learn to generate compute graphs and schedule execution automatically. | TVM Unity, Modular Mojo


Models

Points → Delaunay Triangulation (Triangles)
      ○         ○───────○
       \       / \     /
        ○─────○───○───○
       / \     \ /     \
      ○   ○─────○──────○


Points → Voronoi Diagram (Cells)
      ○     │     ○
     ┌┼─────┼─────┼┐
     │ Cell │ Cell │
  ○──┼──────┼──────┼──○
     │ Cell │ Cell │
     └┼─────┼─────┼┘
      ○     │     ○


Points → Poisson / α-shapes (Smooth Surface)
        ●──────────●
      ╱              ╲
    ●                  ●
    ╲                  ╱
      ●──────────────●


Points → Volumetric / TSDF (Voxel Grid)
   ▓▓▓▓▓
   ▓███▓    Each cube = voxel
   ▓▓▓▓▓


Points → Implicit Fields (SDF / NeRF)
   f(x,y,z) = 0  β†’ Surface
   Continuous function learned by NN
   "Shape emerges from equations"


Points → Modern Neural Models (GS / Transformer)
   ● Gaussian Splatting → soft blobs in 3D
   ● VGGT / MonST3R / PanSt3R → End-to-end feed-forward 3D/4D
   ● NeRF → Radiance fields, view-dependent rendering


Motion


Structure-from-Motion (SfM) Pipeline

Input: Multiple images (Image Sequence)
   ↓
1. Feature Extraction
   - Detect keypoints and compute descriptors  
   - Methods: SIFT, ORB, SuperPoint, D2-Net  
   ↓
2. Feature Matching
   - Find correspondences across images  
   - Techniques: Nearest Neighbor, RANSAC, StereoGlue  
   ↓
3. Camera Motion Estimation
   - Estimate relative poses using Essential / Fundamental Matrix  
   - Recover camera extrinsics (Rotation R, Translation t)  
   ↓
4. Triangulation
   - Back-project matched points  
   - Compute 3D scene points (sparse point cloud)  
   ↓
5. Bundle Adjustment (BA)
   - Global non-linear optimization  
   - Refine camera poses and 3D points  
   - Minimize reprojection error  
   ↓
6. Output
   - Optimized 3D point cloud (sparse or dense)  
   - Camera trajectory (motion path)  
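
A minimal two-view version of steps 1-4 above using OpenCV; the image paths and intrinsic matrix K are placeholders, and bundle adjustment / incremental registration are omitted:

import cv2
import numpy as np

img1 = cv2.imread("frame1.png", cv2.IMREAD_GRAYSCALE)    # placeholder paths
img2 = cv2.imread("frame2.png", cv2.IMREAD_GRAYSCALE)
K = np.array([[700., 0., 320.], [0., 700., 240.], [0., 0., 1.]])   # assumed intrinsics

# 1-2. feature extraction and matching
sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)
matches = cv2.BFMatcher(cv2.NORM_L2, crossCheck=True).match(des1, des2)
pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])

# 3. relative pose from the essential matrix (RANSAC rejects outliers)
E, inliers = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC, threshold=1.0)
_, R, t, _ = cv2.recoverPose(E, pts1, pts2, K, mask=inliers)

# 4. triangulate a sparse point cloud (first camera at the origin)
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([R, t])
X_h = cv2.triangulatePoints(P1, P2, pts1.T, pts2.T)
X = (X_h[:3] / X_h[3]).T                                  # [N, 3] 3D points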


Visual SLAM Pipeline

Input Images (RGB / RGB-D / Stereo)
        ↓
Front-End Tracking
   - Feature Extraction (ORB, SuperPoint)
   - Feature Matching (KLT, StereoGlue)
   - Motion Estimation (PnP, Essential Matrix)
        ↓
Back-End Optimization
   - Bundle Adjustment (BA)
   - Sliding Window Optimization
        ↓
Loop Closure
   - Place Recognition
   - Pose Graph Optimization
        ↓
Mapping
   - Sparse Map (Point Cloud)
   - Dense Map (Depth / Voxel / Mesh)
   - Semantic Map (Object / Scene Labels)
        ↓
Output: Robust Trajectory + Map


Visual Computing

2D → 3D Projection World                     Multi-View Segmentation World
═══════════════════════════════              ══════════════════════════════════
Pixel Point  →  Camera Intrinsics →          Multi-View Image →  Camera Extrinsics →
Corrected by Distortion → Project to 2D      Align Views Consistently → Back-Project to 3D
    ↓                   ↓                          ↓                       ↓
┌────────────┐   ┌────────────┐              ┌────────────┐   ┌────────────────┐
│ Pixel Coord│ → │ Metric Ray │      vs.     │ Seg. Mask  │ → │ 3D Point Cloud │
│ (u,v)      │   │ (K Matrix) │              │ (2D Image) │   │  or Voxels     │
└────────────┘   └────────────┘              └────────────┘   └────────────────┘
    ↓                   ↓                          ↓                 ↓
Distortion-Free    Accurate Geometry           Consistent 3D     Semantic Labels
Projection         Pixel β†’ Metric Space        Reconstruction   in 3D Space


Summary:
1. Intrinsics: Ensure pixels map to correct metric coordinates
2. Extrinsics: Align multi-view cameras consistently
3. Distortion Params: Correct lens errors
4. Projection: World point β†’ Image point
5. Back-Projection: Pixel + depth β†’ World point
6. Goal: Lift 2D segmentation masks into 3D semantic segmentation

Camera = Projector (2D Screen View)
Extrinsics = GPS for Camera Pose
Segmentation = Paint Mask that Becomes 3D Object
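
A tiny NumPy sketch of points 4 and 5 in the summary (projection and back-projection), with assumed intrinsics and identity extrinsics for simplicity:

import numpy as np

K = np.array([[600.0,   0.0, 320.0],
              [  0.0, 600.0, 240.0],
              [  0.0,   0.0,   1.0]])        # intrinsics (assumed values)
R, t = np.eye(3), np.zeros(3)                # extrinsics: world -> camera

def project(X_world):
    """World point [X, Y, Z] -> pixel (u, v)."""
    Xc = R @ X_world + t
    u, v, w = K @ Xc
    return np.array([u / w, v / w])

def back_project(u, v, depth):
    """Pixel (u, v) + depth (camera-space Z) -> world point."""
    ray = np.linalg.inv(K) @ np.array([u, v, 1.0])
    return R.T @ (depth * ray - t)

X = np.array([0.5, -0.2, 4.0])
u, v = project(X)
print(back_project(u, v, depth=4.0))         # recovers X up to numerics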


Classical SfM vs. VGGT

 Classical SfM / MVS World                   VGGT World
═══════════════════════════════════         ════════════════════════════════════
Find Keypoints → Match Pairs →              Drop Images → Transformer Thinks →
Estimate Pose → Triangulate →               One Forward Pass → Geometry Pops Out
Optimize BA β†’ Wait Forever                  (Pose, Depth, Points, Tracks in ms)
     ↓                   ↓                          ↓
┌─────────────┐   ┌──────────────┐           ┌───────────────┐   ┌────────────────┐
│ Feature     │ → │ Epipolar     │    vs.    │ Transformer   │ → │ Unified Outputs│
│ Matching    │   │ Geometry     │           │ Global Context│   │ (Pose+Depth+3D)│
└─────────────┘   └──────────────┘           └───────────────┘   └────────────────┘
     ↓                   ↓                          ↓                    ↓
Fragile Matches     Heavy Optimization         Robust Priors        Instant Geometry
(SIFT/SuperPoint)   (Bundle Adjustment)        Learned Attention    Feed-forward Only

Hybrid approaches:
1. Use classical SfM to bootstrap intrinsics → fine-tune with VGGT outputs
2. Combine hand-crafted geometry checks (epipolar) with learned global priors

Classical SfM = Puzzle Builder with Thousands of Pieces (slow, error-prone)
VGGT = Instant Polaroid Printer that Prints 3D (fast, all-in-one)


VGGT (Learning-driven)

Input Images
   ↓
Patch Embedding (DINO)
   ↓
Camera Tokens + Self-Attention
   ↓
Feed-forward Transformer
   ↓
Outputs (Direct Prediction)
   • Intrinsics K
   • Extrinsics [R | t]
   • Depth Maps
   • Point Maps
   • 3D Tracks


Why Squared (L2) Loss

1. Mathematical

  • Squaring makes the error smooth, continuous, and differentiable, which is required for gradient-based optimization

  • We update parameters using gradient descent:

\[\hat{a}_{k+1} = \hat{a}_k - \gamma \nabla L(\hat{a}_k)\]
  • To perform this optimization, the loss $L$ must be differentiable with respect to $\hat{a}$

  • Define the loss function as:

\[L = \| e \|^2 = e^\top e = \sum_i (b_i - F_i(\hat{a}))^2\]
  • Then, the gradient is:
\[\nabla_{\hat{a}} L = -2 J_F(\hat{a})^\top (b - F(\hat{a}))\]
  • This ensures,
    • Continuous and smooth gradient direction
    • Analytical update expression
    • Compatibility with automatic differentiation (autodiff)
  • If instead we used the absolute error (L1 norm):
\[L = \| e \| = \sum_i |b_i - F_i(\hat{a})|\]
  • the gradient would be discontinuous at $e = 0$, causing oscillation or instability during optimization (see the small check below).
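
A small autograd check of the gradient expression above for a linear $F(\hat{a}) = J\hat{a}$, contrasted with the L1 case:

import torch

J = torch.randn(5, 3)
b = torch.randn(5)
a = torch.randn(3, requires_grad=True)

r = b - J @ a
((r ** 2).sum()).backward()
print(torch.allclose(a.grad, -2 * J.T @ r.detach()))   # True: grad = -2 J^T (b - J a)

a.grad = None
((b - J @ a).abs().sum()).backward()
print(a.grad)   # = -J^T sign(b - J a): piecewise constant, undefined at zero residual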


2. Statistical

  • The squared loss corresponds to assuming Gaussian noise in the measurements

  • Assume the observation model:

\[b = F(\hat{a}) + \epsilon, \quad \epsilon \sim \mathcal{N}(0, \sigma^2 I)\]
  • Then the likelihood function is:
\[p(b \mid \hat{a}) = \frac{1}{Z} \exp\!\left(-\frac{1}{2\sigma^2}\| b - F(\hat{a}) \|^2\right)\]
  • Taking the negative log-likelihood (Maximum Likelihood Estimation):
\[-\log p(b \mid \hat{a}) = \frac{1}{2\sigma^2}\| b - F(\hat{a}) \|^2 + \text{const.}\]
  • Thus minimizing the squared loss is equivalent to Maximum Likelihood Estimation (MLE) under Gaussian noise


3. Optimization

  • The squared loss amplifies large errors and stabilizes convergence

  • Large residuals receive stronger penalties:

\[L = (b - F(\hat{a}))^2\]
  • Hence,
    • Large errors are corrected faster (rapid early convergence)
    • Small errors yield smaller gradients (smooth late convergence)


Visual Computing - Coursework

Tuesday (Topic) | Thursday (Topic)
--------------- | ----------------
Introduction to SfM - Overview of Structure-from-Motion, applications in photogrammetry, robotics, AR/VR. | The Multi-View Problem - From 2D images to 3D geometry, role of camera models.
Image Features I - Feature detection (SIFT, ORB). | Image Features II - Feature description and matching.
Epipolar Geometry - Essential matrix, Fundamental matrix. | RANSAC & Robust Estimation - Outlier rejection in correspondences.
Camera Pose Estimation I - PnP problem, intrinsics vs extrinsics. | Camera Pose Estimation II - Homography, motion from two views.
Triangulation I - Linear triangulation methods. | Triangulation II - Non-linear triangulation and uncertainty.
Incremental SfM - Sequential addition of cameras, growing reconstruction. | Global SfM - Joint optimization across all cameras.
Bundle Adjustment I - Definition and reprojection error. | Bundle Adjustment II - Nonlinear least squares, Levenberg-Marquardt optimization.
Sparse vs Dense Reconstruction - Difference between sparse SfM and dense MVS. | Multi-View Stereo (MVS) - PatchMatch, depth map fusion.
Structure Representation - Point clouds, meshes, voxel grids. | Surface Reconstruction - Poisson surface reconstruction and variants.
SfM in Practice I - COLMAP basics: input images, output formats. | SfM in Practice II - COLMAP visualization and debugging reconstruction.
Limitations of Traditional SfM - Drift, loop closure, scalability issues. | Robustness & Failures - Low-texture scenes, repetitive patterns, robustness strategies.
Extensions I: Dynamic Scenes - Non-rigid SfM, motion segmentation. | Extensions II: Large-Scale SfM - City-scale and aerial 3D reconstruction.
Learning-based SfM I - Deep feature matching (SuperGlue, LoFTR). | Learning-based SfM II - Neural reconstruction pipelines (DUSt3R, VGGT).
Future of SfM - From optimization-based to transformer-based methods. | SfM vs VGGT - COLMAP vs VGGT, comparison of pros and cons.



References 1

Frontiers in AI Research (2025)

  1. Long-Term Temporal & Structural Consistency
    • Key Results:
      • FlowFormer (CVPR '25): flow-matching for video coherence
      • VideoMamba ('25)
      • MemoryNeRF (NeurIPS '24): implicit scene memory across seconds
    • Opportunities:
      • scalable frame-level memory modules
      • layered geometric+semantic caching
      • dynamic scene understanding

πŸ“ 2. Self-Supervised Learning from Extreme Sparsity

  • Key Results:
    • SparseMAE (ICCV '23): masked autoencoding with <0.1 % tokens
    • Contrastive-Sparse (ICLR '24): adaptive masking focus on high-entropy regions
  • Goals:
    • near-fully-supervised performance with ‰-level labels
    • unified multi-task pretraining (classification, detection, generation)

πŸ“ 3. DiT (Diffusion Transformer)

  • Overview: Combines Transformer context modeling with diffusion denoising
  • Examples
    1. KeyFace - speech-driven face animation via stepwise denoising
    2. DiffLocks - high-fidelity hair generation
    3. Pippo - multi-view rendering with geometric and texture coherence
  • Benefit: Maintains character appearance/style across shots and supports conditional, coherent animation

4. Priors

  • Synthetic Priors (GASP, SynShot)
    • Generate "pseudo-real" head avatars (poses, expressions, lighting) to enrich training data
    • Improves generalization to extreme poses and rare expressions
  • Diffusion-based Priors (CAP4D, GAF)
    • Use pretrained diffusion models to produce high-quality 3D avatars or dynamic sequences
    • Accelerates multi-view/multi-expression data generation and boosts video consistency


4D

[ Multi-view Cameras + Rig Info ]
            ↓
         (Rig3R)
     3D Scene Understanding
            ↓
   [ BEV / Map / Agent Context ]
            ↓
        (DiffusionDrive)
   Multi-Modal Trajectory Generation
            ↓
     Control & Real-Time Driving


3D


2D

  • [ViT], [DINOv3], [SAM 3]


Some Products




References