2025 - Master Thesis
Extreme Sparsity Self-Supervised Learning / Multimodal Alignment / xx
Foundation Model Improvements / Cross-Modal Alignment
Topics
- Multimodal Alignment
  - cross-attention
  - contrastive learning
- Extreme Sparsity Self-Supervised Learning
  - Sparse Autoencoder
- Masked Autoencoders (MAE)
  - CVPR 2022 - Masked Autoencoders Are Scalable Vision Learners (extreme sparsity)
References
1. Spatiotemporal Video Encoding
- Feichtenhofer, C., Fan, H., Malik, J., & He, K. "SlowFast Networks for Video Recognition." CVPR 2019.
- Capture both spatial detail (handshape, body posture) and temporal dynamics (movement trajectories)
- Common backbones include ViViT- or TimeSformer-style vision Transformers and 3D CNNs such as SlowFast or MViT
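As a concrete illustration of the input layout these backbones consume, here is a minimal, hypothetical 3D-CNN encoder sketch in PyTorch (not SlowFast/MViT themselves): spatiotemporal convolutions over a (batch, channels, time, height, width) clip, pooled to a clip-level embedding.

```python
import torch
import torch.nn as nn

class Tiny3DEncoder(nn.Module):
    """Toy 3D-CNN: spatiotemporal convolutions over (B, C, T, H, W) clips."""
    def __init__(self, dim=128):
        super().__init__()
        # Stem: large spatial kernel, small temporal kernel (loosely SlowFast-inspired)
        self.stem = nn.Conv3d(3, 32, kernel_size=(3, 7, 7), stride=(1, 2, 2), padding=(1, 3, 3))
        self.block = nn.Sequential(
            nn.BatchNorm3d(32), nn.ReLU(),
            nn.Conv3d(32, dim, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm3d(dim), nn.ReLU(),
        )
        self.pool = nn.AdaptiveAvgPool3d(1)       # global spatiotemporal pooling

    def forward(self, clip):                      # clip: (B, 3, T, H, W)
        feat = self.block(self.stem(clip))
        return self.pool(feat).flatten(1)         # (B, dim) clip-level embedding

# clip = torch.randn(2, 3, 16, 112, 112); emb = Tiny3DEncoder()(clip)  # emb: (2, 128)
```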
2. Pose & Keypoint Extraction
- Cao, Z., Hidalgo Martinez, G., Simon, T., Wei, S.-E., & Sheikh, Y. "OpenPose: Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields." CVPR 2017.
- Real-time sign recognition often begins by detecting hand and body keypoints with tools like MediaPipe or OpenPose
- These skeletal sequences are then encoded by Graph Convolutional Networks (GCNs) or lightweight Transformers, reducing raw‐pixel complexity
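A minimal sketch of the skeleton-encoding step, assuming (x, y, confidence) per keypoint (e.g., from MediaPipe/OpenPose) and a hypothetical normalized adjacency matrix; real ST-GCN variants add temporal convolutions and learned edge weights.

```python
import torch
import torch.nn as nn

class SkeletonGCNLayer(nn.Module):
    """One graph-convolution layer over keypoints; A is a (V, V) joint adjacency."""
    def __init__(self, in_dim, out_dim, adjacency):
        super().__init__()
        self.register_buffer("A", adjacency)          # fixed skeleton graph
        self.proj = nn.Linear(in_dim, out_dim)

    def forward(self, x):                             # x: (B, T, V, C) keypoint features
        x = torch.einsum("vw,btwc->btvc", self.A, x)  # aggregate neighboring joints
        return torch.relu(self.proj(x))

# Example: V=21 hand keypoints with (x, y, confidence) per joint
V = 21
A = torch.eye(V)                                      # placeholder adjacency (self-loops only)
layer = SkeletonGCNLayer(3, 64, A / A.sum(-1, keepdim=True))
feats = layer(torch.randn(2, 32, V, 3))               # (B=2, T=32 frames, V, 64)
```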
📍 3. Multimodal Alignment & Fusion
- Transformer, ViLBERT
- Aligning visual signals (hand motion, appearance) with text tokens relies on robust cross-attention mechanisms
- Architectures employ encoder–decoder attention, co-attention blocks, and both early- and late-fusion strategies to integrate vision and language features
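A minimal sketch of one such cross-attention (encoder-decoder attention) block in PyTorch, with text tokens querying visual tokens; the dimensions and layer arrangement are illustrative assumptions rather than a specific published architecture.

```python
import torch
import torch.nn as nn

class CrossAttentionBlock(nn.Module):
    """Text tokens attend to visual tokens (encoder-decoder style cross-attention)."""
    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        self.norm_ffn = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, text_tokens, visual_tokens):    # both: (B, N, dim)
        # Queries come from the text stream, keys/values from the visual stream
        fused, _ = self.attn(self.norm_q(text_tokens),
                             self.norm_kv(visual_tokens),
                             self.norm_kv(visual_tokens))
        x = text_tokens + fused                       # residual over the text stream
        return x + self.ffn(self.norm_ffn(x))
```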
4. Low-Latency Decoding
- 📍 CTC + RNN-Transducer
- Graves, A., Fernández, S., Gomez, F., & Schmidhuber, J. "Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks." ICML 2006. - CTC
- Graves, A. "Sequence Transduction with Recurrent Neural Networks." arXiv:1211.3711, 2012. - RNN-T
- For interactive translation, non-autoregressive decoders such as CTC or RNN-Transducer enable single-pass inference
- Understanding the trade-offs between greedy decoding and beam search is essential for minimizing latency under resource constraints
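A minimal sketch of greedy CTC decoding, the low-latency end of the greedy-vs-beam-search trade-off: argmax per frame, collapse repeats, drop blanks. Input shape and blank id are assumptions.

```python
import torch

def ctc_greedy_decode(log_probs, blank_id=0):
    """Greedy CTC decoding over per-frame log-probabilities of shape (T, V)."""
    best = log_probs.argmax(dim=-1).tolist()   # most likely symbol per frame
    decoded, prev = [], None
    for tok in best:
        if tok != prev and tok != blank_id:    # collapse repeats, skip blanks
            decoded.append(tok)
        prev = tok
    return decoded                             # gloss/token ids in a single pass

# log_probs = torch.randn(50, 30).log_softmax(-1); ids = ctc_greedy_decode(log_probs)
```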
5. Adapter & Distillation Methods
- Adapter modules allow large backbones such as the Gemma family to be fine-tuned with minimal overhead. Techniques like LoRA, prefix-tuning, and adapter fusion inject low-rank or lightweight trainable updates into a frozen backbone
- Cross-modal distillation—combining KL divergence on output distributions with hidden-state alignment—transfers knowledge from teacher to student efficiently
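Minimal sketches of both ideas, under assumed PyTorch shapes: a LoRA-style linear layer that freezes the base weights and adds a trainable low-rank update, and a distillation loss combining temperature-scaled KL on logits with hidden-state MSE; rank, alpha, temperature, and beta are illustrative defaults.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update: W x + (alpha/r) * B (A x)."""
    def __init__(self, base: nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                       # backbone stays frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at start
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * F.linear(F.linear(x, self.A), self.B)

def distill_loss(student_logits, teacher_logits, student_h, teacher_h, T=2.0, beta=0.5):
    """Temperature-scaled KL on output distributions plus hidden-state MSE alignment."""
    kl = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    return kl + beta * F.mse_loss(student_h, teacher_h)
```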
6. Quantization & On-Device Compression
- SignGemma targets mobile and edge deployment, requiring post-training quantization (8-bit, 4-bit), mixed-precision (FP16/BF16), and optimized transformer kernels (e.g., FlashAttention)
- Mastery of these methods ensures model size and compute footprint fit constrained hardware
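A minimal example of post-training dynamic quantization with PyTorch's built-in utility; the toy two-layer model stands in for the actual translation network, and production pipelines would add static/4-bit schemes and optimized kernels on top.

```python
import torch
import torch.nn as nn

# Toy stand-in for the trained translation model
model = nn.Sequential(nn.Linear(512, 2048), nn.GELU(), nn.Linear(2048, 512)).eval()

# Post-training dynamic quantization: Linear weights stored as int8,
# activations quantized on the fly at inference time
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

fp32_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
print(f"FP32 weights: {fp32_bytes / 1e6:.1f} MB; int8 linears cut weight storage roughly 4x")
```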
7. Multilingual & 📍 Zero-Shot Transfer
- mBERT (Multilingual BERT): Devlin et al. "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." NAACL 2019.
- Supporting multiple sign languages (e.g., ASL, BSL) demands domain adaptation strategies
- Key approaches include multilingual pre-training, adapter-based language extension, and contrastive video–text learning to enable zero-shot transfer across language variants
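A minimal sketch of the contrastive video-text objective (CLIP-style symmetric InfoNCE), assuming paired video and text embeddings in the same batch; the temperature is an illustrative default.

```python
import torch
import torch.nn.functional as F

def video_text_contrastive_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired (video_i, text_i) embeddings."""
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature                       # (B, B) cosine similarities
    targets = torch.arange(v.size(0), device=v.device)   # matching pairs on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))
```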
- Text-to-Image Diffusion Models are Zero-Shot Video Generators
- Latent Memory Module
- Veo3 - Deepmind
- Gen-4 - Runway
- Movie Gen - Meta
- Flow Loss
Flow Matching Loss
Purpose
Enforce temporal smoothness by aligning latent representations of adjacent frames
Definition
\[\mathcal{L}_{\mathrm{flow}} = \bigl\lVert z_{t+1} - \mathrm{warp}(z_t, f_{t\to t+1}) \bigr\rVert_{1}\]
- $z_t, z_{t+1}$: latent features of frame $t$ and frame $t+1$
- $\mathrm{warp}(z_t, f_{t \to t+1})$: features $z_t$ warped by the predicted flow field $f_{t \to t+1}$
- $f_{t \to t+1}$: optical flow field predicted by a lightweight network
- Domain
- Flow Matching applies to video frames (temporal consistency)
- Alignment Target
- Flow Matching aligns adjacent frames’ latent features
- Warping Operation
- Flow Matching includes a warp based on optical flow
- Goal
- Flow Matching improves frame-to-frame coherence in generated video
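A minimal PyTorch sketch of the warping loss defined above, assuming pixel-unit flow and bilinear warping via grid_sample; the lightweight flow-prediction network itself is omitted.

```python
import torch
import torch.nn.functional as F

def warp_latent(z_t, flow):
    """Warp latent features z_t (B, C, H, W) by a pixel-unit flow field (B, 2, H, W)."""
    B, C, H, W = z_t.shape
    ys, xs = torch.meshgrid(
        torch.arange(H, device=z_t.device, dtype=z_t.dtype),
        torch.arange(W, device=z_t.device, dtype=z_t.dtype),
        indexing="ij",
    )
    grid_x = 2.0 * (xs.unsqueeze(0) + flow[:, 0]) / max(W - 1, 1) - 1.0   # normalize to [-1, 1]
    grid_y = 2.0 * (ys.unsqueeze(0) + flow[:, 1]) / max(H - 1, 1) - 1.0
    grid = torch.stack((grid_x, grid_y), dim=-1)                          # (B, H, W, 2)
    return F.grid_sample(z_t, grid, mode="bilinear", padding_mode="border", align_corners=True)

def flow_loss(z_t, z_t1, flow):
    """L_flow = || z_{t+1} - warp(z_t, f_{t->t+1}) ||_1 (mean-reduced)."""
    return (z_t1 - warp_latent(z_t, flow)).abs().mean()
```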
Total Loss
\[\mathcal{L}_{\mathrm{flow}} = \bigl\lVert z_{t+1} - \mathrm{warp}(z_t, f_{t\to t+1}) \bigr\rVert_{1}\]
References
Frontiers in AI Research (2025)
1. Efficient Multimodal Alignment & Generation
- Key Results:
- CLIPDraw++ (NeurIPS ’24): unified vision–language alignment
- Video-LLaMA (ICLR ’25): zero-shot text-to-video generation
- Challenges: real-time deployment, fine-grained controllability, safety/robustness
📍 2. Long-Term Temporal & Structural Consistency
- Key Results:
- FlowFormer (CVPR ’25): flow-matching for video coherence
- MemoryNeRF (NeurIPS ’24): implicit scene memory across seconds
- Opportunities:
- scalable frame-level memory modules
- layered geometric+semantic caching
- dynamic scene understanding
📍 3. Self-Supervised Learning from Extreme Sparsity
- Key Results:
- SparseMAE (ICML ’25): masked autoencoding with <0.1 % tokens
- Contrastive-Sparse (ICLR ’24): adaptive masking focus on high-entropy regions
- Goals:
- near-fully-supervised performance with ‰-level labels
- unified multi-task pretraining (classification, detection, generation)
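A minimal sketch of the extreme-sparsity masking step (MAE-style random token selection), with the keep ratio set to the per-mille regime mentioned above; names and shapes are illustrative.

```python
import torch

def extreme_sparse_mask(tokens, keep_ratio=0.001):
    """Keep only a tiny random subset of patch/frame tokens (B, N, D) for the encoder."""
    B, N, D = tokens.shape
    n_keep = max(1, int(N * keep_ratio))                  # e.g. ~0.1% of tokens
    noise = torch.rand(B, N, device=tokens.device)        # random score per token
    keep_idx = noise.argsort(dim=1)[:, :n_keep]           # lowest scores survive
    visible = torch.gather(tokens, 1, keep_idx.unsqueeze(-1).expand(-1, -1, D))
    return visible, keep_idx                              # encoder sees only `visible`

# tokens = torch.randn(4, 16384, 768); vis, idx = extreme_sparse_mask(tokens)  # vis: (4, 16, 768)
```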
4. Differentiable Physics & Hybrid Simulation
- Key Results:
- DiffPhys (NeurIPS ’24): end-to-end differentiable physics engine
- FluidNeRF (CVPR ’25): fluid simulation within NeRF framework
- Directions:
- trainable raytracing and material modules
- learned+classical simulator hybrids
- transferable “physical basis” representations
5. Verifiable Robustness & Explainable Security
- Key Results:
- Certified Diffusion Robustness (ICLR ’25)
- Provable Transformer Defenses (NeurIPS ’24)
- Imperatives:
- certified adversarial bounds
- causal traceability in generation/decision chains
- end-to-end system-level trust guarantees
📍 1. DiT (Diffusion Transformer)
- Overview: Combines Transformer context modeling with diffusion denoising
- Examples
- KeyFace – speech-driven face animation via stepwise denoising
- DiffLocks – high-fidelity hair generation
- Pippo – multi-view rendering with geometric and texture coherence
- Benefit: Maintains character appearance/style across shots and supports conditional, coherent animation
📍 2. Dyadic Models
- Concept: Model both speaking and listening behaviors for interactive avatars
- Examples
- INFP / DualTalk: dual-branch networks for speaker lip sync and listener micro‐expressions
- Insight: Ensures consistent identity/style in extended dialogues by modeling two-way interaction
3. Priors
- Synthetic Priors (GASP, SynShot)
- Generate “pseudo-real” head avatars (poses, expressions, lighting) to enrich training data
- Improves generalization to extreme poses and rare expressions
- Diffusion-based Priors (CAP4D, GAF)
- Use pretrained diffusion models to produce high-quality 3D avatars or dynamic sequences
- Accelerates multi-view/multi-expression data generation and boosts video consistency
4. Implications
- Architecture: Adopt DiT's diffusion-Transformer design for cross-scene, realistic face rendering
- Interaction Consistency: Integrate dyadic modeling to handle speaking and listening coherently
- Memory Extension: Add a latent memory module to preserve character traits across sessions
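A minimal, hypothetical sketch of what such a latent memory module could look like: a small slot memory read by attention and refreshed per frame/scene; slot count, update rule, and dimensions are all assumptions rather than a published design.

```python
import torch
import torch.nn as nn

class LatentMemory(nn.Module):
    """Toy slot memory: attention read of character traits, FIFO write of new state."""
    def __init__(self, dim=256, slots=16):
        super().__init__()
        self.init_memory = nn.Parameter(torch.randn(slots, dim) * 0.02)  # learned initial slots
        self.read = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.update = nn.GRUCell(dim, dim)

    def forward(self, frame_latent, memory=None):
        # frame_latent: (B, D) per-shot summary; memory: (B, S, D) carried across shots/sessions
        B, _ = frame_latent.shape
        if memory is None:
            memory = self.init_memory.unsqueeze(0).expand(B, -1, -1)
        context, _ = self.read(frame_latent.unsqueeze(1), memory, memory)    # read traits
        new_slot = self.update(frame_latent, memory.mean(dim=1))             # coarse write
        memory = torch.cat([memory[:, 1:], new_slot.unsqueeze(1)], dim=1)    # FIFO slot update
        return context.squeeze(1), memory
```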
- 2025 - Runway Gen-4 solves AI video's biggest problem: character consistency across scenes
- 2025 - New York is a Zoo
- 2025 - The Retrieval