2025 - Master Thesis and Project
4D Reconstruction, Connectomics, CVG, USZ, SCAI
Related Coursework
- ViT, DINOv3, Semantic-SAM, Diffusion, OpenScene, NExF, DiT
- 2025 - 3D Vision
- 2025 - Seminar in Visual Computing
- Complexity of Explicit methods stems from the quadratic dependency of geometric optimization; complexity of Implicit methods stems from the linear dependency of neural inference. VGGT & St4RTrack, through an Alternating-Attention mechanism aligned with world coordinates, recast the multi-view geometry problem as a linearly parallelizable forward computation.
- All deterministic models cannot generate multiple plausible deformation trajectories; however, the goal of 4D reconstruction is not merely to reconstruct the surface deformation of a video, but rather to capture the underlying deformation space.
- Diffusion-generated high-frequency details (hair, fabric folds) often adhere to existing surfaces
- Deformable fields ensure that these details move consistently with body or surface movement
References
- 2025 - SegSplat: Feed-forward Gaussian Splatting and Open-Set Semantic Segmentation, engineering one
- 2025 - 4DGT: Learning a 4D Gaussian Transformer Using Real-World Monocular Videos, NeurIPS 2025 (Spotlight)
- 2025 - Transfer learning between different computer vision tasks
- 2021 - Pixel-Perfect Structure-from-Motion with Featuremetric Refinement, CVG, ICCV
- 2025 - RocSync: Temporal Multi-Camera Synchronization, CVG
Polynomial vs. Recursive Construction (Essential Differences for ML & Geometry)
| Aspect | Polynomial (Analytic / Global Form) | Recursive (de Casteljau / Local Form) |
|---|---|---|
| Influence of Control Points | Global: one control point affects the entire curve | Local: each segment depends only on nearby control points |
| Function Complexity | High-complexity global polynomial | Simple repeated linear interpolation |
| Learning Stability | Unstable (global coupling → noisy gradients) | Stable (local structure → smooth gradients) |
| Regularization | Weak: no inherent geometric constraints | Strong: recursive structure acts as built-in regularizer |
| Overfitting Risk | High | Low |
| Compatibility with ML | Poor for displacement or dynamic motion | Excellent for neural models (diffusion, deformation, 4D trajectories) |
| Extension to High Dimensions | Difficult (global interactions) | Easy (local updates generalize to 3D/4D motion) |
| Relation to Other Priors | – | Naturally compatible with B-Splines (local support) and natural parametrization (arc-length consistency) |
Experiments Notes 2025
- Colab machines provide at most 12 CPU threads, so set `num_workers = 12` in the DataLoader (see the sketch after this list)
- pretrained_DINOv2, 85M .pth
- Model: DINOv2 ViT-S/14 Input: batch_size=1024, img_size=224x224, fp16, AdamW, DataLoader(num_workers=12)
- ViT-S/14, patch size = 14, Embedding Dim = 384, Parameters ~21M
- Feature Extraction Backbones
- third_party/MoGe
- dust3r/semantic_prior # ViT 16*16 patch -> pixels
- Upsampling - TTO style
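A minimal sketch of the loader/backbone setup noted above, assuming the public `facebookresearch/dinov2` torch.hub entry point and a hypothetical `data/train` image folder:

```python
import torch
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

transform = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),              # img_size = 224x224
    transforms.ToTensor(),
])
dataset = datasets.ImageFolder("data/train", transform=transform)   # hypothetical path
loader = DataLoader(dataset, batch_size=1024, shuffle=True,
                    num_workers=12, pin_memory=True)                 # Colab: <= 12 CPU threads

device = "cuda" if torch.cuda.is_available() else "cpu"
# DINOv2 ViT-S/14 backbone (~21M params, 384-d tokens) from torch.hub
backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14").to(device).eval()
optimizer = torch.optim.AdamW(backbone.parameters(), lr=1e-4)

imgs, _ = next(iter(loader))
amp_dtype = torch.float16 if device == "cuda" else torch.bfloat16    # fp16 on GPU
with torch.autocast(device_type=device, dtype=amp_dtype):
    feats = backbone(imgs.to(device))                                # [B, 384] image embeddings
print(feats.shape)
```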
Products
- moondream
Open Problems in Traditional 3DV
| Aspect | Unresolved Pain Point | Why Existing Methods Are Not Enough | Potential Directions to Solve | Existing Efforts (Yes / Partial / No) |
|---|---|---|---|---|
| 1. Dynamic Scenes (non-static, non-rigid motion) | Traditional SfM assumes static scenes; moving or deformable objects are treated as outliers. | Even with feature-metric refinement, rigid-scene assumptions fail; motion causes inconsistent correspondences and reconstruction errors. | Introduce dynamic NeRF, scene-flow-based SfM, or implicit dynamic field modeling to capture motion and deformation over time. | Partial. Dynamic NeRFs and neural SDF-flow methods partially address non-rigid motion. |
| 2. Illumination / Appearance / Time Variance | CNN features degrade under extreme lighting, weather, or long-term changes. | Feature spaces are static and lack temporal conditioning or appearance adaptation. | Develop time-conditioned encoders, appearance flow fields, or illumination-invariant feature spaces within SfM. | Partial. Some dynamic NeRF variants model lighting or appearance change, but classical SfM lacks such temporal modeling. |
| 3. Extreme Viewpoint / Wide Baseline | Large viewpoint changes break local feature consistency and matching stability. | Descriptors cannot generalize across large baselines, occlusions, or drastic view changes. | Combine semantic, language, or diffusion priors for semantic-aware SfM that matches beyond local appearance. | Partial. Semantic-aware 3D reconstruction and NeRF methods show progress, but integration with classical SfM remains rare. |
| 4. Sparse–Dense Gap | SfM yields sparse geometry; dense methods (MVS) use incompatible representations. | Sparse and dense optimization objectives differ, preventing unified reconstruction. | Employ unified implicit fields (feature fields, SDF, Gaussian splatting) that bridge sparse and dense representations. | Partial / Emerging. Implicit and Gaussian-based fields begin to unify sparse–dense paradigms. |
| 5. Geometry–Semantic Alignment | Traditional SfM reconstructs only geometry, ignoring semantic consistency. | Lacks semantic identity or part-level alignment, limiting high-level scene understanding. | Integrate vision-language or semantic embeddings (e.g., CLIP, DINOv3) and enforce cross-view semantic regularization. | Partial. Semantic-aware 3D reconstruction is growing but still limited for geometry-based SfM. |
| 6. Long-Term Consistency & Memory | Per-scene optimization causes drift; long-term or cross-session consistency is absent. | No temporal memory; reconstructions over time remain inconsistent. | Incorporate state-space models, latent geometry flow, or temporal latent dynamics for consistent long-term modeling. | Partial / Emerging. Some dynamic NeRFs and latent-flow models handle temporal coherence, but not integrated into SfM frameworks. |
Projection for the Semantic Prior
| Head Type | Alignment Level | Semantic Context | Output | Computation | Recommended Use |
|---|---|---|---|---|---|
| Linear Head | Patch-wise | Local semantics | 3D patch blocks | Fast (O(N×C)) | Gaussian Fur, fast inference |
| DPT Head | Multi-layer | Global + contextual | Dense 3D map | 3–5× heavier | Full 4D reconstruction, tracking |
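A hedged sketch of the "Linear Head" row: a patch-wise linear projection over DINO tokens; the dimensions are assumptions, not the project's actual head.

```python
import torch
import torch.nn as nn

class LinearPointHead(nn.Module):
    def __init__(self, embed_dim: int = 384, out_dim: int = 3):
        super().__init__()
        self.proj = nn.Linear(embed_dim, out_dim)     # O(N*C): one matmul per token

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: [B, N, C] patch embeddings -> [B, N, 3] per-patch 3D prediction
        return self.proj(tokens)

head = LinearPointHead()
tokens = torch.randn(2, 256, 384)                      # 224/14 = 16 -> 16*16 = 256 patches
print(head(tokens).shape)                              # torch.Size([2, 256, 3])
```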
In the Latent Space - DINOv3 (ViT-S/14)
| Stage | Symbol | Description | Shape |
|---|---|---|---|
| Input image | – | – | [B, 3, H, W] |
| Patch embedding | – | 14×14 patch size → flatten | [B, N, 384], where N = (H/14)·(W/14) |
| Output feature dim | C_enc = 384 | semantic token embedding | – |
St4RTrack (DUSt3R backbone)
| Parameter | Typical value (from dust3r/configs) | Meaning |
|---|---|---|
| enc_embed_dim | 768 or 1024 | dimension of encoder tokens |
| dec_embed_dim | 512 or 768 | dimension expected by the decoder |
| patch_size | 16 | one token ↔ 16×16 input pixels |
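A small sketch, under the assumption that 384-d DINOv3 semantic tokens are projected to the decoder width listed above before being injected into the DUSt3R decoder:

```python
import torch
import torch.nn as nn

enc_embed_dim = 384      # DINOv3 ViT-S/14 token dim (semantic prior)
dec_embed_dim = 768      # dimension expected by the decoder (config-dependent)

adapter = nn.Linear(enc_embed_dim, dec_embed_dim)      # simple token-dim adapter

semantic_tokens = torch.randn(1, 196, enc_embed_dim)   # [B, N, 384]
decoder_tokens = adapter(semantic_tokens)              # [B, N, 768]
print(decoder_tokens.shape)
```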
Navigation - Generalization
- 2025 - TANGO: Traversability-Aware Navigation with Local Metric Control for Topological Goals
- policies can be trained with diffusion models
2D Vision SSL Supervision
- 2024 - DINO-Foresight: Looking into the Future with DINO
- 2021 - DINO: Emerging Properties in Self-Supervised Vision Transformers
- 2024 - DINOv2: Learning Robust Visual Features without Supervision
- 2025 - DINOv3
3D / 4D Reconstruction and Tracking
- 2025 NeurIPS - Pixel-Perfect Depth with Semantics-Prompted Diffusion Transformers
- 2022 - Multi-layer perceptron-based computer vision neural networks
- 2020 - Learning 3D Reconstruction in Function Space
- 2025 - St4RTrack
- 2025 - MonST3R
- 2025 - PhysX-Anything: Simulation-Ready Physical 3D Assets from Single Image
- 2025 - SAM 3D
- 2024 - Physically Compatible 3D Object Modeling from a Single Image
- 2024 - DUSt3R: Geometric 3D Vision Made Easy
- cute demo - 2025 - TwoSquared: 4D Reconstruction from 2D Image Pairs
- ICCV 2025 - AnyCalib: On-Manifold Learning for Model-Agnostic Single-View Camera Calibration
- ICCV 2025, Multimodal Spatial Intelligence
- 2025 - Depth Anything 3: recovering the visual space from any views
- 2025 - This Camera System Can Focus on Everything, Everywhere, All At Once
- 2025 - Wayve.ai
- 2024 - UniDepth: Universal Monocular Metric Depth Estimation
- 2024 - DiffusionDrive
- 2024 - SplatFields - Neural Gaussian Splats for Sparse 3D and 4D Reconstruction
- 2011 - High-quality passive facial performance capture using anchor frames
- 2025 - MapAnything: Universal Feed-Forward Metric 3D Reconstruction
- 2025 - Multi-layer perceptron-based computer vision neural networks
- 2021 - KiloNeRF: Speeding up Neural Radiance Fields with Thousands of Tiny MLPs
Conceptual Distinction
"Diffusion completion ≠ dual-branch interaction modeling." Although diffusion models are powerful at image or feature completion (e.g., inpainting, conditional generation), their fundamental goal differs from that of a dual-branch Transformer. Diffusion reconstructs plausible samples from noise (probabilistic generation), while a dual-branch Transformer explicitly learns correspondences and relationships between two inputs (deterministic interaction).
| Backbone Type | Example Models | Characteristics | Advantages | Limitations |
|---|---|---|---|---|
| U-Net (CNN) | DDPM, Stable Diffusion v1/v2 | Encoder–decoder with skip connections | Strong local spatial modeling; efficient; easy to train | Limited global context; harder to scale |
| Vision Transformer (ViT) | DiT (Diffusion Transformer, 2023) | Patch-based transformer | Better global attention, high scalability, simple structure | Expensive to train; requires large data |
| Swin Transformer / Hierarchical ViT | UViT, UDiT | Combines U-Net hierarchy + ViT attention | Preserves multi-scale info while gaining Transformer power | More complex to tune |
| ConvNeXt / ResNet | Some lightweight diffusion variants | CNN-based but more modern than U-Net | High efficiency, simple to integrate | Limited generative flexibility |
| Graph Neural Network (GNN) | Molecular / 3D Diffusion models (e.g., EDM, GeoDiff) | Models relations over nodes/atoms | Structured data modeling (e.g., point cloud, molecules) | Not suited for grid data |
| Implicit MLP (NeRF-style) | Score-based NeRF / Diffusion Fields | Continuous signal representation | Excellent for 3D continuous spaces | Slow, not scalable for large images |
Dealing With Continuous 3D Input Datasets
+---------------------------+
| Discrete Token Sequence |
| ["age", "BMI", "asthma"] |
+------------+--------------+
|
v
+---------------------------+
| Token Embedding Matrix |
+---------------------------+
|
v
+---------------------------+
| Multi-Head Self-Attention |
| O(N^2) complexity |
+---------------------------+
|
v
+---------------------------+
| Feed-Forward Layer |
+---------------------------+
|
v
+---------------------------+
| Output: Token-to-Token |
| symbolic reasoning |
+---------------------------+
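A minimal single-head self-attention sketch illustrating the O(N²) token-to-token score matrix in the diagram above (toy dimensions, random weights):

```python
import torch
import torch.nn.functional as F

B, N, d = 1, 3, 8                              # 3 tokens, e.g. ["age", "BMI", "asthma"]
x = torch.randn(B, N, d)                       # token embeddings

Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))
q, k, v = x @ Wq, x @ Wk, x @ Wv

scores = q @ k.transpose(-2, -1) / d ** 0.5    # [B, N, N] -> quadratic in sequence length
attn = F.softmax(scores, dim=-1)
out = attn @ v                                 # [B, N, d] token-to-token mixing
print(scores.shape, out.shape)
```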
Continuous Modal Inputs Include:
- images
- depth maps
- 3D point clouds
- 3D meshes
- medical waveforms (ECG, PPG, Doppler)
- neural biosignals (EEG/MEG)
- Issue: Discrete symbolic models cannot directly represent geometry.
Information Loss
Continuous Data (image/depth/EEG/point cloud)
|
v
+------------------------------+
| Continuous Encoder (CNN/ViT) |
+------------------------------+
|
v
Project to k "fake tokens"
|
v
+------------------------------+
| Traditional LLM Transformer |
| (expects symbolic tokens) |
+------------------------------+
|
v
LLM pretends to "understand"
→ but geometry/topology is lost
Attended Master Thesis Project Defense at CVG
- 01 Sep 2025 - Reconstructing Complete Garments with Foundation Models
- Pattern Prediction on Fabric Recognition
- "Garments are both cultural artifacts and engineered products, but most generative models produce visuals that cannot be manufactured. This thesis introduces a foundation model for pattern-centric garment generation, where outputs are sewing patterns (panels, seams, and annotations) ready for CAD and simulation. A new tokenizer and multimodal dataset enable structured decoding from text or image inputs in a unified framework. In parallel, we investigate fabric recognition from large-scale product data, underscoring the challenge of linking garment shape to material behavior. Experiments show state-of-the-art pattern prediction, strong generalization, and predictable scaling. Together, these contributions move digital fashion toward simulation-ready, fabrication-oriented design."
- 2025 - AIpparel: A Multimodal Foundation Model for Digital Garments
- 2025 - Single View Garment Reconstruction Using Diffusion Mapping Via Pattern Coordinates
- 10 Sep 2025 - An Interactive, Foundation-Model-Empowered Video Annotation Interface for Constructing a Challenging Video Object Segmentation Dataset
- SAM 2, DINOv2, GPT-4o
- real-time annotation - demo - nutsh
- 02 Oct 2025 - VSLAM-LAB: A Comprehensive Framework for Visual SLAM Baselines and Datasets, pixi
- 2024 - Gaussian Splatting SLAM, demo
- 09 Oct 2025 - Controllable Visual Generation using 3D prior
- HIL D 55.2, ETH Hönggerberg, 14:00
- StreamSplat: A Framework for Self-Supervised, Online Novel View Synthesis
- Friday Oct 10th, 15:00-15:30 Zurich Time
- (Uncertainty-Aware 3D Mapping, Monday, October 13th, Zoom)
- From Sensors to Solutions: Permanent Laser Scanning in Environmental Monitoring
- Prof. Dr.-Ing. Daniel Czerwonka-Schröder, 22 Oct 2025, 4:45 p.m., HIL D 53, Hönggerberg
- (iTwRL: Interactable Digital Twin for Reinforcement Learning, Nov 17th, 2pm, Zoom)
- Learning 3D Human Foundation Models: A Data Request, Prof. Siyu Tang, Mon, 24-Nov-2025, 13:00–14:00, HG D 16.2
- Computational methods to model human motion and behavior from visual inputs in real-world environments
- Non-rigid 4D Reconstruction and Tracking
- Guided Monocular Depth Estimation, Mon, Dec 8th, 2pm, CNB G 110 and Zoom, Sophie Selgrad
During Training
| Implementation style | Number of backward calls | Gradient behavior | Characteristics |
|---|---|---|---|
| `(loss1 + loss2).backward()` | 1 | Computes gradients jointly | Simpler, but combined gradients are less controllable |
| Separate `backward()` calls | 2 | Computes gradients individually | Better suited for tasks requiring different weighting or multi-branch networks (e.g., semantic + diffusion loss) |
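A toy sketch of both styles from the table; the two losses are placeholders for, e.g., a semantic and a diffusion term.

```python
import torch

x = torch.randn(8, 4)
w = torch.randn(4, 2, requires_grad=True)

def losses(w):
    y = x @ w
    return y.pow(2).mean(), y.abs().mean()     # placeholder branch losses

# Style 1: one backward over the summed loss (joint gradient)
loss1, loss2 = losses(w)
(loss1 + loss2).backward()
g_joint = w.grad.clone(); w.grad = None

# Style 2: separate backward calls (per-branch control, e.g. reweighting or clipping)
loss1, loss2 = losses(w)
loss1.backward(retain_graph=True)              # keep graph alive for the second call
loss2.backward()                               # gradients accumulate into w.grad
print(torch.allclose(g_joint, w.grad))         # True: same total gradient here
```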
Semantic vs. Photometric Consistency
- In traditional SfM / MVS / NeRF pipelines, pixel correspondence is established by enforcing photometric consistency across views:
- $\min \sum_{p} \| I_t(p) - I_{t'}(w(p)) \|$
- where $I_t(p)$ is the pixel intensity at location $p$ in frame $t$, and $w(p)$ is the projection of that pixel into the target frame $t'$ using the estimated geometry.
- In contrast, DINOv3, built on the Vision Transformer (ViT), replaces raw pixel comparison with semantic feature consistency:
- $\min \sum_{p} \| f_{\text{DINO}}(I_t(p)) - f_{\text{DINO}}(I_{t'}(w(p))) \|$
- where $f_{\text{DINO}}(\cdot)$ denotes patch-level semantic embeddings extracted by DINOv3.
- Because these $f_{\text{DINO}}$ features are patch-level and semantically stable, they remain consistent under viewpoint changes, illumination variations, and partial occlusions, enabling robust cross-frame and cross-view alignment beyond raw photometric matching (see the loss sketch after this list).
- From 2D to 3D/4D reconstruction is a highly ill-posed inverse problem. Projection model:
- $x = \Pi(X), \quad X \in \mathbb{R}^3,\; x \in \mathbb{R}^2$
- where the projection operator $\Pi$ maps a 3D point in the world coordinate space to a 2D pixel on the image plane.
- The inverse problem is $X = \Pi^{-1}(x)$:
- Two-dimensional pixel observations alone cannot uniquely determine the true three-dimensional structure, so traditional methods rely heavily on geometric priors and multi-view constraints.
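A hedged sketch contrasting the two consistency terms above; the warped image and the DINO patch tensors are placeholders, not the thesis pipeline.

```python
import torch
import torch.nn.functional as F

def photometric_loss(I_t: torch.Tensor, I_t_warped: torch.Tensor) -> torch.Tensor:
    # || I_t(p) - I_t'(w(p)) ||_1 over all pixels
    return (I_t - I_t_warped).abs().mean()

def semantic_loss(f_t: torch.Tensor, f_t_warped: torch.Tensor) -> torch.Tensor:
    # 1 - cosine similarity between corresponding DINO patch embeddings
    f_t = F.normalize(f_t, dim=-1)
    f_t_warped = F.normalize(f_t_warped, dim=-1)
    return (1.0 - (f_t * f_t_warped).sum(dim=-1)).mean()

I_t, I_w = torch.rand(1, 3, 224, 224), torch.rand(1, 3, 224, 224)
f_t, f_w = torch.randn(1, 256, 384), torch.randn(1, 256, 384)   # [B, N, C] patch tokens
print(photometric_loss(I_t, I_w).item(), semantic_loss(f_t, f_w).item())
```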
Traditional Pairwise Pipeline (O(T²))
I1 ── I2 ── I3 ── ... ── IT
(I1,I2), (I1,I3), (I2,I3), (I3,I4), ... (IT-1,IT)
→ dense pairwise matching over all frame pairs
Anchor-based Pipeline (O(T))
I1 (fixed anchor frame) ──▶ I2
I1 ──▶ I3
I1 ──▶ I4
...
I1 ──▶ IT
→ each frame j only forms one pair (I1, Ij)
→ each pair predicts (X^1_j, X^j_j) in the same world coordinates
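A tiny sketch of the two pairing strategies, showing the O(T²) vs. O(T) pair counts:

```python
from itertools import combinations

T = 5
frames = list(range(1, T + 1))

pairwise = list(combinations(frames, 2))        # O(T^2) pairs: (1,2), (1,3), ..., (T-1,T)
anchor_based = [(1, j) for j in frames[1:]]     # O(T) pairs: anchor I1 with every Ij

print(len(pairwise), pairwise)                  # 10 pairs for T = 5
print(len(anchor_based), anchor_based)          # 4 pairs, all sharing frame 1's world frame
```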
Diffusion Models vs. Flow Matching
| Problem Type | Diffusion Models | Flow Matching |
|---|---|---|
| Non-rigid deformation | Implicitly learns temporal consistency through noise-to-structure denoising | Requires explicit motion supervision |
| Photometric inconsistency | Learns semantic-level feature representations beyond pixel matching | Still relies on local intensity differences in continuous flow fields |
| Sparse-view limitation | Uses generative priors to "complete" missing geometry | Requires sufficient observation constraints |
| Ill-posed inverse problem | Models latent structure distributions probabilistically via diffusion | Deterministic ODE mapping, sensitive to noise |
| Temporal consistency | Implicit diffusion process provides natural temporal smoothness | Lacks explicit temporal regularization |
Multi-View Matching
Problem Definition
- Given a set of frames \(\{ I_1, I_2, \dots, I_T \}\)
- and feature points extracted from each frame \(\{ f_i^t \}\),
- the goal is to find cross-frame correspondences \(\pi: f_i^t \mapsto f_j^{t+k}\),
- such that they represent the same real-world 3D point.
Why It Is NP-hard
- This problem is equivalent to graph matching:
- Each frame's feature points form a node set.
- Correspondences between frames are edges.
- Matches must satisfy both geometric (epipolar) and temporal consistency constraints.
- The optimal matching minimizes a reprojection-consistency objective over all candidate correspondences, where $P_t$ is the projection matrix. When the number of views exceeds two, the search space grows exponentially. Multi-view matching can be reduced to the Quadratic Assignment Problem (QAP), which is a classical NP-hard problem.
Simplified and Practical Approaches
| Method | Principle | Time Complexity | Integration Module |
|---|---|---|---|
| Soft Attention Matching (Transformer) | Replace hard matching with attention weights | O(T·N²) | St4RTrack Encoder |
| Epipolar Constraint Filtering | Geometric pre-filtering before soft match | O(N log N) | Projection Stage |
| Hough-Voting / DINO Semantic Alignment | Use semantic token similarity for weakly supervised matching | O(N) per frame | DINO Semantic Prior |
| Graph Cut / Sinkhorn Normalization | Approximate discrete matching via differentiable assignment | O(N³) | Differentiable Alignment |
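A hedged sketch of the Sinkhorn row: alternating row/column normalization turns a similarity matrix into a soft, differentiable assignment (feature dimensions are arbitrary):

```python
import torch

def sinkhorn(scores: torch.Tensor, n_iters: int = 20, tau: float = 0.1) -> torch.Tensor:
    # scores: [N, M] similarity matrix between features of two frames
    log_p = scores / tau
    for _ in range(n_iters):
        log_p = log_p - torch.logsumexp(log_p, dim=1, keepdim=True)  # row normalization
        log_p = log_p - torch.logsumexp(log_p, dim=0, keepdim=True)  # column normalization
    return log_p.exp()

feats_a = torch.nn.functional.normalize(torch.randn(64, 128), dim=-1)
feats_b = torch.nn.functional.normalize(torch.randn(64, 128), dim=-1)
P = sinkhorn(feats_a @ feats_b.T)               # soft, differentiable correspondence matrix
print(P.shape, P.sum(dim=1)[:3])                # row sums ≈ 1 after alternating normalization
```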
Ground Truth for 4D Reconstruction
| Dataset | Scene Type | Provided Ground Truth | Used for Supervision |
|---|---|---|---|
| Point Odyssey (PO) | Dynamic scenes with known camera trajectory and per-vertex motion (long-term tracks) | 3D vertex positions over time (x, y, z, t) and camera extrinsics | Tracking branch: 4D vertex trajectories; reconstruction branch: per-frame mesh depth |
| Dynamic Replica (DR) | Real scanned scenes with dynamic motion (camera + objects) | Known static/dynamic meshes and GT camera pose | Reconstruction: per-frame geometry; tracking: vertex correspondences |
| Kubric | Synthetic multi-object scenes | Depth map, instance segmentation, camera pose | Reconstruction only (depth/geometry); weak tracking |
Step-by-Step: Constructing GT Pointmaps
Example: Point Odyssey
Each frame provides the scene mesh vertices in world coordinates:
\[V_t = \{ v_k^t \in \mathbb{R}^3 \mid k = 1, \dots, N \}\]
where each $v_k^t$ is a 3D vertex position at time $t$.
For each image pixel $(u, v)$, find its corresponding mesh vertex (via rasterization or ray casting):
\[X_t(u,v) = \text{mesh2image}(V_t)\]
This is the ground-truth pointmap at time $t$.
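A hedged sketch of building a pointmap from a depth map plus camera parameters, an alternative route to mesh rasterization; all names and values below are placeholders.

```python
import numpy as np

def depth_to_world_pointmap(depth: np.ndarray, K: np.ndarray, T_wc: np.ndarray) -> np.ndarray:
    """depth: [H, W] metric depth, K: [3, 3] intrinsics, T_wc: [4, 4] camera-to-world."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).astype(np.float64)    # [H, W, 3]
    rays = pix @ np.linalg.inv(K).T                                         # normalized rays
    pts_cam = rays * depth[..., None]                                       # camera-frame points
    pts_h = np.concatenate([pts_cam, np.ones_like(depth)[..., None]], -1)   # homogeneous
    pts_world = pts_h @ T_wc.T                                              # to world frame
    return pts_world[..., :3]                                               # X_t(u, v)

depth = np.full((4, 4), 2.0)
K = np.array([[100.0, 0, 2], [0, 100.0, 2], [0, 0, 1]])
print(depth_to_world_pointmap(depth, K, np.eye(4)).shape)                   # (4, 4, 3)
```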
1. Tracking Branch Supervision
For the same vertex across time $i \to j$:
\[X^i_j(u,v)^{GT} = V^j_k - V^i_k\]
provided by the dataset's 4D trajectories.
2. Reconstruction Branch Supervision
Per-frame depth maps or meshes provide supervision for:
\[X^j_j\]
which represents geometry reconstruction at time $j$.
All GT pointmaps are expressed in a unified world coordinate frame, transformed using the first frameβs camera extrinsics.
Aligned Results β How They Are Computed
- During training and evaluation, predicted and GT pointmaps may differ in scale, rotation, or translation; alignment ensures they are comparable.
Step 1: Global Median Scale Alignment (default)
- For each sequence, the predicted and GT pointmaps are scale-normalized:
Step 2: SIM(3) Alignment (used in evaluation)
- In Tables 4 and 5, an additional similarity transform alignment (scale + rotation + translation) is applied using the Procrustes algorithm (see the sketch below):
The aligned prediction is then used to report metrics such as APD³ᴰ and EPE.
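A hedged sketch of the SIM(3) step as an Umeyama-style Procrustes fit (not the exact evaluation code):

```python
import numpy as np

def umeyama_sim3(pred: np.ndarray, gt: np.ndarray):
    """pred, gt: [N, 3] corresponding points; returns scale s, rotation R, translation t."""
    mu_p, mu_g = pred.mean(0), gt.mean(0)
    P, G = pred - mu_p, gt - mu_g
    cov = G.T @ P / len(pred)
    U, S, Vt = np.linalg.svd(cov)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(U @ Vt))])   # keep a proper rotation
    R = U @ D @ Vt
    var_p = (P ** 2).sum(axis=1).mean()                        # variance of the predictions
    s = np.trace(np.diag(S) @ D) / var_p
    t = mu_g - s * R @ mu_p
    return s, R, t

pred = np.random.randn(100, 3)
gt = 2.0 * pred + np.array([1.0, 0.0, -1.0])                   # known scale + shift, R = I
s, R, t = umeyama_sim3(pred, gt)
aligned = s * pred @ R.T + t
print(round(s, 3), np.abs(aligned - gt).max() < 1e-6)          # ≈ 2.0, True
```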
In summary:
- Ground-truth pointmaps are rasterized from dataset-provided 4D meshes, expressed in the world coordinate frame.
- During evaluation, predictions are scale- or SIM(3)-aligned to these GT pointmaps before computing accuracy metrics.
Explicit vs. Implicit 3D Geometry Computation
| Aspect | Explicit (Geometric Pipeline) | Implicit (Neural Pipeline) |
|---|---|---|
| Example Setup | Suppose you have a video with 10 frames. | The same 10 frames are input to a neural model. |
| Pairwise Matching | You must construct 10 × 10 = 100 image pairs for feature matching. | No explicit pair construction: attention layers automatically learn inter-frame relations. |
| Feature Matching Cost | Each pair requires thousands of RANSAC iterations to reject outliers. | Correlation across all frames is learned once through self-attention and optimized end-to-end. |
| Pose Estimation | Each image pair needs PnP / Essential Matrix estimation to recover relative camera poses. | The network implicitly infers all camera poses from global attention and latent camera tokens. |
| Global Optimization | Requires Bundle Adjustment (BA) over thousands of variables (poses + 3D points). | A single forward pass of the network jointly refines all poses and points. |
| Computation Dependency | Each stage depends on the previous step (matching → pose → triangulation → BA), making the process sequential and non-parallelizable. | Entire pipeline is feed-forward; all operations are differentiable and GPU-parallelizable tensor computations. |
| Computational Complexity | Typically O(T²) to O(T³) due to pairwise matching and optimization across frames. | Approximately O(T) with Alternating-Attention (frame-wise + global), scalable to hundreds of frames. |
| Runtime | Minutes to hours, depending on number of frames and optimization steps. | Milliseconds to seconds for full reconstruction. |
| Memory Usage | High: needs to store large Jacobians, keypoints, and pairwise constraints. | Moderate: mainly token embeddings and attention maps. |
| Output | Camera intrinsics/extrinsics, sparse or dense 3D structure after optimization. | Cameras, depth maps, and dense world-space point maps produced directly from the network. |
| Parallelizability | Low: iterative geometric solvers are inherently serial. | High: all computations are matrix multiplications on GPU. |
| Interpretability | High (based on explicit geometry equations). | Lower: geometry is implicitly encoded in network weights. |
| Representative Methods | SfM, COLMAP, MVSNet, NeRF (explicit camera poses). | VGGT, St4RTrack, MapAnything (implicit world-frame prediction). |
DL For 3D Reconstruction
| Bottleneck Source | Limitation of Classical Geometry Methods | Deep Learning Improvement Strategy |
|---|---|---|
| Dynamic Scenes | Assume the scene is static | Introduce temporal modeling (RNN / GRU / Transformer) and learn deformation fields (e.g., D-NeRF, HyperNeRF) |
| Sparse Viewpoints | Insufficient view redundancy | Use pretrained priors, shape priors, or diffusion priors to fill in missing geometric information |
| Real-time Requirement | Optimization is slow and iterative | Replace optimization with feed-forward neural networks and learned depth estimators |
| Weak or No Supervision | Depend on accurate labels and calibration | Train via photometric consistency and self-supervised losses (e.g., Monodepth, NeuralRecon) |
| Complex Illumination and Reflection | Simplified lighting model (Lambertian assumption) | Learn implicit neural representations that model reflection and BRDF properties |
| Temporal Consistency | Treat each frame independently | Apply ConvGRU, flow matching, or diffusion-based temporal smoothing to maintain cross-frame consistency |
Representative Methods
| Model | Year | Core Idea |
|---|---|---|
| NeRF | 2020 | Learns an implicit volumetric rendering function \(f_\theta(x,d)\) to recover high-quality 3D scenes without traditional geometric optimization. |
| D-NeRF / HyperNeRF | 2021 | Extends NeRF by learning non-rigid deformations along the temporal dimension, enabling 4D dynamic reconstruction. |
| 4D-LRM / 4D-Fly | 2025 | Combines large-scale pretraining and temporal-consistency modules to reconstruct scenes from any view at any time. |
Random Matrix
- The eigenvalue distributions of large random matrices follow universal statistical laws, largely independent of the specific system
Wigner's Semicircle Law
- $\rho(\lambda) = \frac{1}{2\pi\sigma^{2}} \sqrt{4\sigma^{2} - \lambda^{2}}, \quad |\lambda| \le 2\sigma$
Marčenko–Pastur Law
- $\rho(\lambda) = \frac{1}{2\pi\sigma^{2} c \lambda} \sqrt{(\lambda_{+} - \lambda)(\lambda - \lambda_{-})}, \quad \lambda_{\pm} = \sigma^{2}(1 \pm \sqrt{c})^{2}$
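A quick empirical check of the semicircle law on a large symmetric Gaussian matrix, with the scaling chosen so that σ = 1:

```python
import numpy as np

n, sigma = 2000, 1.0
A = np.random.randn(n, n)
H = (A + A.T) / np.sqrt(2 * n)             # symmetric Wigner matrix, entry variance 1/n
eigs = np.linalg.eigvalsh(H)

lam = np.linspace(-2 * sigma, 2 * sigma, 5)
rho = np.sqrt(np.maximum(4 * sigma ** 2 - lam ** 2, 0)) / (2 * np.pi * sigma ** 2)
print(eigs.min(), eigs.max())               # empirically close to the edges ±2σ
print(rho)                                  # semicircle density sampled on [-2σ, 2σ]
```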
Benchmarks and SOTAs
1. 4D
- 2025 - St4RTrack
- 2025 - MonST3R: A Simple Approach for Estimating Geometry in the Presence of Motion
2. 3D
- 2024 - AGILE3D
- 2016 - COLMAP 1 - baseline 1
- 2025 - COLMAP 2 - baseline 2
Key Contributions
[2000s] Classical 2D Stitching
- Euclidean / Affine / Homography
- Used in panoramas, satellite mosaics, medical imaging
- Fast, lightweight, real-time
Euclidean (3 DOF)
  Rigid rotation + shift
       |
       v
Affine (6 DOF)
  Parallelism preserved (shear, scaling)
       |
       v
Projective (8 DOF)
  Perspective distortion (vanishing point)
-------------------------------------------------
|
v
[2010s] Multi-View Geometry
- SfM (Structure from Motion), SLAM
- SE(3) rigid motion + Bundle Adjustment
- Full 3D scene reconstruction (static environments)
|
v
[2020s] Neural Implicit Representations
- NeRF (Neural Radiance Fields)
- Gaussian Splatting, Dynamic NeRF
- Rich photorealistic 3D, supports dynamics
|
v
[2025 β ] Transformer & Foundation Models
- VGGT (Geometry → Transformer sequence modeling)
- DINOv3 (7B SSL backbone, dense visual features)
- Replaces manual geometry → universal representations
- Powers Pixel 10 AI (Gemini Nano + Tensor G5)
Some Related Art works
- 2025 - VGGT
- 2023 - OpenScene - open set
- 2018 - GQN - SSL, Neural scene representation
Neural Differential Equations
1. Core Idea
- Neural Differential Equations (NDEs) generalize neural networks to continuous depth.
- Instead of discrete layer updates, the hidden state evolves continuously over time according to an ordinary differential equation (ODE): $\frac{dh(t)}{dt} = f_\theta(h(t), t)$
- The solution is obtained by integrating over time: $h(T) = h(0) + \int_{0}^{T} f_\theta(h(t), t)\, dt$ (see the sketch below)
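A minimal sketch of a neural ODE forward pass with explicit Euler integration, standing in for an adaptive solver; the vector-field architecture is an arbitrary choice.

```python
import torch
import torch.nn as nn

class VectorField(nn.Module):
    def __init__(self, dim: int = 4):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, 32), nn.Tanh(), nn.Linear(32, dim))

    def forward(self, h: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
        # f_theta(h, t): learnable vector field defining dh/dt
        return self.net(torch.cat([h, t.expand(h.shape[0], 1)], dim=-1))

def odeint_euler(f, h0, t0=0.0, t1=1.0, steps=50):
    h, dt = h0, (t1 - t0) / steps
    for k in range(steps):
        t = torch.tensor([[t0 + k * dt]])
        h = h + dt * f(h, t)                 # h_{k+1} = h_k + dt * f_theta(h_k, t_k)
    return h

f = VectorField()
h0 = torch.randn(8, 4, requires_grad=True)
hT = odeint_euler(f, h0)                     # h(T) = h(0) + integral of f_theta over time
hT.sum().backward()                          # gradients flow through the whole trajectory
print(hT.shape, h0.grad is not None)
```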
2. Comparison with Standard Neural Networks
| Property | Standard NN | Neural Differential Equation |
|---|---|---|
| Structure | Discrete layers | Continuous dynamics |
| Forward pass | $h_{k+1} = f_\theta(h_k)$ | $\frac{dh}{dt} = f_\theta(h,t)$ |
| Depth | Fixed | Continuous |
| Backpropagation | Chain rule | Adjoint sensitivity method |
| Interpretation | Layer mapping | Continuous-time dynamical system |
3. Training via the Adjoint Method
- Gradients are computed by solving an adjoint ODE backward in time: $\frac{da(t)}{dt} = -a(t)^{\top} \frac{\partial f_\theta(h(t),t)}{\partial h}$, where $a(t) = \frac{\partial L}{\partial h(t)}$
- This allows memory-efficient gradient computation since intermediate states do not need to be stored
4. Variants
- Neural ODE: Deterministic dynamics
- Neural SDE: Stochastic systems with noise
- Neural PDE: Parameterized partial differential equations
- Hamiltonian NN: Conserves physical energy
- Controlled DE: Handles continuous control inputs
5. Physical Interpretation
- The function $f_\theta$ acts as a learnable vector field that defines how the system evolves in time
- This enables learning unknown physical dynamics directly from data
Speed up Your Training in Multiple Ways
| Category | Technique / Concept | Core Idea (One Line) | Example Hardware / Framework |
|---|---|---|---|
| Kernel Optimization | AI Kernel Auto-Tuning (Triton / AutoTVM) | Automatically generate and fuse optimal CUDA kernels for specific tensor shapes. | NVIDIA H100, PyTorch 2.5, Triton 2 |
| Memory Efficiency | FlashAttention 2 / Fused Ops | Combine attention and softmax in a single kernel to reduce memory and launch overhead. | A100 / H100 / RTX 4090 |
| Precision Optimization | FP8 / INT8 Quantization-Aware Training | Use ultra-low precision arithmetic with adaptive scaling for faster, energy-efficient training. | NVIDIA Hopper, AMD MI300X |
| Graph Compilation | TorchInductor / XLA / MetalFX Graph Capture | Compile dynamic graphs into optimized static kernels for faster runtime. | PyTorch 2.x, TPU v6e, Apple M4 Max |
| Diffusion Optimization | Fused Diffusion Sampling | Merge denoising and upsampling into one fused kernel for diffusion-based training. | DiffusionRefine, Stable Diffusion Turbo |
| Adaptive Fine-Tuning | LoRA / QLoRA / BitNet Adapters | Parameter-efficient fine-tuning for large models on limited hardware. | RTX 6000 Ada, M2 Ultra, Edge TPU |
| Sequence Modeling | Mamba SSM / Linear Attention | Replace quadratic attention with state-space or linear-time models for long-sequence efficiency. | Transformer Engine (H100), FlashMamba |
| Hardware Co-Design | Grace Hopper / TPU v6e / MI300X | Unified CPUβGPU/TPU memory design enabling zero-copy tensor access. | NVIDIA GH200, Google TPU Pods |
| Distributed Optimization | FSDP + ZeRO + NVLink 4.0 | Fully sharded data parallelism and high-speed interconnect for multi-GPU scaling. | DGX H100 Cluster, NVSwitch |
| Emerging Paradigm | Neural Compilation & Auto-Scheduling | Learn to generate compute graphs and schedule execution automatically. | TVM Unity, Modular Mojo |
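A hedged PyTorch 2.x sketch combining a few of the rows above (graph compilation, mixed precision, fused AdamW); the model and sizes are placeholders.

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024)).to(device)
model = torch.compile(model)                                    # TorchInductor graph capture

opt = torch.optim.AdamW(model.parameters(), lr=1e-4,
                        fused=(device == "cuda"))               # fused optimizer kernel on GPU
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))  # loss scaling for fp16

x = torch.randn(64, 1024, device=device)
amp_dtype = torch.float16 if device == "cuda" else torch.bfloat16
with torch.autocast(device_type=device, dtype=amp_dtype):
    loss = model(x).pow(2).mean()
scaler.scale(loss).backward()
scaler.step(opt)
scaler.update()
print(loss.item())
```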
Models
Points → Delaunay Triangulation (triangles)
Points → Voronoi Diagram (cells)
Points → Poisson / α-shapes (smooth surface)
Points → Volumetric / TSDF (voxel grid; each cube = voxel)
Points → Implicit Fields (SDF / NeRF)
  f(x,y,z) = 0 → surface; a continuous function learned by a NN ("shape emerges from equations")
Points → Modern Neural Models (GS / Transformer)
  - Gaussian Splatting: soft blobs in 3D
  - VGGT / MonST3R / PanSt3R: end-to-end feed-forward 3D/4D
  - NeRF: radiance fields, view-dependent rendering
Motion
Structure-from-Motion (SfM) Pipeline
Input: Multiple images (image sequence)
   ↓
1. Feature Extraction
   - Detect keypoints and compute descriptors
   - Methods: SIFT, ORB, SuperPoint, D2-Net
   ↓
2. Feature Matching
   - Find correspondences across images
   - Techniques: Nearest Neighbor, RANSAC, StereoGlue
   ↓
3. Camera Motion Estimation
   - Estimate relative poses using Essential / Fundamental Matrix
   - Recover camera extrinsics (Rotation R, Translation t)
   ↓
4. Triangulation
   - Back-project matched points
   - Compute 3D scene points (sparse point cloud)
   ↓
5. Bundle Adjustment (BA)
   - Global non-linear optimization
   - Refine camera poses and 3D points
   - Minimize reprojection error
   ↓
6. Output
   - Optimized 3D point cloud (sparse or dense)
   - Camera trajectory (motion path)
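A hedged two-view sketch of stages 1–4 above with OpenCV on synthetic correspondences; a real pipeline adds feature matching, robust thresholds, and bundle adjustment.

```python
import cv2
import numpy as np

K = np.array([[800.0, 0, 320], [0, 800.0, 240], [0, 0, 1]])          # intrinsics

# Synthetic 3D points and a known relative pose (stand-ins for matched keypoints)
pts3d = np.random.uniform([-1, -1, 4], [1, 1, 8], size=(100, 3))
R_gt, _ = cv2.Rodrigues(np.array([0.0, 0.2, 0.0]))
t_gt = np.array([[0.5], [0.0], [0.0]])

def project(P, X):
    x = (K @ P @ np.c_[X, np.ones(len(X))].T).T
    return x[:, :2] / x[:, 2:]

pts1 = project(np.hstack([np.eye(3), np.zeros((3, 1))]), pts3d)       # view 1
pts2 = project(np.hstack([R_gt, t_gt]), pts3d)                        # view 2

# 3) Relative pose from the essential matrix, then 4) triangulation
E, _ = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC)
_, R, t, _ = cv2.recoverPose(E, pts1, pts2, K)
P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = K @ np.hstack([R, t])
X_h = cv2.triangulatePoints(P1, P2, pts1.T, pts2.T)
X = (X_h[:3] / X_h[3]).T                                              # sparse points (up to scale)
print(R.shape, t.shape, X.shape)
```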
Visual SLAM Pipeline
Input Images (RGB / RGB-D / Stereo)
   ↓
Front-End Tracking
   - Feature Extraction (ORB, SuperPoint)
   - Feature Matching (KLT, StereoGlue)
   - Motion Estimation (PnP, Essential Matrix)
   ↓
Back-End Optimization
   - Bundle Adjustment (BA)
   - Sliding Window Optimization
   ↓
Loop Closure
   - Place Recognition
   - Pose Graph Optimization
   ↓
Mapping
   - Sparse Map (Point Cloud)
   - Dense Map (Depth / Voxel / Mesh)
   - Semantic Map (Object / Scene Labels)
   ↓
Output: Robust Trajectory + Map
Visual Computing
2D → 3D Projection World:
  Pixel point → camera intrinsics; corrected by distortion → projected to 2D
  Pixel coord (u, v) → metric ray (K matrix)
  Distortion-free projection, accurate geometry: pixel → metric space

Multi-View Segmentation World:
  Multi-view images → camera extrinsics; align views consistently → back-project to 3D
  Segmentation mask (2D image) → 3D point cloud or voxels
  Consistent reconstruction, 3D semantic labels in 3D space
Summary:
1. Intrinsics: Ensure pixels map to correct metric coordinates
2. Extrinsics: Align multi-view cameras consistently
3. Distortion Params: Correct lens errors
4. Projection: World point β Image point
5. Back-Projection: Pixel + depth β World point
6. Goal: Lift 2D segmentation masks into 3D semantic segmentation
Camera = Projector (2D Screen View)
Extrinsics = GPS for Camera Pose
Segmentation = Paint Mask that Becomes 3D Object
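A small sketch of items 4–5 in the summary: project a world point to a pixel, then back-project pixel + depth to recover the same world point (K, R, t are made-up values):

```python
import numpy as np

K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])
R, t = np.eye(3), np.array([0.0, 0.0, 1.0])            # extrinsics (world -> camera)

def project(X_world):
    X_cam = R @ X_world + t
    uvw = K @ X_cam
    return uvw[:2] / uvw[2], X_cam[2]                   # pixel (u, v) and depth

def back_project(uv, depth):
    ray = np.linalg.inv(K) @ np.array([uv[0], uv[1], 1.0])
    X_cam = ray * depth
    return R.T @ (X_cam - t)                            # camera -> world

X = np.array([0.2, -0.1, 3.0])
uv, d = project(X)
print(uv, np.allclose(back_project(uv, d), X))          # True: lifted back to the same point
```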
Classical SfM vs. VGGT
Classical SfM / MVS World:
  Find keypoints → match pairs → estimate pose → triangulate → optimize BA → wait forever
  Feature matching → epipolar geometry
  Fragile matches (SIFT/SuperPoint), heavy optimization (Bundle Adjustment)

VGGT World:
  Drop images → Transformer thinks → one forward pass → geometry pops out
  (pose, depth, points, tracks in ms)
  Transformer global context → unified outputs (pose + depth + 3D)
  Robust learned-attention priors, instant feed-forward geometry
Hybrid approaches:
1. Use classical SfM to bootstrap intrinsics β fine-tune with VGGT outputs
2. Combine hand-crafted geometry checks (epipolar) with learned global priors
Classical SfM = Puzzle Builder with Thousands of Pieces (slow, error-prone)
VGGT = Instant Polaroid Printer that Prints 3D (fast, all-in-one)
VGGT (Learning-driven)
Input Images
   ↓
Patch Embedding (DINO)
   ↓
Camera Tokens + Self-Attention
   ↓
Feed-forward Transformer
   ↓
Outputs (Direct Prediction)
   - Intrinsics K
   - Extrinsics [R | t]
   - Depth Maps
   - Point Maps
   - 3D Tracks
Why Squared (L2) Loss
1. Mathematical
- Squaring makes the error smooth, continuous, and differentiable, which is required for gradient-based optimization
- We update parameters using gradient descent: $\hat{a} \leftarrow \hat{a} - \eta \frac{\partial L}{\partial \hat{a}}$
- To perform this optimization, the loss $L$ must be differentiable with respect to $\hat{a}$
- Define the loss function as: $L = (a - \hat{a})^2$
- Then, the gradient is: $\frac{\partial L}{\partial \hat{a}} = -2(a - \hat{a})$
- This ensures:
- Continuous and smooth gradient direction
- Analytical update expression
- Compatibility with automatic differentiation (autodiff)
- If instead we used the absolute error (L1 norm): $L = |a - \hat{a}|$,
- the gradient would be discontinuous at $e = a - \hat{a} = 0$, causing oscillation or instability during optimization.
2. Statistical
- The squared loss corresponds to assuming Gaussian noise in the measurements
- Assume the observation model: $a = \hat{a} + \epsilon, \quad \epsilon \sim \mathcal{N}(0, \sigma^2)$
- Then the likelihood function is: $p(a \mid \hat{a}) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(a - \hat{a})^2}{2\sigma^2}\right)$
- Taking the negative log-likelihood (Maximum Likelihood Estimation): $-\log p(a \mid \hat{a}) = \frac{(a - \hat{a})^2}{2\sigma^2} + \text{const}$
- Thus minimizing the squared loss is equivalent to Maximum Likelihood Estimation (MLE) under Gaussian noise
3. Optimization
- The squared loss amplifies large errors and stabilizes convergence
- Large residuals receive stronger penalties: $\frac{\partial L}{\partial e} = 2e$ grows linearly with the residual $e$
- Hence,
- Large errors are corrected faster (rapid early convergence)
- Small errors yield smaller gradients (smooth late convergence)
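A tiny autograd check of the gradient behavior described above: the L2 gradient scales with the residual (2e), while the L1 gradient has constant magnitude (sign of e).

```python
import torch

for e0 in [5.0, 0.1]:
    e = torch.tensor(e0, requires_grad=True)
    (e ** 2).backward()                                  # d(e^2)/de = 2e
    g_l2 = e.grad.item()
    e = torch.tensor(e0, requires_grad=True)
    e.abs().backward()                                   # d|e|/de = sign(e)
    g_l1 = e.grad.item()
    print(f"residual {e0}: dL2/de = {g_l2}, dL1/de = {g_l1}")   # 10 vs 1, then 0.2 vs 1
```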
Visual Computing - Coursework
| Tuesday (Topic) | Thursday (Topic) |
|---|---|
| Introduction to SfM Overview of Structure-from-Motion, applications in photogrammetry, robotics, AR/VR. | The Multi-View Problem From 2D images to 3D geometry, role of camera models. |
| Image Features I Feature detection (SIFT, ORB). | Image Features II Feature description and matching. |
| Epipolar Geometry Essential matrix, Fundamental matrix. | RANSAC & Robust Estimation Outlier rejection in correspondences. |
| Camera Pose Estimation I PnP problem, intrinsics vs extrinsics. | Camera Pose Estimation II Homography, motion from two views. |
| Triangulation I Linear triangulation methods. | Triangulation II Non-linear triangulation and uncertainty. |
| Incremental SfM Sequential addition of cameras, growing reconstruction. | Global SfM Joint optimization across all cameras. |
| Bundle Adjustment I Definition and reprojection error. | Bundle Adjustment II Nonlinear least squares, LevenbergβMarquardt optimization. |
| Sparse vs Dense Reconstruction Difference between sparse SfM and dense MVS. | Multi-View Stereo (MVS) PatchMatch, depth map fusion. |
| Structure Representation Point clouds, meshes, voxel grids. | Surface Reconstruction Poisson surface reconstruction and variants. |
| SfM in Practice I COLMAP basics: input images, output formats. | SfM in Practice II COLMAP visualization and debugging reconstruction. |
| Limitations of Traditional SfM Drift, loop closure, scalability issues. | Robustness & Failures Low-texture scenes, repetitive patterns, robustness strategies. |
| Extensions I: Dynamic Scenes Non-rigid SfM, motion segmentation. | Extensions II: Large-Scale SfM City-scale and aerial 3D reconstruction. |
| Learning-based SfM I Deep feature matching (SuperGlue, LoFTR). | Learning-based SfM II Neural reconstruction pipelines (DUSt3R, VGGT). |
| Future of SfM From optimization-based to transformer-based methods. | SfM vs VGGT COLMAP vs VGGT, comparison of pros and cons. |
References 1
Frontiers in AI Research (2025)
- Long-Term Temporal & Structural Consistency
- Key Results:
- FlowFormer (CVPR '25): flow-matching for video coherence
- VideoMamba ('25)
- MemoryNeRF (NeurIPS '24): implicit scene memory across seconds
- Opportunities:
- scalable frame-level memory modules
- layered geometric+semantic caching
- dynamic scene understanding
2. Self-Supervised Learning from Extreme Sparsity
- Key Results:
- SparseMAE (ICCV '23): masked autoencoding with <0.1% of tokens
- Contrastive-Sparse (ICLR '24): adaptive masking focused on high-entropy regions
- Goals:
- near-fully-supervised performance with ‰-level labels
- unified multi-task pretraining (classification, detection, generation)
3. DiT (Diffusion Transformer)
- Overview: Combines Transformer context modeling with diffusion denoising
- Examples
- KeyFace: speech-driven face animation via stepwise denoising
- DiffLocks: high-fidelity hair generation
- Pippo: multi-view rendering with geometric and texture coherence
- Benefit: Maintains character appearance/style across shots and supports conditional, coherent animation
4. Priors
- Synthetic Priors (GASP, SynShot)
- Generate "pseudo-real" head avatars (poses, expressions, lighting) to enrich training data
- Improves generalization to extreme poses and rare expressions
- Diffusion-based Priors (CAP4D, GAF)
- Use pretrained diffusion models to produce high-quality 3D avatars or dynamic sequences
- Accelerates multi-view/multi-expression data generation and boosts video consistency
4D
- 2025 - 4DNex - 4D scene understanding
- 2024 - CAT4D - 4D reconstruction from video
[ Multi-view Cameras + Rig Info ]
   ↓
(Rig3R)
3D Scene Understanding
   ↓
[ BEV / Map / Agent Context ]
   ↓
(DiffusionDrive)
Multi-Modal Trajectory Generation
   ↓
Control & Real-Time Driving
3D
- 2025 - VGGT
- 2023 - OpenScene
- 2024 - Segment3D
- 2023 - AGILE3D
- [COLMAP], [GLOMAP]
2D
- [ViT], [DINOv3], [SAM 3]
Some Products
- 2025 - RealityScan
- 3DV projects 2024
- [3DV projects 2025]
References
- 2023 - Point Cloud Pre-training with Diffusion Models
- 2025 - Harnessing Text-to-Image Diffusion Models for Point Cloud Self-Supervised Learning
- 2021 - The fishyscapes benchmark: Measuring blind spots in semantic segmentation
- 2025 - SNI-SLAM++: Tightly-Coupled Semantic Neural Implicit SLAM
- 2025 - osmAG-LLM: Zero-Shot Open-Vocabulary Object Navigation via Semantic Maps and Large Language Models Reasoning
- 2025 - ProcGen3D: Learning Neural Procedural Graphs for Image-to-3D Reconstruction
- 2024 - Physics3D: Learning Physical Properties of 3D Gaussians via Video Diffusion
- 2022 - Understanding Uncertainty Maps in Vision with Statistical Testing
- 2025 - Pixels2Points: Fusing 2D and 3D Features for Facial Skin Segmentation