2025 - Master Thesis 2

A Feed-Forward Network for 4D Semantic and Instance Segmentation


Topics


2018 - GQN

πŸ“ 2025 - MonST3R




Key Contributions


An FFN for 4D Segmentation - Semantic and Instance




Benchmarks and SOTAs


1. 4D





2. 3D


πŸ“ 2025 - MonST3R: A Simple Approach for Estimating Geometry in the Presence of Motion


1. Geometry-Centric 3D Models

DUSt3R (CVPR 2024)

  • 2D Images → Feature Matching → 3D Structure

MASt3R (ECCV 2024)

  • Images → 3D-Aware Matching → Precise Geometry

VGGT (CVPR 2025)

  • Image Sequences → Geometry-Grounded Attention → 3D Pose & Structure
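The "Feature Matching" stage in these pipelines can be illustrated with a generic mutual nearest-neighbor matcher over descriptor vectors (a minimal sketch, not any of these models' actual matchers; `mutual_nn_matches` is a hypothetical name):

```python
import numpy as np

def mutual_nn_matches(desc_a: np.ndarray, desc_b: np.ndarray):
    """Mutual nearest-neighbor matching between two sets of descriptors
    (one row per feature). A pair (i, j) is kept only if j is the best
    match for i in B *and* i is the best match for j in A."""
    sim = desc_a @ desc_b.T                 # similarity matrix
    a_to_b = sim.argmax(axis=1)             # best B-match for each A-row
    b_to_a = sim.argmax(axis=0)             # best A-match for each B-row
    return [(int(i), int(j)) for i, j in enumerate(a_to_b) if b_to_a[j] == i]

# Toy descriptors: rows 0 and 1 of A correspond to rows 1 and 0 of B.
a = np.eye(3)
b = a[[1, 0, 2]]
print(mutual_nn_matches(a, b))  # [(0, 1), (1, 0), (2, 2)]
```

The mutual check is the standard way to suppress one-sided, ambiguous matches before geometric verification.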

2. Semantic + Geometry Joint Models

SAM (ICCV 2023, Meta AI) / SAM 2 (2024, Meta AI)

  • Video / 3D Stream → SAM 2 Engine → Consistent 2D/3D/4D Segmentation

PanSt3R (ICCV 2025)

  • Multi-View Images → Fuse Masks → 3D Segmented Scene

4D Panoptic Extensions (CVPR 2024, Ego-Exo4D)

  • Video → 3D Panoptic + Time → 4D Reconstruction



Why FFN


| Task | FFN (Feed-Forward) | Transformer |
| --- | --- | --- |
| 3D Semantic/Instance Segmentation | PointNet/PointNet++ (CVPR '17, FFN-style MLP), Flood-Filling Networks (NeurIPS '17) | Point Transformer (NeurIPS '21), 3DETR (ICCV '21), Point-BERT (CVPR '22) |
| 4D Semantic/Instance Segmentation | Spatio-temporal FFN (MICCAI '19), dynamic CT/MRI segmentation (biomedical imaging) | PanSt3R (ICCV '25, ETH), MonST3R ('25, Meta+ETH) |
| Industry (3D) | Google Connectomics (3D neuron segmentation), NVIDIA medical imaging | Meta Horizon (AR/VR), Tesla/Waymo/NVIDIA Drive (LiDAR segmentation) |
| Industry (4D) | NVIDIA Isaac (real-time 4D perception / obstacle segmentation) | Waymo 4D Panoptic, Meta Reality Labs (Aria: AR/VR dynamic segmentation) |
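The core of the PointNet-style FFN column is a shared per-point MLP followed by a symmetric pooling operation, which makes the output invariant to point ordering. A minimal sketch (weights and feature sizes are hypothetical, not the published architecture):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shared-MLP weights: 3-D input points -> 16-D features.
W, bias = rng.normal(size=(3, 16)), np.zeros(16)

def pointnet_global_feature(points: np.ndarray) -> np.ndarray:
    """PointNet-style feed-forward encoder: the same ReLU MLP is applied
    to every point, then a symmetric max-pool aggregates over points."""
    h = np.maximum(points @ W + bias, 0.0)  # shared per-point MLP
    return h.max(axis=0)                    # order-invariant global feature

cloud = rng.normal(size=(128, 3))
f1 = pointnet_global_feature(cloud)
f2 = pointnet_global_feature(cloud[::-1])   # same points, reversed order
assert np.allclose(f1, f2)                  # permutation invariance
```

The max-pool is what makes this a set function: no attention or pairwise interaction is needed, which is the main efficiency argument for FFN-style designs.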




Some Other References

2024 - AGILE3D

2016 - COLMAP 1 - baseline 1

2025 - COLMAP 2 - baseline 2



3. 2D


2025 - DINOv3 - checkpoints






Some Other topics


2024 - Soft MoE

2025 - Probabilistic Methods for Monocular 3D Human Reconstruction

2025 - minFM: Minimal Flow Matching

NanoGPT (124M) in 3 minutes

Aug 2025 - Proxies Could Be The Key To Interacting With Physical Objects In Mixed Reality



World Models / Reality Proxy


Machine Learning Street Talk

3D Gaussian Splatting


Topics


[Implicit 3D Representations]

2021 - D-NeRF

2025 - TetWeave

C++ lib repo - toolkit



Shape Modeling


2025 - TetSphere Splatting: Representing High-Quality Geometry with Lagrangian Volumetric Meshes





3D Reconstruction

| Dimension | 3D Gaussian Splatting | Explicit Grids/Voxels (e.g., Instant-NGP) | Tensor Factorization (TensoRF) | Implicit SDF (NeuS family) | Dynamic/4D Scenes (Tensor4D) |
| --- | --- | --- | --- | --- | --- |
| Speed | Near real-time | Real-time to ultra-real-time | Real-time to near real-time | Moderate to slow | Moderate |
| Storage | Moderate (tens of MB) | High (hundreds of MB) | Low (a few MB) | Low–moderate (a few MB) | Low–moderate |
| Geometry Quality | Continuous, high-fidelity | Discrete, voxelized feel | Good | Excellent (smooth meshes) | As static, plus temporal coherence |
| Color/Lighting | Per-splat color | Included (via textures or per-voxel) | Usually includes color | No (needs separate texture/renderer) | Matches static, adds time dimension |
| Ease of Use | Simple model, flexible multi-source fusion | Mature pipelines, relies on explicit structures | Requires tensor math | Requires supervised SDF training / volumetric rendering | Requires temporal decomposition design |



Mesh-VAE World                          Implicit Geometry World
═══════════════════════════════         ══════════════════════════════════
Mold Shape  →  Fill Cream  →            Pour Batter → Let Shape Form →
Keep Shape  →  Adjust Icing             Implicitly Shape via Function
(Topology)     (Latent Codes)           (SDF / NeRF Fields)
     ↓                ↓                         ↓
┌────────────┐  ┌────────────┐           ┌────────────┐  ┌────────────────┐
│ Cake Mold  │→ │ Cream Code │    vs.    │  Batter    │→ │ Shape Function │
│ (Mesh Topo)│  │ (Latent z) │           │ (No Mesh)  │  │ f(x) → Geometry│
└────────────┘  └────────────┘           └────────────┘  └────────────────┘
     ↓                ↓                         ↓               ↓
Consistent Shape   Editable Details         Any Shape      Learned Surface
Fixed Faces        Vertex Offsets           Continuous     Surface = f(x)=0


Hybrid models:
1. Use Mesh-VAE to encode coarse shape β†’ condition NeRF/SDF to model fine detail
2. Combine structural control (mesh) with detail realism (fields)

🍨 NeRF = Gelato Machine with View-Conditioned Flavor Control
πŸ—οΈ SDF = Invisible Sculptor Guided by Distance and Space Curvature



Visual Computing

| Tuesday (Topic) | Thursday (Topic) |
| --- | --- |
| Introduction to SfM: overview of Structure-from-Motion, applications in photogrammetry, robotics, AR/VR | The Multi-View Problem: from 2D images to 3D geometry, role of camera models |
| Image Features I: feature detection (SIFT, ORB) | Image Features II: feature description and matching |
| Epipolar Geometry: essential matrix, fundamental matrix | RANSAC & Robust Estimation: outlier rejection in correspondences |
| Camera Pose Estimation I: PnP problem, intrinsics vs. extrinsics | Camera Pose Estimation II: homography, motion from two views |
| Triangulation I: linear triangulation methods | Triangulation II: non-linear triangulation and uncertainty |
| Incremental SfM: sequential addition of cameras, growing reconstruction | Global SfM: joint optimization across all cameras |
| Bundle Adjustment I: definition and reprojection error | Bundle Adjustment II: nonlinear least squares, Levenberg–Marquardt optimization |
| Sparse vs. Dense Reconstruction: difference between sparse SfM and dense MVS | Multi-View Stereo (MVS): PatchMatch, depth-map fusion |
| Structure Representation: point clouds, meshes, voxel grids | Surface Reconstruction: Poisson surface reconstruction and variants |
| SfM in Practice I: COLMAP basics, input images, output formats | SfM in Practice II: COLMAP visualization and debugging reconstructions |
| Limitations of Traditional SfM: drift, loop closure, scalability issues | Robustness & Failures: low-texture scenes, repetitive patterns, robustness strategies |
| Extensions I (Dynamic Scenes): non-rigid SfM, motion segmentation | Extensions II (Large-Scale SfM): city-scale and aerial 3D reconstruction |
| Learning-based SfM I: deep feature matching (SuperGlue, LoFTR) | Learning-based SfM II: neural reconstruction pipelines (DUSt3R, VGGT) |
| Future of SfM: from optimization-based to transformer-based methods | SfM vs. VGGT: COLMAP vs. VGGT, comparison of pros and cons |
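Two of the topics in this schedule, linear triangulation and the reprojection it inverts, fit in a few lines of numpy. A minimal DLT sketch with toy projective cameras (an illustration, not COLMAP's implementation):

```python
import numpy as np

def triangulate_dlt(P1, P2, x1, x2):
    """Linear (DLT) triangulation: stack the cross-product constraints
    x ~ P X from both views and solve the homogeneous system via SVD."""
    A = np.stack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]                      # null-space vector = homogeneous point
    return X[:3] / X[3]             # dehomogenize

def project(P, X):
    """Pinhole reprojection of a 3-D point into normalized image coords."""
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

# Two simple cameras: identity pose, and a 1-unit baseline along x.
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])
X_true = np.array([0.5, 0.2, 4.0])

X_est = triangulate_dlt(P1, P2, project(P1, X_true), project(P2, X_true))
assert np.allclose(X_est, X_true)
```

With noisy observations the SVD solution minimizes an algebraic (not geometric) error, which is why the schedule follows up with non-linear triangulation and bundle adjustment.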





References 1


Frontiers in AI Research (2025)

  1. Efficient Multimodal Alignment & Generation
    • Key Results:
      • CLIPDraw++ (NeurIPS '24): unified vision–language alignment
      • Video-LLaMA (ICLR '25): zero-shot text-to-video generation
    • Challenges: real-time deployment, fine-grained controllability, safety/robustness

πŸ“ 2. Long-Term Temporal & Structural Consistency

  • Key Results:
    • FlowFormer (CVPR '25): flow-matching for video coherence
    • VideoMamba ('25)
    • MemoryNeRF (NeurIPS '24): implicit scene memory across seconds
  • Opportunities:
    • scalable frame-level memory modules
    • layered geometric+semantic caching
    • dynamic scene understanding

πŸ“ 3. Self-Supervised Learning from Extreme Sparsity

  • Key Results:
    • SparseMAE (ICCV '23): masked autoencoding with <0.1% of tokens
    • Contrastive-Sparse (ICLR '24): adaptive masking focused on high-entropy regions
  • Goals:
    • near-fully-supervised performance with ‰-level labels
    • unified multi-task pretraining (classification, detection, generation)
  4. Differentiable Physics & Hybrid Simulation
    • Key Results:
      • DiffPhys (NeurIPS '24): end-to-end differentiable physics engine
      • FluidNeRF (CVPR '25): fluid simulation within the NeRF framework
    • Directions:
      • trainable raytracing and material modules
      • learned+classical simulator hybrids
      • transferable "physical basis" representations
  5. Verifiable Robustness & Explainable Security
    • Key Results:
      • Certified Diffusion Robustness (ICLR '25)
      • Provable Transformer Defenses (NeurIPS '24)
    • Imperatives:
      • certified adversarial bounds
      • causal traceability in generation/decision chains
      • end-to-end system-level trust guarantees
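The adaptive-masking idea in item 3 (focus the mask budget on high-entropy regions) can be sketched in a few lines; `keep_high_entropy_patches` and the per-patch token distributions are hypothetical illustrations of the general scheme, not any paper's code:

```python
import numpy as np

def keep_high_entropy_patches(patch_probs: np.ndarray, keep_ratio: float):
    """Score each patch by the entropy of its token distribution and keep
    only the top fraction; everything else is masked out for training."""
    eps = 1e-12
    entropy = -(patch_probs * np.log(patch_probs + eps)).sum(axis=-1)
    k = max(1, int(keep_ratio * len(entropy)))
    return np.argsort(entropy)[::-1][:k]    # indices of kept patches

# Patch 1 is uniform (high entropy); patches 0 and 2 are more predictable.
probs = np.array([[0.98, 0.01, 0.01],
                  [1/3, 1/3, 1/3],
                  [0.8, 0.1, 0.1]])
kept = keep_high_entropy_patches(probs, keep_ratio=1/3)
assert kept.tolist() == [1]
```

The point of the heuristic: low-entropy patches are cheap to reconstruct, so spending the visible-token budget on uncertain regions gives a harder, more informative pretext task.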

πŸ“ 1. DiT (Diffusion Transformer)

  • Overview: Combines Transformer context modeling with diffusion denoising
  • Examples
    1. KeyFace – speech-driven face animation via stepwise denoising
    2. DiffLocks – high-fidelity hair generation
    3. Pippo – multi-view rendering with geometric and texture coherence
  • Benefit: Maintains character appearance/style across shots and supports conditional, coherent animation
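The "stepwise denoising" these models rely on inverts a closed-form forward noising process. A minimal generic DDPM-style sketch (`q_sample` is a hypothetical name; real schedules and conditioning are model-specific):

```python
import numpy as np

def q_sample(x0: np.ndarray, alpha_bar_t: float, noise: np.ndarray) -> np.ndarray:
    """Closed-form forward diffusion: x_t = sqrt(a_bar_t)*x0 + sqrt(1-a_bar_t)*noise.
    A DiT-style network is trained to predict `noise` from x_t plus
    conditioning (speech, identity, views), enabling stepwise denoising."""
    return np.sqrt(alpha_bar_t) * x0 + np.sqrt(1.0 - alpha_bar_t) * noise

rng = np.random.default_rng(0)
x0 = rng.normal(size=(4,))
eps = rng.normal(size=(4,))

assert np.allclose(q_sample(x0, 1.0, eps), x0)    # start: clean signal
assert np.allclose(q_sample(x0, 0.0, eps), eps)   # end: pure noise
```

The Transformer backbone enters as the denoiser operating on tokenized latents, which is where the cross-shot context modeling comes from.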

πŸ“ 2. Diadic Models

  • Concept: Model both speaking and listening behaviors for interactive avatars
  • Examples
    • INFP / DualTalk: dual-branch networks for speaker lip sync and listener micro-expressions
  • Insight: Ensures consistent identity/style in extended dialogues by modeling two-way interaction

3. Priors

  • Synthetic Priors (GASP, SynShot)
    • Generate "pseudo-real" head avatars (poses, expressions, lighting) to enrich training data
    • Improves generalization to extreme poses and rare expressions
  • Diffusion-based Priors (CAP4D, GAF)
    • Use pretrained diffusion models to produce high-quality 3D avatars or dynamic sequences
    • Accelerates multi-view/multi-expression data generation and boosts video consistency

4. Implications

  • Architecture: Adopt DiT's diffusion-Transformer design for cross-scene real-face rendering
  • Interaction Consistency: Integrate dyadic modeling to handle speaking and listening coherently
  • Memory Extension: Add a latent memory module to preserve character traits across sessions



References 2


4D

πŸ“ 2025 - MonST3R: A Simple Approach for Estimating Geometry in the Presence of Motion

2025 - Shape of Motion

2D

[DINOv3]

[SAM 2]


3D

2025 - VGGT

2023 - OpenScene

2024 - Segment3D

2023 - AGILE3D

[COLMAP], [GLOMAP]




