2025 - Master Thesis 2
A Feed-Forward Network for 4D Semantic and Instance Segmentation
Topics
Key Contributions
An FFN for 4D Segmentation - Semantic and Instance
Benchmarks and SOTAs
1. 4D
2. 3D
2025 - MonST3R: A Simple Approach for Estimating Geometry in the Presence of Motion
1. Geometry-Centric 3D Models
DUSt3R (CVPR 2024)
- 2D Images → Feature Matching → 3D Structure
MASt3R (ECCV 2024)
- Images → 3D-Aware Matching → Precise Geometry
VGGT (CVPR 2025)
- Image Sequences → Geometry-Grounded Attention → 3D Pose & Structure
2. Semantic + Geometry Joint Models
SAM (ICCV 2023, Meta AI) / SAM 2 (2024, Meta AI)
- Video / 3D Stream → SAM 2 Engine → Consistent 2D/3D/4D Segmentation
PanSt3R (ICCV 2025)
- Multi-View Images → Fuse Masks → 3D Segmented Scene
4D Panoptic Extensions (CVPR 2024, Ego-Exo4D)
- Video → 3D Panoptic + Time → 4D Reconstruction
Why FFN
Task | FFN (Feed-Forward) | Transformer |
---|---|---|
3D Semantic/Instance Segmentation | PointNet/PointNet++ (CVPR '17, FFN-style MLP), Flood-Filling Networks (NeurIPS '17) | Point Transformer (ICCV '21), 3DETR (ICCV '21), Point-BERT (CVPR '22) |
4D Semantic/Instance Segmentation | Spatio-temporal FFN (MICCAI '19), dynamic CT/MRI segmentation (biomedical imaging) | PanSt3R (ICCV '25), MonST3R (CVPR '25, Meta+ETH) |
Industry (3D) | Google Connectomics (3D neuron segmentation), NVIDIA medical imaging | Meta Horizon (AR/VR), Tesla/Waymo/NVIDIA Drive (LiDAR segmentation) |
Industry (4D) | NVIDIA Isaac (real-time 4D perception / obstacle segmentation) | Waymo 4D Panoptic, Meta Reality Labs (Aria: AR/VR dynamic segmentation) |
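As a concrete sense of what "FFN-style" means in the PointNet column: each point passes through a shared MLP, a symmetric max-pool builds a global feature, and a per-point head emits class logits. A minimal NumPy sketch with random weights and illustrative layer widths (64, 128), not the actual PointNet implementation:

```python
import numpy as np

def pointnet_seg_forward(points, n_classes=4, seed=0):
    """Toy PointNet-style forward pass: a shared per-point MLP, a
    max-pooled global feature, and a per-point classification head.
    points: (N, 3) xyz array. Returns (N, n_classes) logits.
    Weights are random; layer widths are illustrative only."""
    rng = np.random.default_rng(seed)
    relu = lambda a: np.maximum(a, 0.0)
    # Shared MLP lifts every point independently: 3 -> 64 -> 128.
    W1 = rng.normal(size=(3, 64))
    W2 = rng.normal(size=(64, 128))
    local = relu(relu(points @ W1) @ W2)              # (N, 128)
    # Symmetric max-pool: an order-invariant global shape descriptor.
    global_feat = local.max(axis=0)                   # (128,)
    # Concatenate global context onto each point for segmentation.
    fused = np.concatenate(
        [local, np.broadcast_to(global_feat, local.shape)], axis=1)  # (N, 256)
    W3 = rng.normal(size=(256, n_classes))
    return fused @ W3                                 # per-point logits

pts = np.random.default_rng(1).normal(size=(100, 3))
logits = pointnet_seg_forward(pts)
print(logits.shape)  # (100, 4)
```

Because the only cross-point operation is the max-pool, the whole network is permutation-equivariant over points, which is the property that makes this feed-forward design suitable for unordered point clouds.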
Some Other References
2016 - COLMAP 1 - baseline 1
2025 - COLMAP 2 - baseline 2
3. 2D
Some Other topics
2025 - Probabilistic Methods for Monocular 3D Human Reconstruction
2025 - minFM: Minimal Flow Matching
Aug 2025 - Proxies Could Be The Key To Interacting With Physical Objects In Mixed Reality
World Models / Reality Proxy
Topics
[Implicit 3D Representations]
Shape Modeling
2025 - TetSphere Splatting: Representing High-Quality Geometry with Lagrangian Volumetric Meshes
3D Reconstruction
Dimension | 3D Gaussian Splatting | Explicit Grids/Voxels (e.g., Instant-NGP) | Tensor Factorization (TensoRF) | Implicit SDF (NeuS family) | Dynamic/4D Scenes (Tensor4D) |
---|---|---|---|---|---|
Speed | Near real-time | Real-time to ultra-real-time | Real-time to near real-time | Moderate to slow | Moderate |
Storage | Moderate (tens of MB) | High (hundreds of MB) | Low (a few MB) | Low to moderate (a few MB) | Low to moderate |
Geometry Quality | Continuous, high-fidelity | Discrete, voxelized feel | Good | Excellent (smooth meshes) | Same as the static case, plus temporal coherence |
Color/Lighting | Per-splat color | Included (via textures or per-voxel values) | Usually includes color | No (needs a separate texture/renderer) | Matches the static case, adds the time dimension |
Ease of Use | Simple model, flexible multi-source fusion | Mature pipelines, relies on explicit structures | Requires tensor math | Requires supervised SDF training / volume rendering | Requires a temporal decomposition design |
Mesh-VAE World                           Implicit Geometry World
──────────────────────────────           ──────────────────────────────────
Mold Shape → Fill Cream                  Pour Batter → Let Shape Form →
Keep Shape → Adjust Icing                Implicitly Shape via Function
(Topology)    (Latent Codes)             (SDF / NeRF Fields)
       │                                        │
┌────────────┐   ┌────────────┐        ┌────────────┐   ┌──────────────────┐
│ Cake Mold  │ + │ Cream Code │  vs.   │   Batter   │ → │ Shape Function   │
│ (Mesh Topo)│   │ (Latent z) │        │ (No Mesh)  │   │ f(x) → Geometry  │
└────────────┘   └────────────┘        └────────────┘   └──────────────────┘
       │               │                      │                  │
Consistent Shape  Editable Details        Any Shape        Learned Surface
Fixed Faces       Vertex Offsets          Continuous       Surface = f(x)=0
Hybrid models:
1. Use Mesh-VAE to encode coarse shape → condition NeRF/SDF to model fine detail
2. Combine structural control (mesh) with detail realism (fields)
NeRF = Gelato Machine with View-Conditioned Flavor Control
SDF = Invisible Sculptor Guided by Distance and Space Curvature
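The "Shape Function f(x) → Geometry" side of the diagram can be made concrete with the simplest SDF, a sphere; the radius and query points below are arbitrary illustrative values:

```python
import numpy as np

def sphere_sdf(x, radius=1.0):
    """Signed distance to a sphere centered at the origin:
    negative inside, zero on the surface, positive outside.
    The surface is the zero level set f(x) = 0."""
    return np.linalg.norm(x, axis=-1) - radius

# Sign classifies inside vs. outside; zero marks the surface.
pts = np.array([[0.0, 0.0, 0.0],   # center  -> -1.0 (inside)
                [1.0, 0.0, 0.0],   # surface ->  0.0
                [2.0, 0.0, 0.0]])  # outside -> +1.0
print(sphere_sdf(pts))  # [-1.  0.  1.]

def sdf_normal(x, eps=1e-5):
    """Numerical gradient of the SDF: its normalized direction is the
    surface normal, which is what volume renderers shade with."""
    grad = np.array([(sphere_sdf(x + eps * e) - sphere_sdf(x - eps * e)) / (2 * eps)
                     for e in np.eye(3)])
    return grad / np.linalg.norm(grad)
```

A learned SDF (NeuS-style) replaces this analytic function with a neural network but keeps exactly the same interface: query a point, get a signed distance, extract the surface at zero.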
Visual Computing
Tuesday (Topic) | Thursday (Topic) |
---|---|
Introduction to SfM: overview of Structure-from-Motion; applications in photogrammetry, robotics, AR/VR. | The Multi-View Problem: from 2D images to 3D geometry; the role of camera models. |
Image Features I: feature detection (SIFT, ORB). | Image Features II: feature description and matching. |
Epipolar Geometry: essential matrix, fundamental matrix. | RANSAC & Robust Estimation: outlier rejection in correspondences. |
Camera Pose Estimation I: the PnP problem; intrinsics vs. extrinsics. | Camera Pose Estimation II: homography, motion from two views. |
Triangulation I: linear triangulation methods. | Triangulation II: non-linear triangulation and uncertainty. |
Incremental SfM: sequential addition of cameras, growing reconstruction. | Global SfM: joint optimization across all cameras. |
Bundle Adjustment I: definition and reprojection error. | Bundle Adjustment II: nonlinear least squares, Levenberg-Marquardt optimization. |
Sparse vs. Dense Reconstruction: difference between sparse SfM and dense MVS. | Multi-View Stereo (MVS): PatchMatch, depth-map fusion. |
Structure Representation: point clouds, meshes, voxel grids. | Surface Reconstruction: Poisson surface reconstruction and variants. |
SfM in Practice I: COLMAP basics, input images, output formats. | SfM in Practice II: COLMAP visualization and debugging reconstructions. |
Limitations of Traditional SfM: drift, loop closure, scalability issues. | Robustness & Failures: low-texture scenes, repetitive patterns, robustness strategies. |
Extensions I, Dynamic Scenes: non-rigid SfM, motion segmentation. | Extensions II, Large-Scale SfM: city-scale and aerial 3D reconstruction. |
Learning-based SfM I: deep feature matching (SuperGlue, LoFTR). | Learning-based SfM II: neural reconstruction pipelines (DUSt3R, VGGT). |
Future of SfM: from optimization-based to transformer-based methods. | SfM vs. VGGT: COLMAP vs. VGGT, comparison of pros and cons. |
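The triangulation rows above can be grounded with the standard linear (DLT) construction: each observation x = P X contributes two rows to a homogeneous system A X = 0, solved by SVD. The projection matrices and 3D point below are synthetic illustrative values, not COLMAP output:

```python
import numpy as np

def triangulate_dlt(P1, P2, x1, x2):
    """Linear (DLT) triangulation from two views.
    P1, P2: (3, 4) projection matrices; x1, x2: (2,) pixel coordinates.
    The 3D point is the right singular vector of the stacked system."""
    A = np.stack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]            # null-space direction of A
    return X[:3] / X[3]   # dehomogenize

# Synthetic two-view setup: identity camera and a camera shifted along x.
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])
X_true = np.array([0.5, 0.2, 4.0])

def project(P, X):
    """Pinhole projection of a 3D point to normalized pixel coordinates."""
    h = P @ np.append(X, 1.0)
    return h[:2] / h[2]

X_est = triangulate_dlt(P1, P2, project(P1, X_true), project(P2, X_true))
print(X_est)  # recovers X_true = [0.5, 0.2, 4.0] (noise-free case)
```

With noisy observations this linear estimate is only a starting point; the non-linear refinement in the following lecture minimizes reprojection error directly, which is also the inner objective of bundle adjustment.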
References 1
Frontiers in AI Research (2025)
1. Efficient Multimodal Alignment & Generation
Key Results:
- CLIPDraw++ (NeurIPS '24): unified vision-language alignment
- Video-LLaMA (ICLR '25): zero-shot text-to-video generation
- Challenges: real-time deployment, fine-grained controllability, safety/robustness
2. Long-Term Temporal & Structural Consistency
Key Results:
- FlowFormer (CVPR '25): flow-matching for video coherence
- VideoMamba ('25)
- MemoryNeRF (NeurIPS '24): implicit scene memory across seconds
Opportunities:
- scalable frame-level memory modules
- layered geometric + semantic caching
- dynamic scene understanding
3. Self-Supervised Learning from Extreme Sparsity
Key Results:
- SparseMAE (ICCV '23): masked autoencoding with <0.1% of tokens visible
- Contrastive-Sparse (ICLR '24): adaptive masking focused on high-entropy regions
Goals:
- near-fully-supervised performance with ‰-level labels
- unified multi-task pretraining (classification, detection, generation)
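To make the "<0.1% of tokens" regime concrete, here is a minimal random keep-mask of the kind masked autoencoders start from; the token count and keep ratio are illustrative, and SparseMAE-style methods choose masks adaptively rather than uniformly:

```python
import numpy as np

def sparse_mask(num_tokens, keep_ratio=0.001, seed=0):
    """Boolean keep-mask for masked autoencoding at extreme sparsity:
    only keep_ratio of tokens stay visible to the encoder; the rest
    must be reconstructed by the decoder. At least one token is kept."""
    rng = np.random.default_rng(seed)
    n_keep = max(1, int(round(num_tokens * keep_ratio)))
    keep = np.zeros(num_tokens, dtype=bool)
    keep[rng.choice(num_tokens, size=n_keep, replace=False)] = True
    return keep

# E.g. a long video token stream: 50 visible tokens out of 50000.
mask = sparse_mask(50_000, keep_ratio=0.001)
print(int(mask.sum()), mask.size)  # 50 50000
```

The practical point is the compute asymmetry: the encoder only ever sees the kept tokens, so training cost scales with the keep ratio rather than with the full sequence length.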
4. Differentiable Physics & Hybrid Simulation
Key Results:
- DiffPhys (NeurIPS '24): end-to-end differentiable physics engine
- FluidNeRF (CVPR '25): fluid simulation within the NeRF framework
Directions:
- trainable ray-tracing and material modules
- hybrids of learned and classical simulators
- transferable "physical basis" representations
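A toy illustration of the "end-to-end differentiable" idea: a spring simulator whose rollout is a smooth function of its stiffness, fitted here by finite-difference gradient descent. A real engine like the DiffPhys entry above would backpropagate through the solver instead; all constants are illustrative:

```python
import numpy as np

def rollout(k, steps=200, dt=0.01, x0=1.0, v0=0.0):
    """Explicit-Euler rollout of a unit mass on a spring (F = -k*x).
    The final position is a smooth function of the stiffness k, so the
    simulator can sit inside a gradient-based training loop."""
    x, v = x0, v0
    for _ in range(steps):
        v += -k * x * dt
        x += v * dt
    return x

def grad_wrt_k(k, target, eps=1e-6):
    """Central finite-difference gradient of (rollout(k) - target)^2.
    A differentiable physics engine would obtain this by backprop."""
    loss = lambda s: (rollout(s) - target) ** 2
    return (loss(k + eps) - loss(k - eps)) / (2 * eps)

# Fit the stiffness so the rollout hits a target final position
# (generated here by a "true" stiffness of 12.0).
target = rollout(12.0)
k = 5.0
initial_loss = (rollout(k) - target) ** 2
for _ in range(500):
    k -= 0.05 * grad_wrt_k(k, target)
final_loss = (rollout(k) - target) ** 2
print(final_loss < initial_loss)  # True: gradient descent reduced the loss
```

The same pattern scales up: as long as every solver step is differentiable (or admits a gradient estimate), physical parameters, materials, and even ray-tracing modules become trainable alongside the network.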
5. Verifiable Robustness & Explainable Security
Key Results:
- Certified Diffusion Robustness (ICLR '25)
- Provable Transformer Defenses (NeurIPS '24)
Imperatives:
- certified adversarial bounds
- causal traceability in generation/decision chains
- end-to-end system-level trust guarantees
1. DiT (Diffusion Transformer)
- Overview: combines Transformer context modeling with diffusion denoising
Examples:
- KeyFace: speech-driven face animation via stepwise denoising
- DiffLocks: high-fidelity hair generation
- Pippo: multi-view rendering with geometric and texture coherence
- Benefit: maintains character appearance/style across shots and supports conditional, coherent animation
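The "stepwise denoising" these systems share is the DDPM-style reverse loop sketched below; the noise schedule is illustrative, and the oracle noise predictor stands in for the actual DiT network (which would condition on speech, views, etc.):

```python
import numpy as np

def ddpm_denoise(x_T, betas, predict_noise, seed=0):
    """Toy DDPM reverse loop: at each step the model predicts the noise,
    we form the posterior mean, and add scaled Gaussian noise (except at
    the final step). predict_noise(x, t) stands in for the network."""
    rng = np.random.default_rng(seed)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    x = x_T
    for t in range(len(betas) - 1, -1, -1):
        eps = predict_noise(x, t)
        mean = (x - betas[t] / np.sqrt(1 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        x = mean + (np.sqrt(betas[t]) * rng.normal(size=np.shape(x)) if t > 0 else 0.0)
    return x

# With an oracle that knows the clean sample x0, the loop recovers it
# exactly from pure noise, which verifies the update equations.
betas = np.linspace(1e-4, 0.02, 50)
alpha_bars = np.cumprod(1.0 - betas)
x0 = np.array([0.7, -0.3])
oracle = lambda x, t: (x - np.sqrt(alpha_bars[t]) * x0) / np.sqrt(1 - alpha_bars[t])
x_hat = ddpm_denoise(np.random.default_rng(1).normal(size=2), betas, oracle)
print(np.round(x_hat, 3))  # recovers x0 = [0.7, -0.3]
```

In a DiT the predictor is a Transformer over image/video tokens, which is what lets the same denoising loop keep appearance and style coherent across frames and shots.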
2. Dyadic Models
- Concept: model both speaking and listening behaviors for interactive avatars
Examples:
- INFP / DualTalk: dual-branch networks for speaker lip sync and listener micro-expressions
- Insight: ensures consistent identity/style in extended dialogues by modeling two-way interaction
3. Priors
Synthetic Priors (GASP, SynShot)
- Generate "pseudo-real" head avatars (poses, expressions, lighting) to enrich training data
- Improve generalization to extreme poses and rare expressions
Diffusion-based Priors (CAP4D, GAF)
- Use pretrained diffusion models to produce high-quality 3D avatars or dynamic sequences
- Accelerate multi-view/multi-expression data generation and boost video consistency
4. Implications
- Architecture: adopt DiT's diffusion-Transformer design for cross-scene, realistic face rendering
- Interaction Consistency: integrate dyadic modeling to handle speaking and listening coherently
- Memory Extension: add a latent memory module to preserve character traits across sessions
References 2
4D
2025 - MonST3R: A Simple Approach for Estimating Geometry in the Presence of Motion
2D
[DINOv3]
[SAM 2]
3D
[COLMAP], [GLOMAP]