4D Vision Foundation Models - 25


Topics


Artificial Neural Networks


Colah’s Blog





3D Vision & Multi-modality x Foundation Models





References


2025 - FAM-HRI: Foundation-Model Assisted Multi-Modal Human-Robot Interaction Combining Gaze and Speech


| Model | Input Data Type | Domain | Extendable to 3D? |
| ----- | --------------- | ------ | ----------------- |
| DINOv2 | 2D image | 2D CV | Yes: can serve as a visual token encoder for NeRF or 3D Transformers |
| Grounding DINO | 2D image + text | 2D CV | No: not natively designed for 3D bounding boxes |
| SAM2 (Segment Anything 2) | 2D image | 2D CV | Yes: extends to video segmentation and can integrate depth |
| Segment Anything 3D | Point cloud + RGB | 3D Vision | Yes: native 3D segmentation model |
| OpenScene / OpenMask3D | RGB + depth | 3D + 2D CV | Yes: multimodal cross-domain visual perception |



| Application Task | Model Type | Recommended Models |
| ---------------- | ---------- | ------------------ |
| Eye Gaze + Image Analysis (e.g., FAM-HRI) | 2D CV | DINOv2 / SAM2 / Grounding DINO |
| Pick-and-Place Robot Manipulation | 2D → 3D Hybrid | CV-based 2D mask projected into 3D (e.g., SAM2 + RGB-D depth; see the sketch below) |
| 3D Reconstruction / Point Cloud Segmentation | 3DV | PointNeXt, Point-BERT, OpenScene |
| AR Glasses + Real-Time Semantic Understanding | Primarily 2D CV | SAM2 + DINOv2 + GPT (e.g., FAM-HRI stack) |
| SLAM / Robot Navigation | Strictly 3DV | ORB-SLAM3, MapTR, NeRF, MVSNet |
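For the Pick-and-Place row, here is a minimal sketch of lifting a 2D mask into 3D with RGB-D depth, assuming a pinhole camera; the intrinsics (`fx`, `fy`, `cx`, `cy`) and the synthetic mask/depth are illustrative placeholders, and a real pipeline would take the mask from SAM2.

```python
import numpy as np

def lift_mask_to_3d(mask: np.ndarray, depth: np.ndarray,
                    fx: float, fy: float, cx: float, cy: float) -> np.ndarray:
    """Back-project the pixels of a binary 2D mask into a 3D point cloud
    using per-pixel depth and pinhole intrinsics (camera frame)."""
    v, u = np.nonzero(mask)            # pixel rows (v) and columns (u) inside the mask
    z = depth[v, u]                    # metric depth at those pixels
    valid = z > 0                      # drop pixels with missing depth
    u, v, z = u[valid], v[valid], z[valid]
    x = (u - cx) * z / fx              # pinhole model: X = (u - cx) * Z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=1)  # (N, 3) points in the camera frame

# Toy usage with synthetic data (a real pipeline would use a SAM2 mask + RGB-D depth)
mask = np.zeros((480, 640), dtype=bool); mask[200:240, 300:340] = True
depth = np.full((480, 640), 1.5, dtype=np.float32)
points = lift_mask_to_3d(mask, depth, fx=525.0, fy=525.0, cx=319.5, cy=239.5)
print(points.shape)  # (1600, 3)
```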





References



1. Collatz Conjecture / 3n + 1 problem / hailstone problem


1950 - Lothar Collatz “Über die Differenzengleichung xₙ₊₁ = aₙ xₙ + bₙ”, Mathematische Nachrichten, Bd. 3 (1950).

2003 - The 3x+1 problem: An annotated bibliography (1963–1999) (sorted by author)

2017 - Collatz Conjecture in Color - Numberphile
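As a quick reference, a tiny sketch of the 3n + 1 ("hailstone") iteration; the conjecture states that the trajectory reaches 1 for every positive starting integer.

```python
def hailstone(n: int) -> list[int]:
    """Return the Collatz (3n + 1) trajectory of n, assuming it reaches 1."""
    assert n >= 1
    seq = [n]
    while n != 1:
        n = n // 2 if n % 2 == 0 else 3 * n + 1
        seq.append(n)
    return seq

print(hailstone(27))            # peaks at 9232 before falling back to 1
print(len(hailstone(27)) - 1)   # total stopping time of 27: 111 steps
```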



2.📍📍 Erwin: A Tree-based Hierarchical Transformer for Large-scale Physical Systems [ICML’25]

repo



3. Self-Distillation Loss + Diffusion Loss

Representation Robustness  ←─── TRADE‑OFF ───→  Training Efficiency & Reuse
          ▲                                         ▲
          │                                         │
  Self‑Distillation Loss                 Pretrained Priors (Stable Diffusion, DINOv2)
  • Enforces invariance across           • Inject rich 2D visual semantics
    views/augmentations                  • Accelerate convergence, improve transfer
  • Stabilizes training dynamics
          │                                         │
          ▼                                         ▼
                  Diffusion Loss (Denoising Objective)
        • Learn to reverse a noise process on 3D data (e.g., point clouds)
        • Yields generative and noise‑robust latent features
          │                                               │
          └──────────── PROPOSED INTEGRATION ─────────────┘
                               │
                               ▼
                 Unified, High‑Quality 3D Latent Space
     → Better downstream performance (classification, segmentation, reconstruction)
     → Strong generalization from 2D/3D synergy



4. Global Consistency under Local Constraints

2025 - On Weak Approximation of Reductive Groups over Higher Dimensional Function Fields


| Notion / Result | One-liner (math) | Key object / tool | AI/ML analogue | Actionable in AI/ML |
| --------------- | ---------------- | ----------------- | -------------- | ------------------- |
| Weak Approximation (WA) | Can one global point $g\in G(K)$ simultaneously approximate finitely many local targets in $\prod_{v\in S}G(K_v)$? | Linear algebraic group $G$ over $K=k(X)$ | Multi-view / multi-client consensus: fit one global model that matches several local constraints | Diagnose when a single global model can satisfy selected per-view/per-client specs (e.g., federated constraints, multi-sensor alignment) |
| Paper’s Core Move (Li–Liu–Pan, 2025) | Abelianize a non-abelian WA problem via a 2-term torus complex and duality; quantify exactly how/why WA fails. | Two-term complex $C=[T_{\mathrm{sc}}\to T]$; dual $\tilde C=[\tilde T\to \tilde T_{\mathrm{sc}}]$ | Introduce “dual variables” so feasibility/consistency becomes checkable | Turn “can we meet all local specs?” into certifiable obstructions (like KKT-style certificates for global inconsistency) |
| Perfect Global Duality | $\Sha^{1}(K,C)\times \Sha^{d+1}(K,\tilde C)\to \mathbb{Q}/\mathbb{Z}$ is a perfect pairing of finite groups. | Tate–Shafarevich groups of $C,\tilde C$ | Two complementary error ledgers that balance: structure vs. data | If either ledger is zero, certain global inconsistencies cannot occur; use as a pass/fail test before expensive training/fusion |
| Exact Sequence (Defect of WA) | $1\to\overline{G(K)}\to\prod_{v\in S}G(K_v)\to\Sha^{d+1}_S(K,\tilde C)^{*}\to\Sha^{1}(K,C)\to 0$ | Local/global cohomology + Pontryagin dual | Gap = certificate: leftover after best global fit equals a dual obstruction | If the dual group is nonzero, stop early: no single model will fit all chosen views; relax constraints or re-weight |
| Finiteness & Local Duality | Shows $\Sha^{d+1}_S(K,\tilde C)$ is finite; gives perfect local pairings $H^0(K_v,C)^{\wedge}\times H^{d+1}(K_v,\tilde C)\to\mathbb{Q}/\mathbb{Z}$. | Poitou–Tate over higher-local fields | Bounded inconsistency: the defect is small & enumerable | Enables bounded search/repair: enumerate minimal relaxations to regain feasibility |
| When WA holds | If $\Sha^{d+1}_\omega(K,\tilde C)=0=\Sha^{1}(K,C)$, then $G$ has WA. | Vanishing criteria under the paper’s hypotheses | Green light for consensus | Safe to commit to a single global parameterization across selected views/clients |
| Strong Approximation (SA) | Density in the restricted product away from $S$. | Adelic viewpoint | “Almost everywhere” fit | For large-scale pipelines, SA ≈ treat the global model as matching all but finitely many sites |
| Integral (S-integral) Approximation | Density of $X(\mathcal O_{K,S})$ in $\prod_{v\notin S}X(\mathcal O_v)$. | Integral models, Brauer–Manin (integral) | Hard constraints (discrete/quantized) | Use when parameters must be integer/quantized; predict when hard budgets make consensus impossible |
| Hasse Principle | Local solvability $\Rightarrow$ global solvability? | Brauer–Manin obstruction | Feasibility from all views | Quick feasibility triage before training or fusion |
| Torsors / Homogeneous Spaces | WA/HP for principal bundles and quotients reduce to the same duality framework. | $H^1(K,G)$, descent | Latent symmetry classes | For gauge-equivalent or symmetric models, certify global stitchability across local gauges |
| Glossary (AI mapping) | $v$: a view/client/site; $G(K_v)$: local feasible set; $G(K)$: single global hypothesis; closure $\overline{G(K)}$: best global you could ever reach; $\Sha$: obstruction group | Local ⇄ Global bridge | Read the exact sequence as a global-fit pipeline with a provable failure code | |



5



6





Articles


2024 - nGPT: Normalized Transformer with Representation Learning on the Hypersphere

2021 - Multimodal Neurons in Artificial Neural Networks

2018 - The Building Blocks of Interpretability

2008 - Kernel methods in machine learning




References

1. Contrastive Loss

1993 NIPS - Signature Verification using a “Siamese” Time Delay Neural Network

2006 CVPR - Dimensionality Reduction by Learning an Invariant Mapping

Implementations

[CLIP]
[DALL·E 3]



2. Fusion

2011 - Multimodal Deep Learning

2019 - ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks

2019 - UNITER: UNiversal Image-TExt Representation Learning





The Topic


**Self-Distillation + Diffusion**
    ↓
[ 3D Diffusion Network ]
    ←— combine point‑cloud denoising loss with DINOv2‑style self‑distillation loss
    ↓
Enhanced 3D features

**2D → 3D Distillation**
    ↓
[ Multi‑View 2D Encoder (Stable Diffusion / DINOv2) ]
    ↓
[ Feature Projection & Distillation ]
    ←— align 2D features to 3D points via multi‑view correspondences
    ↓
3D backbone embeddings

**Overall Goal**
    ↓
[ Unified 3D Representation ]
    ←— fuses high‑fidelity geometry (diffusion) + strong discriminative cues (distillation)
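A minimal sketch of the "Feature Projection & Distillation" step in the pipeline above, under simplifying assumptions: the 3D points are already expressed in the camera frame, the 2D encoder's feature map is given, and alignment uses a cosine loss. Tensor names, shapes, and the loss choice are illustrative, not a specific paper's recipe.

```python
import torch
import torch.nn.functional as F

def project_points(points_cam: torch.Tensor, K: torch.Tensor, hw: tuple) -> torch.Tensor:
    """Project (N, 3) camera-frame points with intrinsics K into normalized [-1, 1] coords."""
    uv = (K @ points_cam.T).T                     # (N, 3)
    uv = uv[:, :2] / uv[:, 2:3].clamp(min=1e-6)   # perspective divide -> pixel coords
    h, w = hw
    grid = torch.stack([2 * uv[:, 0] / (w - 1) - 1,
                        2 * uv[:, 1] / (h - 1) - 1], dim=-1)
    return grid                                    # (N, 2), ready for grid_sample

def distill_2d_to_3d(feat2d: torch.Tensor, points_cam: torch.Tensor,
                     feat3d: torch.Tensor, K: torch.Tensor) -> torch.Tensor:
    """Cosine-alignment loss between sampled 2D features and per-point 3D features."""
    _, c, h, w = feat2d.shape
    grid = project_points(points_cam, K, (h, w)).view(1, 1, -1, 2)   # (1, 1, N, 2)
    target = F.grid_sample(feat2d, grid, align_corners=True)          # (1, C, 1, N)
    target = target.squeeze(0).squeeze(1).T                           # (N, C)
    return 1 - F.cosine_similarity(feat3d, target.detach(), dim=-1).mean()

# Toy shapes: a frozen 2D encoder would produce feat2d; a 3D backbone produces feat3d.
feat2d = torch.randn(1, 384, 32, 32)                               # DINOv2-like patch features
points = torch.rand(1024, 3) + torch.tensor([0.0, 0.0, 1.0])       # points in front of the camera
feat3d = torch.randn(1024, 384, requires_grad=True)
K = torch.tensor([[32.0, 0.0, 16.0], [0.0, 32.0, 16.0], [0.0, 0.0, 1.0]])
loss = distill_2d_to_3d(feat2d, points, feat3d, K)
loss.backward()
```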


Explore 3D Representation Learning by Combining Self-Distillation Loss + Diffusion Loss


DINOv2

  • Hybrid discriminative objective based on DINO + iBOT
    • Image-level (DINO) - Two networks (a student and a teacher) process different strong/weak augmentations of the same image; a “prototype classification” cross-entropy loss is applied on the [CLS] token
    • Patch-level (iBOT) - The student receives masked patches and predicts the teacher’s output for the corresponding unmasked patches. Like Masked Image Modeling (MIM), it also uses classification-based prototype alignment


  • KoLeo regularization
    • A logarithmic penalty on each feature’s minimum distance to its neighbors within the batch encourages features to spread evenly over the sphere


  • Training a large model and then distilling
    • First, pre-train a ViT‑g/14 (student–teacher setup, ~1B parameters) with self-supervision for a large number of iterations on 142M images
    • Then distill the smaller models (ViT‑L, ViT‑B, ViT‑S) from it: use the same DINO objective, but replace the teacher with the frozen ViT‑g and train each student from scratch, which is fast and low-cost


-> Relying solely on self-supervised pre-training, DINOv2 obtains general, out-of-the-box visual features that are on par with or better than those from large-scale weak supervision (such as CLIP)
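A minimal sketch of the two DINOv2 ingredients described above, assuming prototype logits and batch features are already computed: the image-level DINO cross-entropy between the (centered, sharpened) teacher distribution and the student distribution, and the KoLeo nearest-neighbor regularizer. Temperatures, dimensions, and the centering variable are illustrative.

```python
import torch
import torch.nn.functional as F

def dino_loss(student_logits: torch.Tensor, teacher_logits: torch.Tensor,
              center: torch.Tensor, ts: float = 0.1, tt: float = 0.04) -> torch.Tensor:
    """Image-level DINO objective: cross-entropy between the (centered, sharpened)
    teacher prototype distribution and the student distribution."""
    teacher = F.softmax((teacher_logits - center) / tt, dim=-1).detach()
    log_student = F.log_softmax(student_logits / ts, dim=-1)
    return -(teacher * log_student).sum(dim=-1).mean()

def koleo_loss(features: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """KoLeo regularizer: penalize -log of each sample's nearest-neighbor distance,
    pushing L2-normalized features to spread uniformly on the sphere."""
    f = F.normalize(features, dim=-1)
    dist = torch.cdist(f, f)                                        # pairwise distances (B, B)
    mask = torch.eye(f.shape[0], dtype=torch.bool, device=f.device)
    nearest = dist.masked_fill(mask, float("inf")).min(dim=-1).values
    return -torch.log(nearest + eps).mean()

# Toy usage: two augmented views -> student/teacher prototype logits (B, K)
student_logits, teacher_logits = torch.randn(16, 4096), torch.randn(16, 4096)
center, features = torch.zeros(4096), torch.randn(16, 384)
total = dino_loss(student_logits, teacher_logits, center) + 0.1 * koleo_loss(features)
```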


Goals

  • Why pre-train 3D representations?
    • Goal - Before tackling 3D downstream tasks (such as segmentation, classification, and registration), train the model on large-scale unlabeled point clouds so it acquires universal, robust feature representations.

    • Challenges - Unlike images, point clouds have no natural pixel grid; they are sparse and unordered, requiring dedicated network architectures (PointNet, Point Transformer…) and task designs.


  • Self-Distillation’s “Representation Learning” Advantages
    • Principle - Let the model act as both a “teacher” and a “student” at the same time, aligning features with each other through different augmented views (or different model branches)
    • Effect - Methods such as DINOv2 can learn very discriminative features for downstream segmentation/classification, and do not rely on labels


  • Diffusion Loss’ “Generative” Advantages
    • Principle - During training, gradually add noise to the unlabeled point cloud and let the network learn to denoise at each noise level; the loss is typically the mean squared error between the predicted noise and the true noise
    • Effect - The network acquires both global and local generative capability and can capture high-fidelity details of the data distribution
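A minimal sketch of this denoising objective on raw point coordinates, assuming a DDPM-style forward process; `model` is a placeholder for any point-cloud denoiser (e.g., a PointNet-style network), not a specific architecture.

```python
import torch

def diffusion_loss(model, x0: torch.Tensor, alphas_cumprod: torch.Tensor) -> torch.Tensor:
    """DDPM-style noise-prediction loss on point clouds.
    x0: clean points (B, N, 3); alphas_cumprod: cumulative noise schedule (T,)."""
    b = x0.shape[0]
    t = torch.randint(0, alphas_cumprod.shape[0], (b,), device=x0.device)   # random timestep
    a_bar = alphas_cumprod[t].view(b, 1, 1)
    noise = torch.randn_like(x0)
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise                    # forward (noising) process
    pred = model(x_t, t)                                                    # network predicts the noise
    return torch.nn.functional.mse_loss(pred, noise)

# Toy usage: a linear beta schedule and a dummy stand-in for a real point denoiser
betas = torch.linspace(1e-4, 0.02, 1000)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)
dummy_model = lambda x_t, t: torch.zeros_like(x_t)
x0 = torch.randn(4, 1024, 3)
print(diffusion_loss(dummy_model, x0, alphas_cumprod))
```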


Improve 3D Pretraining Strategies by Combining Self-Distillation & Diffusion Losses

  • Combine the representation learning strengths of self-distillation models—such as DINOv2, which excels at segmentation and classification—with the high-fidelity feature capabilities of current generative diffusion models
  • Begin by fine-tuning a 3D point cloud diffusion model (e.g., PointDif) and integrating self-distillation losses inspired by DINOv2, or contrastive-like regularizers as introduced in Diffuse and Disperse
  • Then analyze the impact of these representation-enhancing losses on downstream tasks such as 3D segmentation and 3D classification
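One way to write the combined objective (the weight $\lambda$ and the exact form of the distillation term are open design choices, not fixed by the papers above):

\[
\mathcal{L}_{\text{total}}
  = \underbrace{\mathbb{E}_{t,\epsilon}\bigl[\bigl\|\epsilon_\theta(x_t,\,t)-\epsilon\bigr\|_2^2\bigr]}_{\text{diffusion / denoising loss}}
  \;+\;
  \lambda\,\underbrace{\mathrm{CE}\bigl(p_{\text{teacher}}(x),\,p_{\text{student}}(x)\bigr)}_{\text{self-distillation loss}}
\]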



Possible Improvements

2025 - Efficient Distillation of Classifier-Free Guidance using Adapters

  • 1. A More Efficient Self-Distillation
    • Freeze the backbone: keep the parameters θ of the original diffusion model (or self-distillation model) unchanged
    • Insert lightweight adapters: add small trainable modules ψ after key layers (e.g., a Transformer attention block or a PointNet MLP block)
    • Single-forward-pass self-distillation: the adapter learns to approximate the “teacher model + guidance mechanism” (in AGD it mimics Classifier-Free Guidance), so inference needs only one forward pass while retaining both the generative behavior and the distillation signal
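A minimal sketch of the frozen-backbone-plus-adapter setup, assuming a residual bottleneck adapter; the target the adapter learns to imitate (e.g., a guided teacher, as in AGD) would come from the distillation loss, which is omitted here. All module names are illustrative.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Lightweight bottleneck adapter inserted after a frozen block (residual form)."""
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down, self.up = nn.Linear(dim, bottleneck), nn.Linear(bottleneck, dim)
        self.act = nn.GELU()
        nn.init.zeros_(self.up.weight); nn.init.zeros_(self.up.bias)   # starts as identity

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))

class FrozenBlockWithAdapter(nn.Module):
    """Wrap a pretrained block: freeze its weights (theta) and train only the adapter (psi)."""
    def __init__(self, block: nn.Module, dim: int):
        super().__init__()
        self.block = block
        for p in self.block.parameters():
            p.requires_grad_(False)        # keep the backbone fixed
        self.adapter = Adapter(dim)        # the only trainable parameters

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.adapter(self.block(x))

# Toy usage: a "pretrained" MLP block stands in for an attention/MLP block of the backbone
block = nn.Sequential(nn.Linear(256, 256), nn.GELU(), nn.Linear(256, 256))
wrapped = FrozenBlockWithAdapter(block, dim=256)
out = wrapped(torch.randn(8, 256))
print([n for n, p in wrapped.named_parameters() if p.requires_grad])   # only adapter.* params
```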


2022 - Flow Matching for Generative Modeling

  • 2. Flow Matching Loss vs. Diffusion Loss
    • Advantages
    • More stable convergence and fewer iterations: without fighting the randomness of injected noise, flow matching often converges faster than score matching / DDPM
    • Fast sampling: Especially if you choose the OT path, you can generate samples with dozens of forward passes, which can theoretically reduce the number of inference steps by half or more
    • Challenges
    • Complexity of velocity field design: To achieve accurate mapping and ODE integration on unordered point clouds, additional engineering is required
    • Existing library support: most 3D diffusion frameworks implement only stochastic diffusion; switching to ODE integration means wiring in torchdiffeq or a similar component yourself
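A minimal sketch of a conditional flow-matching loss with the straight (OT-style) path $x_t = (1-t)\,x_0 + t\,x_1$, whose target velocity is $x_1 - x_0$; the velocity network is a placeholder, and sampling would still require an ODE solver (e.g., torchdiffeq), as noted above.

```python
import torch

def flow_matching_loss(v_net, x1: torch.Tensor) -> torch.Tensor:
    """Conditional flow matching with the linear path x_t = (1 - t) x0 + t x1,
    whose target velocity is simply x1 - x0 (x0 ~ N(0, I))."""
    x0 = torch.randn_like(x1)                               # noise sample
    t = torch.rand(x1.shape[0], device=x1.device).view(-1, 1, 1)
    x_t = (1 - t) * x0 + t * x1                             # point on the straight path
    target_v = x1 - x0                                      # constant velocity along the path
    pred_v = v_net(x_t, t.squeeze(-1).squeeze(-1))          # predicted velocity field
    return torch.nn.functional.mse_loss(pred_v, target_v)

# Toy usage on point clouds (B, N, 3); a real v_net would be a point-cloud network
v_net = lambda x_t, t: torch.zeros_like(x_t)
x1 = torch.randn(4, 1024, 3)
print(flow_matching_loss(v_net, x1))
```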



References 1


2025 - How I Understand Flow Matching

Flow Matching - GIF

2015 - U-Net

[1] DINOv2: Learning robust visual features without supervision, Oquab et al. TMLR 2023

[2] High-resolution image synthesis with latent diffusion models, Rombach et al. CVPR 2022

[3] Point Cloud Pre-training with Diffusion Models, Zheng et al. CVPR 2024

[4] Diffuse and Disperse, Wang et al. ArXiv 2025

[5] A Tale of Two Features: Stable Diffusion Complements DINO for Zero-Shot Semantic Correspondence, Zhang et al. NeurIPS 2023

[6] OpenScene: 3D Scene Understanding with Open Vocabularies, Peng et al. CVPR 2023

[7] Harnessing Text-to-Image Diffusion Models for Point Cloud Self-Supervised Learning, Chen et al. ArXiv 2025



References 2



2015 - U-Net: Convolutional Networks for Biomedical Image Segmentation

2016 - V-Net: Fully Convolutional Neural Networks for Volumetric Medical Image Segmentation

2020 - nnU-Net: a self-configuring method for deep learning-based biomedical image segmentation

2021 - TransUNet: Transformers Make Strong Encoders for Medical Image Segmentation

2021 - UNETR: Transformers for 3D Medical Image Segmentation










Some other topics

Masked Flow Matching for Real-Time Signal Processing

2010 - Meta-learning for time series forecasting and forecast combination

Backbone: Masked Flow Matching + Bayesian Layers
– Variational analogs of convolutional/fully-connected layers (e.g. DenseVariational, BayesianLinear)
– Weights modeled by variational distribution $q(w)$


ELBO Loss
Minimize the negative Evidence Lower Bound:

\[\mathcal{L}_{\mathrm{ELBO}}(\theta, \phi) = \underbrace{\mathbb{E}_{w\sim q_\phi(w)}\bigl[\ell\bigl(f_w(x),\,y\bigr)\bigr]}_{\displaystyle\text{(1) Expected data loss}} \;+\; \underbrace{\lambda\;\mathrm{KL}\bigl(q_\phi(w)\,\|\,p(w)\bigr)}_{\displaystyle\text{(2) Variational regularization}}\]


Inference via Weight Sampling
Perform $T$ stochastic forward passes with $w_t \sim q(w)$, then compute:
\(\mu(x) \;=\; \frac{1}{T}\sum_{t=1}^T f_{w_t}(x), \quad \sigma^2(x) \;=\; \frac{1}{T}\sum_{t=1}^T\bigl(f_{w_t}(x) - \mu(x)\bigr)^2\)



Decision Making
If $\sigma(x) > \tau$, trigger a fallback or alert
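A minimal sketch of the pieces above, assuming a mean-field Gaussian `BayesianLinear` layer with the reparameterization trick, a standard-normal prior, a single Monte Carlo sample for the expected data loss, and a regression loss; names, shapes, and the threshold $\tau$ are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BayesianLinear(nn.Module):
    """Mean-field Gaussian variational linear layer: w ~ q(w) = N(mu, sigma^2)."""
    def __init__(self, d_in: int, d_out: int):
        super().__init__()
        self.mu = nn.Parameter(torch.randn(d_out, d_in) * 0.05)
        self.rho = nn.Parameter(torch.full((d_out, d_in), -4.0))   # sigma = softplus(rho)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        sigma = F.softplus(self.rho)
        w = self.mu + sigma * torch.randn_like(sigma)   # reparameterized weight sample
        return x @ w.T

    def kl(self) -> torch.Tensor:
        """Closed-form KL( q(w) || p(w) ) against a standard-normal prior."""
        sigma = F.softplus(self.rho)
        return (0.5 * (sigma**2 + self.mu**2 - 1) - torch.log(sigma)).sum()

def elbo_loss(layer: BayesianLinear, x, y, lam: float = 1e-3) -> torch.Tensor:
    """(1) expected data loss (one MC sample) + (2) lambda * KL regularizer."""
    return F.mse_loss(layer(x), y) + lam * layer.kl()

@torch.no_grad()
def predict_with_uncertainty(layer, x, T: int = 30, tau: float = 0.5):
    """T stochastic forward passes -> predictive mean/variance; flag inputs with sigma > tau."""
    samples = torch.stack([layer(x) for _ in range(T)])            # (T, B, d_out)
    mu, var = samples.mean(dim=0), samples.var(dim=0, unbiased=False)
    return mu, var, var.sqrt() > tau                               # fallback/alert mask

# Toy usage
layer = BayesianLinear(16, 1)
x, y = torch.randn(32, 16), torch.randn(32, 1)
elbo_loss(layer, x, y).backward()
mu, var, alert = predict_with_uncertainty(layer, x)
```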



References

  • Bayesian Neural Nets / BNN





Study Notes


nGPT vs. Mamba

Deep Learning Training Challenge
─────────────────────────────────────────────────────────────────
Training Efficiency (Steps)  ←──── TRADEOFF ────→  Architecture Complexity
        │                                                      │
        ▼                                                      ▼
Simple architectures:                              Complex architectures:
Fast convergence but                               Slow convergence, needs
limited performance                                warmup, weight decay, etc.
        │                                                      │
        └─────────────── nGPT SOLUTION ────────────────────────┘
                                    │
                                    ▼
                        Hypersphere Geometry
                    4-20x faster training + architectural elegance
Sequence Processing Challenge  
───────────────────────────────────────────────────────────────
Computational Efficiency     ←──── TRADEOFF ────→    Context Understanding
        │                                                    │
        ▼                                                    ▼
RNNs: O(n) complexity                              Transformers: O(n²) complexity
Sequential processing                               Parallel processing
Limited context window                             Global attention
        │                                                     │
        └──────────────── MAMBA SOLUTION ─────────────────────┘
                                    │
                                    ▼
                          Selective State Space
                    O(n) complexity + global understanding


nGPT - Training Optimization

Solution: Mathematical reinterpretation
├── Transformer = Hypersphere optimizer
├── Normalization = Geometric constraints
└── Updates = Spherical interpolation
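A rough sketch of that reading of nGPT, assuming unit-norm hidden states and a learnable per-dimension step size: each sublayer proposes a point on the sphere, the state is interpolated toward it, and the result is re-projected onto the sphere. The sublayer here is a stand-in for attention/MLP, and details (eigen learning rates, scaling factors) are simplified.

```python
import torch
import torch.nn.functional as F

def sphere(x: torch.Tensor) -> torch.Tensor:
    """Project activations back onto the unit hypersphere (L2 normalization)."""
    return F.normalize(x, dim=-1)

def ngpt_style_update(h: torch.Tensor, sublayer, alpha: torch.Tensor) -> torch.Tensor:
    """Normalized-residual step: move h toward the normalized sublayer output by a
    learned step size alpha, then re-project onto the sphere (approximate SLERP)."""
    suggestion = sphere(sublayer(h))             # sublayer's suggested point on the sphere
    return sphere(h + alpha * (suggestion - h))  # interpolate, then renormalize

# Toy usage: hidden states on the sphere, a linear map standing in for attention/MLP
h = sphere(torch.randn(2, 8, 64))
sublayer = torch.nn.Linear(64, 64)
alpha = torch.full((64,), 0.05)                  # learnable per-dimension step size
h_next = ngpt_style_update(h, sublayer, alpha)
print(h_next.norm(dim=-1))                       # ~1.0 everywhere
```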


Mamba - Inference Efficiency

Solution: Architectural paradigm shift
├── Attention mechanism → State space model
├── Parallel computation → Selective recursion
└── Global perception → Compressed memory
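A toy, sequential sketch of the selective state-space idea: the recurrence parameters depend on the current input (selectivity), and the state is updated in O(n) over the sequence. This is a conceptual reference, not the hardware-aware parallel scan of the actual Mamba kernels; dimensions and parameterization are illustrative.

```python
import torch
import torch.nn as nn

class ToySelectiveSSM(nn.Module):
    """Minimal input-dependent (selective) state-space recurrence:
    s_t = exp(-Δ_t) * s_{t-1} + Δ_t * B_t * x_t,  y_t = <C_t, s_t>,
    with Δ_t, B_t, C_t all computed from the current input x_t."""
    def __init__(self, d_model: int, d_state: int = 16):
        super().__init__()
        self.to_delta = nn.Linear(d_model, d_model)   # input-dependent step size
        self.to_B = nn.Linear(d_model, d_state)       # input-dependent input matrix
        self.to_C = nn.Linear(d_model, d_state)       # input-dependent output matrix

    def forward(self, x: torch.Tensor) -> torch.Tensor:         # x: (B, L, D)
        b, l, d = x.shape
        s = x.new_zeros(b, d, self.to_B.out_features)           # state: (B, D, N)
        ys = []
        for t in range(l):                                       # O(L) sequential scan
            xt = x[:, t]                                         # (B, D)
            delta = torch.nn.functional.softplus(self.to_delta(xt))
            decay = torch.exp(-delta).unsqueeze(-1)              # (B, D, 1) forget gate
            Bt = self.to_B(xt).unsqueeze(1)                      # (B, 1, N)
            Ct = self.to_C(xt).unsqueeze(1)                      # (B, 1, N)
            s = decay * s + (delta.unsqueeze(-1) * Bt) * xt.unsqueeze(-1)
            ys.append((s * Ct).sum(-1))                          # (B, D) readout
        return torch.stack(ys, dim=1)                            # (B, L, D)

# Toy usage
ssm = ToySelectiveSSM(d_model=32)
print(ssm(torch.randn(2, 100, 32)).shape)   # torch.Size([2, 100, 32])
```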



“2020 - System Design Interview: An Insider’s Guide” - Alex Xu - Ver_1.0


2024 - A Beginner’s Guide to System Design


Common Data Capacity Units (Power-of-Two Table)

| Unit Symbol | Name     | Definition                          | Exact Bytes         | Decimal Approximation |
| ----------- | -------- | ----------------------------------- | ------------------- | --------------------- |
| bit         | bit      | a single binary digit               | 1/8 B               | —                     |
| B           | byte     | 8 bits                              | 1 B                 | —                     |
| KiB         | kibibyte | 2¹⁰ bytes = 1 024 bytes             | 1 024 B             | ≈ 1.02 KB (10³)       |
| MiB         | mebibyte | 2²⁰ bytes = 1 048 576 bytes         | 1 048 576 B         | ≈ 1.05 MB (10⁶)       |
| GiB         | gibibyte | 2³⁰ bytes = 1 073 741 824 bytes     | 1 073 741 824 B     | ≈ 1.07 GB (10⁹)       |
| TiB         | tebibyte | 2⁴⁰ bytes = 1 099 511 627 776 bytes | 1 099 511 627 776 B | ≈ 1.10 TB (10¹²)      |
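A tiny helper to sanity-check the table, assuming binary (power-of-two) prefixes:

```python
def to_binary_units(num_bytes: float) -> str:
    """Format a byte count using power-of-two units (B, KiB, MiB, GiB, TiB)."""
    for unit in ("B", "KiB", "MiB", "GiB", "TiB"):
        if num_bytes < 1024 or unit == "TiB":
            return f"{num_bytes:.2f} {unit}"
        num_bytes /= 1024

print(to_binary_units(1_073_741_824))   # 1.00 GiB (= 2**30 bytes ≈ 1.07 * 10**9 bytes)
```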







