Model Structures - 25
- Let's take a look at the history of the model structures we're using today.
Evolution of 3D Scene Representations
| Period | Method | Representation | Advantages / Limitations |
|---|---|---|---|
| 1980–2010s | SfM / MVS / Mesh | Explicit point clouds or polygonal meshes | Accurate but discrete; cannot represent complex appearance or soft surfaces |
| 1990–2010s | IBR / Light Field | Sampled light rays or image interpolation | Highly photorealistic but lacks true 3D geometry; strong view dependence |
| 2015–2019 | Deep Implicit Fields | Implicit functions (Occupancy / Signed Distance Field) | Continuous and smooth geometry; no explicit color or reflectance modeling |
| 2020–2022 | NeRF family | Neural radiance fields (density + color) | Unified geometry and appearance; high fidelity but slow to train and render |
| 2023–Now | 3D Gaussian Splatting (3DGS) | Explicit point-based volumetric primitives (Gaussian ellipsoids with color, opacity, and anisotropy) | Extremely fast rendering and editing; preserves view consistency, but lacks strong geometry regularization and semantic understanding |
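As a concrete picture of the 3DGS row above, here is a hedged sketch of the per-primitive parameters (field names are illustrative, not taken from any particular codebase; real implementations store these as packed GPU tensors):

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Gaussian3D:
    """Illustrative parameters of one 3DGS primitive (names are assumptions, not from the notes)."""
    mean: np.ndarray       # (3,) center position in world space
    scale: np.ndarray      # (3,) per-axis extent -> anisotropic ellipsoid
    rotation: np.ndarray   # (4,) unit quaternion (w, x, y, z) orienting the ellipsoid
    opacity: float         # alpha used during splatting / blending
    sh_coeffs: np.ndarray  # (k, 3) spherical-harmonics coefficients for view-dependent color

    def covariance(self) -> np.ndarray:
        """Covariance in the standard 3DGS factorization: Sigma = R S S^T R^T."""
        w, x, y, z = self.rotation
        R = np.array([
            [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
            [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
            [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
        ])
        S = np.diag(self.scale)
        return R @ S @ S.T @ R.T
```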
Deep Learning after Classical ML
| Component | Description |
|---|---|
| Origin | The key training algorithm, Backpropagation, was popularized in 1986 by Rumelhart, Hinton, and Williams, laying the foundation for modern Deep Learning. |
| Key Idea | Learn hierarchical representations, from low-level edges to high-level concepts, through multiple neural layers. |
| Representation | Automatically extracts features from raw data (images, audio, text) instead of manual feature engineering. |
| Optimization | Trains large neural networks using gradient descent and backpropagation to minimize a defined loss. |
| Architecture | Stacks multiple nonlinear transformations (e.g., CNNs, RNNs, Transformers) to form deep computational graphs. |
| Generalization | Learns robust patterns that transfer to unseen data, aided by large datasets, GPUs, and regularization methods. |
| Impact / Use | Powers modern AI systems in vision (CNNs), language (Transformers), speech (RNNs), and generative models (Diffusion, GANs). |
Deep Learning, Training, and Knowledge Distillation
| Dimension | Deep Learning | Training | Knowledge Distillation |
|---|---|---|---|
| Objective | Learn multi-layer nonlinear function $f(x; \theta)$ to represent complex patterns. | Optimize a loss function from data. | Make the student model mimic the teacher's output distribution and internal representations. |
| Input Information | Raw data (x) | (x, y) | (x, y, T(x)) |
| Loss Function | Any differentiable objective. | Task loss $\ell(f(x), y)$ | $\alpha\,\ell(f(x), y) + (1-\alpha)\,\mathrm{KL}\big(f(x) \parallel T(x)\big)$ |
| Supervision Source | Data itself. | Hard labels (y). | Teacher outputs (T(x)) + true labels (y). |
| Entropy Characteristic | May be high or low depending on task. | Low-entropy one-hot supervision. | High-entropy soft targets (smoothed teacher outputs). |
| Optimization Process | BP + GD (Backpropagation + Gradient Descent). | BP + GD. | BP + GD with temperature scaling $\tau$. |
| Application Goal | General representation learning. | Task-specific model fitting. | Model compression, knowledge transfer, or performance enhancement. |
| Output Features | Deep hierarchical representations. | Task predictions. | Balanced task accuracy and teacher–student alignment. |
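A minimal PyTorch-style sketch of the distillation objective in the last column (the function name and the `alpha`/`tau` defaults are illustrative; note that the KL direction is written teacher-to-student here, the more common convention, while the table writes the arguments the other way around):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, alpha=0.5, tau=2.0):
    """alpha * task loss + (1 - alpha) * KL between temperature-softened distributions."""
    # Hard-label term: standard task loss against the ground-truth labels y.
    task_loss = F.cross_entropy(student_logits, labels)
    # Soft-label term: student log-probs vs. teacher probs, both softened by temperature tau;
    # the tau**2 factor keeps gradient magnitudes comparable (Hinton et al., 2015).
    log_p_student = F.log_softmax(student_logits / tau, dim=-1)
    p_teacher = F.softmax(teacher_logits / tau, dim=-1)
    kd_loss = F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (tau ** 2)
    return alpha * task_loss + (1 - alpha) * kd_loss
```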
Deep Learning World                                 Classical ML World
─────────────────────────────────────────────       ─────────────────────────────────────────────
Raw Data → Multi-layer Network                       Handcrafted Features → Shallow Model
Learn Representations → Optimize by Gradient         Manual Design → Limited Adaptability
        │                    │                               │                    │
┌────────────────┐  ┌──────────────────┐           ┌──────────────────┐  ┌─────────────────────────┐
│ Raw Inputs     │→ │ Neural Layers    │    vs.    │ Engineered Feats │→ │ Classifier (SVM/Tree)   │
│ (Image/Text)   │  │ (CNN/RNN/Trans.) │           │ (HOG/SIFT/MFCC)  │  │ (Fixed Decision Rules)  │
└────────────────┘  └──────────────────┘           └──────────────────┘  └─────────────────────────┘
        │                    │                               │                    │
End-to-End Learning    Automatic Feature Hierarchy     Manual Tuning Needed    Poor Transferability
(Backprop + Gradient)  (Low→Mid→High Abstractions)     (Domain-Specific)       (Retrain for New Task)
Hybrid approaches:
1. Use pretrained deep features + classical models for fast adaptation (see the sketch after this list)
2. Fine-tune deep backbones with task-specific heads for efficiency
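A hedged sketch of hybrid approach 1: freeze a pretrained backbone, extract features once, and fit a classical model on top. The data here is a random placeholder; any pretrained encoder and shallow classifier would do.

```python
import numpy as np
import torch
import torchvision.models as models
from sklearn.svm import LinearSVC

# Placeholder data standing in for a real dataset (16 RGB images, 2 classes).
train_images = torch.randn(16, 3, 224, 224)
train_labels = np.random.randint(0, 2, size=16)

# Frozen pretrained backbone as a feature extractor (downloads ImageNet weights).
backbone = models.resnet18(weights="DEFAULT")
backbone.fc = torch.nn.Identity()            # drop the classifier head, keep 512-d features
backbone.eval()

with torch.no_grad():
    feats = backbone(train_images).numpy()   # (16, 512) deep features, no fine-tuning

clf = LinearSVC()                            # classical shallow model on top: fast to (re)train
clf.fit(feats, train_labels)
print(clf.score(feats, train_labels))
```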
Deep Learning = Student who learns concepts from examples (automatic understanding)
Classical ML = Student who uses fixed formulas (must be told what features matter)
Generalization Ability
Unsupervised World                                  Zero-Shot World
─────────────────────────────────────────────       ─────────────────────────────────────────────
No Labels → Discover Patterns                        Pretrained Knowledge → New Task
Cluster / Reduce Dim → Build Representations         Direct Prediction → Works Instantly
        │                    │                               │                    │
┌────────────────┐  ┌──────────────────┐           ┌──────────────────┐  ┌─────────────────────────┐
│ Data Patterns  │→ │ Learned Embeds   │    vs.    │ Language / Text  │→ │ Recognize Unseen Task   │
│ (Raw Inputs)   │  │ (Structure Only) │           │ Semantic Priors  │  │ (Zero Examples)         │
└────────────────┘  └──────────────────┘           └──────────────────┘  └─────────────────────────┘
        │                    │                               │                    │
Unclear for Tasks      Needs Extra Step               Direct Generalization   Immediate Usability
(PCA/K-means/SimCLR)   (Downstream Fine-tune)         (CLIP, GPT)             (Zero-shot QA/CLS)
Hybrid approaches:
1. Learn unsupervised embeddings → map them into a semantic space for zero-shot transfer (see the sketch after this list)
2. Combine raw pattern discovery with pretrained knowledge for stronger generalization
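A hedged sketch of hybrid approach 1: embed inputs and class descriptions into one semantic space and predict by nearest prototype, CLIP-style. The random embeddings below are stand-ins for outputs of a real pretrained encoder.

```python
import numpy as np

def zero_shot_classify(input_emb, class_embs, class_names):
    """Pick the class whose semantic embedding is most cosine-similar to the input."""
    a = input_emb / np.linalg.norm(input_emb)
    b = class_embs / np.linalg.norm(class_embs, axis=1, keepdims=True)
    return class_names[int(np.argmax(b @ a))]                 # no task-specific training step

# Toy stand-in embeddings (a real system would use e.g. CLIP text/image encoders).
classes = ["cat", "dog", "car"]
class_embs = np.random.randn(3, 512)                          # semantic "prototypes"
image_emb = class_embs[1] + 0.05 * np.random.randn(512)       # resembles the "dog" prototype
print(zero_shot_classify(image_emb, class_embs, classes))     # -> "dog"
```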
Unsupervised = Tourist wandering a city with no map (discover zones by yourself)
Zero-Shot = Tourist with a guidebook (instantly spot city hall & cathedral)
Why Deep Structure
- Compared with classical machine learning models:
- Linear Regression / SVM / Shallow Decision Trees
- Deep structures refer to neural networks with Multiple Layers of Nonlinear Transformations
- An open question: what other temporal modeling approaches can realize this nonlinear transformation?
Content
- Transformer
- Mamba
- GPT
- Tokenization
- ARIMA
- RNN, LSTM, GRU
- Diffusion Models
- Flow Matching
- Quantization / Adapter Guided - LoRA + QLoRA
- These deep models are capable of Learning Hierarchical Features, where each layer captures increasingly abstract representations of the data.
Local Minima vs. Saddle Points
In practice, βDeepβ means:
- More than 3β4 layers in fully connected networks
- 10+ layers in convolutional networks
- Or even hundreds of layers in modern transformers like GPT
Key Structures
- MLP
- Multilayer Perceptron
- Feedforward fully connected networks
- Used in classification, regression, or small-scale tabular/audio tasks
- 1989 - Universal Approximation Theorem / Still used as light head in multimodal systems
- RNN -> LSTM
- When inputs are sequences
- Hochreiter & Schmidhuber 1997 - LSTM
- Some Other Temporal Modeling
- GRU
- ConvGRU
- DynamicLSTM
- GatedGRU
- CNN
- Convolutional Neural Networks
- When inputs are images or grid-like data
- Extracts spatial features, widely used in image/audio tasks
- Fully Connected Layer -> Receptive Field -> Parameter Sharing -> Convolutional Layer
- 1998 - LeNet / 2012 - AlexNet: ImageNet Classification with Deep Convolutional Neural Networks
- Transformer
- When inputs are sequences
- Self-attention + Parallel computation (see the attention sketch after this list)
- 2015 ICLR - Neural Machine Translation by Jointly Learning to Align and Translate - Additive Attention
- 2017 NeurIPS - Attention Is All You Need - Self-Attention / Scaled Dot-Product Attention
- Mamba
- Linear-Time Sequence Modeling
- State Space Model - SSM - with selective long-range memory
- 2023 - Mamba: Linear-Time Sequence Modeling with Selective State Spaces
- Conformer
- Convolution + Transformer = Conformer
- Combines local (CNN) and global (self-attention) features
- Widely used in speech recognition tasks
- 2020 - Conformer: Convolution-augmented Transformer for Speech Recognition
- GAN
- Generator vs Discriminator
- Generates Images, Audio
- Popular in TTS, audio enhancement, and image generation
- 2014 - Generative Adversarial Nets
- Diffusion Based
- Gradual denoising process to generate samples from noise
- Currently SoTA in image and speech generation
- Training is stable, generation is slow
- In Diffusion
- The model learns to reverse noise through a pre-defined noise schedule
- It does not evaluate or penalize each intermediate step
- There is no βfitness scoreβ like in genetic algorithms
- In genetic algorithms - GA
- Every candidate (individual) is evaluated using a fitness function
- Poor candidates are penalized or discarded
- 2020 - Denoising Diffusion Probabilistic Models
- SSL - Self-Supervised Learning
- Learns from unlabeled data by solving pretext tasks
- Strong performance in low-resource and zero-shot setups
- Memory - Transformers vs. RNN / LSTM
- Add Reflection - 2024 - You Only Cache Once: Decoder-Decoder Architectures for Language Models
- Flow Matching
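As referenced in the Transformer entry above, here is a minimal NumPy sketch of scaled dot-product self-attention from Attention Is All You Need (shapes and inputs are toy; real layers add learned Q/K/V projections, masking, and multiple heads):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                            # pairwise similarity logits
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)             # row-wise softmax
    return weights @ V                                         # weighted sum of values

n, d = 4, 8                          # sequence length, model dimension
X = np.random.randn(n, d)            # token representations
out = scaled_dot_product_attention(X, X, X)   # self-attention: Q = K = V come from X
print(out.shape)                     # (4, 8); every token attends to every other in parallel
```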
1. Gradient Noise
2. What is Gradient Noise?
| Source | Explanation |
|---|---|
| Sampling noise | Each batch only samples part of the data, so the gradient is an approximation of the true mean. |
| Reward noise (RL-specific) | Rewards from the environment vary greatly across trajectories. |
| Numerical noise (hardware) | Floating-point rounding errors, limited bfloat16 precision, or non-deterministic accumulation order. |
| Communication noise (multi-GPU) | Random order of all-reduce operations causes slight variations in summed gradients. |
| Regularization noise | Dropout and mixed-precision scaling introduce artificial randomness. |
- Exact gradient of the loss function (ideal case): $g = \nabla_\theta \mathcal{L}(\theta)$
- Noisy gradient observed in practice: $\hat{g} = \nabla_\theta \mathcal{L}(\theta) + \varepsilon$, where $\varepsilon$ is the noise term
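A toy NumPy sketch of the sampling-noise row above: minibatch gradients $\hat{g}$ scatter around the full-batch gradient $g$ (linear-regression setup; all names and sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 5))
w_true = rng.normal(size=5)
y = X @ w_true + 0.1 * rng.normal(size=10_000)
w = np.zeros(5)                                  # current parameters

def grad(Xb, yb, w):
    """Gradient of the mean squared error 0.5 * mean((Xb @ w - yb)**2) w.r.t. w."""
    return Xb.T @ (Xb @ w - yb) / len(yb)

g_full = grad(X, y, w)                           # "exact" full-batch gradient g
batch_grads = [grad(X[idx], y[idx], w)
               for idx in (rng.choice(len(X), size=32, replace=False) for _ in range(100))]
noise = np.stack(batch_grads) - g_full           # epsilon = g_hat - g for each minibatch
print("per-coordinate std of epsilon:", noise.std(axis=0))   # shrinks as batch size grows
```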
3. Why Gradient Noise is Especially Large in RL
\[J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)]\]
- Expected future reward objective in reinforcement learning
- Policy gradient estimating how parameters affect expected reward
- High variance of rewards and log-probabilities amplifies gradient noise
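For reference, the standard score-function (REINFORCE-style) estimator of this policy gradient, which is not spelled out in the notes themselves, makes the variance problem explicit:

\[\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[ \nabla_\theta \log \pi_\theta(\tau)\, R(\tau) \right] \approx \frac{1}{N} \sum_{i=1}^{N} \nabla_\theta \log \pi_\theta(\tau_i)\, R(\tau_i)\]

Because both $R(\tau_i)$ and $\nabla_\theta \log \pi_\theta(\tau_i)$ vary strongly across sampled trajectories, this Monte-Carlo average has high variance, which is exactly the amplified gradient noise described above.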
4. Learning Rates - Theoretically
5. Historical Development of Kernel Functions
| Year | Person / School | Contribution |
|---|---|---|
| 1909 | James Mercer | Proposed Mercer's Theorem: the mathematical relationship between symmetric positive-definite kernel functions and inner-product spaces of high dimension. |
| 1930–1950 | Integral-Equation School | The term kernel originally referred to the weighting function of an integral operator, $K(x, y)$. |
| 1960s | Vapnik & Chervonenkis (USSR) | Developed statistical learning theory and introduced the idea of implicit feature mapping. |
| 1992–1995 | Vapnik, Boser, Guyon, Cortes | Formally applied kernel functions in Support Vector Machines (SVM) using the kernel trick to avoid explicit high-dimensional mapping. |
6. Common Kernel Functions
| Kernel Name | Formula | Feature |
|---|---|---|
| Linear Kernel | $K(x, y) = x^{T}y$ | Original linear inner product |
| Polynomial Kernel | $K(x, y) = (x^{T}y + c)^{d}$ | Polynomial non-linear mapping |
| RBF / Gaussian Kernel | $K(x, y) = \exp\!\big(-\lVert x - y \rVert^{2} / (2\sigma^{2})\big)$ | Infinite-dimensional mapping; most commonly used |
| Sigmoid Kernel | $K(x, y) = \tanh(\alpha\, x^{T}y + c)$ | Similar to a neural-network activation function |
| Laplacian / Exponential Kernel | $K(x, y) = \exp\!\big(-\lVert x - y \rVert / \sigma\big)$ | More sensitive to sparse features |
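A small NumPy sketch of the kernels in the table (parameter defaults such as `c`, `d`, and `sigma` are illustrative); each computes an inner product in some implicit feature space, which is the point of the kernel trick:

```python
import numpy as np

def linear_kernel(x, y):
    return x @ y                                           # plain inner product

def polynomial_kernel(x, y, c=1.0, d=3):
    return (x @ y + c) ** d                                # polynomial feature map, implicitly

def rbf_kernel(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma**2))  # infinite-dimensional feature map

def laplacian_kernel(x, y, sigma=1.0):
    # Matches the table's exp(-||x - y|| / sigma); some references use the L1 distance instead.
    return np.exp(-np.linalg.norm(x - y) / sigma)

def sigmoid_kernel(x, y, alpha=0.01, c=0.0):
    return np.tanh(alpha * (x @ y) + c)                    # activation-like; not always PSD

x, y = np.random.randn(4), np.random.randn(4)
print(linear_kernel(x, y), rbf_kernel(x, y), laplacian_kernel(x, y))
```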
Convolution and CNN as Structured DNN
- 1989 - Backpropagation applied to handwritten zip code recognition
- 1998 - LeNet5 - Gradient-Based Learning Applied to Document Recognition
- Neurobiological Fundamentals
- 1959–1962 - David Hubel & Torsten Wiesel
- 1980 - Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position
| Aspect | Mathematical Convolution | CNN Convolution Layer | Fully Connected DNN (MLP) | Intuitive Explanation |
|---|---|---|---|---|
| First appearance | 18th–19th century (Fourier analysis, signal processing) | 1989 | 1950s–60s | Convolution long predates deep learning |
| Key proposer | – | Yann LeCun | Rosenblatt, Widrow, others | CNN formalized for vision tasks |
| Original motivation | Analyze signals & systems | Model visual pattern recognition | Generic function approximation | CNN was built for images |
| Core formula | $(f * g)(x)=\int f(\tau)g(x-\tau)\,d\tau$; $(f * g)[i]=\sum_k f[k]g[i-k]$ | Same discrete form (kernel flip ignored in practice) | $y=Wx+b$ | All are linear mappings |
| What it actually does | Sliding weighted sum | Sliding weighted sum | One-shot global weighted sum | CNN looks locally, MLP looks everywhere |
| How weights behave | One kernel reused everywhere | Same filter reused across space | Each connection has its own weight | CNN "reuses the same eye" |
| Connectivity pattern | Local | Local | Dense | CNN ignores far-away pixels |
| Weight matrix view | Toeplitz / structured operator | Sparse, tied $W$ | Dense $W$ | CNN = heavily constrained $W$ |
| Translation behavior | Translation equivariant | Translation equivariant | Not equivariant | Shift image → shift features |
| Layer equation | – | $y_i=\sum_{j\in\mathcal{N}(i)} w_{j-i}x_j$ | $y=Wx+b$ | CNN is a restricted linear layer |
| Parameters | Few (kernel-sized) | Few | Many | CNN saves parameters massively |
| Inductive bias | Built-in locality & symmetry | Built-in locality & symmetry | None | CNN encodes assumptions about the world |
| Function class | – | Subset of DNN | Superset | CNN $\subset$ DNN |
| What is not different | – | Expressive power in principle | Expressive power in principle | Difference is learning efficiency |
| Final takeaway | A mathematical operator | A structured linear layer | A generic linear layer | CNN is a structured DNN |
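A NumPy sketch of the "CNN = heavily constrained $W$" rows above: the same 1-D convolution written as a sliding weighted sum and as a sparse, weight-tied (Toeplitz-like) matrix, versus an unconstrained dense layer (sizes and the "valid" boundary handling are illustrative):

```python
import numpy as np

x = np.random.randn(8)           # input signal
k = np.array([0.25, 0.5, 0.25])  # one shared kernel: 3 parameters total

# (a) Convolution as a sliding weighted sum (kernel flip ignored, as in CNN practice).
y_conv = np.array([k @ x[i:i + 3] for i in range(len(x) - 2)])

# (b) The same map as a structured matrix: each row is the kernel shifted by one
#     position, zeros elsewhere (locality + parameter sharing).
W_conv = np.zeros((len(x) - 2, len(x)))
for i in range(len(x) - 2):
    W_conv[i, i:i + 3] = k
assert np.allclose(y_conv, W_conv @ x)

# (c) A fully connected layer has an unconstrained dense W: 6 x 8 = 48 free weights vs. 3.
W_dense = np.random.randn(len(x) - 2, len(x))
y_dense = W_dense @ x
```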
Concepts From CNN
| Concept | What it is | Who formalized / popularized it | What problem it solves | Why it matters |
|---|---|---|---|---|
| Equivariance to translation | Convolution commutes with spatial translation: $f(Tx)=T(f(x))$ | LeCun et al. (CNNs, 1990s) | Detects features regardless of location | Encodes translation symmetry |
| Patch processing | Applies the same local operator to overlapping spatial patches | Hubel & Wiesel (biology); CNNs | Exploits local spatial correlations | Enforces locality bias |
| Image filtering | Linear filtering with learnable kernels | Classical signal processing; CNNs | Extracts edges, textures, patterns | Bridges signal processing and learning |
| Parameter sharing | Same kernel weights reused across spatial locations | LeCun et al. | Reduces parameter count | Improves data efficiency |
| Variable-sized input processing | Convolution independent of absolute input size | CNN framework design | Handles arbitrary image resolutions | Enables dense prediction |
| Multi-channel convolution | Each output channel sums convolutions over all input channels | Early CNNs | Combines multiple feature maps | Enables cross-channel feature composition |
| Local receptive field | Each neuron depends only on a local neighborhood | Fukushima (Neocognitron), CNNs | Limits interaction range | Defines spatial inductive bias |
| Hierarchical feature composition | Deeper layers compose simpler features into complex ones | Deep CNNs (AlexNet onward) | Models high-level structure | Explains the role of depth |
| Nonlinearity | Pointwise nonlinear functions (e.g., ReLU) | Modern neural networks | Prevents collapse to linear filters | Enables expressive function classes |
| Pooling / downsampling | Spatial aggregation (max, average, strided conv) | LeCun et al. | Introduces robustness to small shifts | Produces approximate invariance |
| Equivariance vs. invariance | Convolution is equivariant; pooling induces invariance | CNN theory | Clarifies preserved vs discarded information | Central to representation design |
| Fully convolutional operator | Defines a mapping on spatial fields, not fixed vectors | FCNs (Long et al.) | Enables per-pixel prediction | Treats CNNs as operators on grids |
| Implicit regularization | Architectural constraints bias learning | Empirical deep learning theory | Improves generalization | Architecture acts as a prior |
| Boundary conditions / padding | How convolution handles image borders | Practical CNN design | Controls artifacts and symmetry breaks | Affects exact equivariance |
| Group equivariance (generalization) | Convolution as a group-equivariant operator | Cohen & Welling (G-CNNs) | Extends symmetry beyond translation | Unifies CNNs with symmetry theory |
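A toy NumPy check of the "equivariance vs. invariance" row: convolution commutes with translation, while global pooling throws the location away (1-D signal and kernel are illustrative; the exact equality holds here because the signal is zero near the borders):

```python
import numpy as np

def conv1d(x, k):
    """Sliding weighted sum ('valid' convolution, kernel flip ignored)."""
    return np.array([k @ x[i:i + len(k)] for i in range(len(x) - len(k) + 1)])

x = np.zeros(16); x[4:7] = 1.0        # a small "bump"
x_shift = np.roll(x, 3)               # the same bump, translated by 3 positions
k = np.array([1.0, -1.0])             # edge-detector-like kernel

# Equivariance: convolving the shifted input gives the shifted feature map.
assert np.allclose(np.roll(conv1d(x, k), 3), conv1d(x_shift, k))

# (Approximate) invariance: global max pooling keeps "what" but discards "where".
assert conv1d(x, k).max() == conv1d(x_shift, k).max()
```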