Model Structures - 25


  • Let’s take a look at the history of the Model Structures we’re using today.

Evolution of 3D Scene Representations

| Period | Method | Representation | Advantages / Limitations |
|---|---|---|---|
| 1980s–2010s | SfM / MVS / Mesh | Explicit point clouds or polygonal meshes | Accurate but discrete; cannot represent complex appearance or soft surfaces |
| 1990s–2010s | IBR / Light Field | Sampled light rays or image interpolation | Highly photorealistic but lacks true 3D geometry; strong view dependence |
| 2015–2019 | Deep Implicit Fields | Implicit functions (Occupancy / Signed Distance Field) | Continuous and smooth geometry; no explicit color or reflectance modeling |
| 2020–2022 | NeRF family | Neural radiance fields (density + color) | Unified geometry and appearance; high fidelity but slow to train and render |
| 2023–now | 3D Gaussian Splatting (3DGS) | Explicit point-based volumetric primitives (Gaussian ellipsoids with color, opacity, and anisotropy) | Extremely fast rendering and editing; preserves view consistency; but lacks strong geometry regularization and semantic understanding |
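
To make the 3DGS row concrete, here is a minimal, illustrative sketch (not the reference implementation) of the parameters a single Gaussian primitive typically carries; the class name and field layout are assumptions, and real systems store color as spherical-harmonic coefficients.

```python
# Illustrative sketch of a 3DGS primitive (assumed names, not the official code).
from dataclasses import dataclass
import numpy as np

@dataclass
class GaussianPrimitive:
    mean: np.ndarray      # (3,) center position in world space
    scale: np.ndarray     # (3,) per-axis standard deviations (anisotropy)
    rotation: np.ndarray  # (4,) unit quaternion orienting the ellipsoid
    opacity: float        # alpha used during front-to-back compositing
    color: np.ndarray     # (3,) RGB; real systems use spherical harmonics

    def covariance(self) -> np.ndarray:
        """Anisotropic 3D covariance: Sigma = R S S^T R^T."""
        w, x, y, z = self.rotation
        R = np.array([
            [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
            [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
            [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
        ])
        S = np.diag(self.scale)
        return R @ S @ S.T @ R.T

g = GaussianPrimitive(mean=np.zeros(3), scale=np.array([0.1, 0.1, 0.02]),
                      rotation=np.array([1.0, 0.0, 0.0, 0.0]), opacity=0.8,
                      color=np.array([0.9, 0.2, 0.2]))
print(g.covariance().shape)  # (3, 3)
```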


Core Continuous Modeling Methods in Computational Geometry

| Method | Core Idea | Advantages | Applicability to 4D Scene Modeling |
|---|---|---|---|
| Parametric Splines (NURBS / B-Spline / T-Spline) | Define surfaces or volumes over parameters $(u, v)$ (or higher-dimensional equivalents); geometry is represented via control points and basis functions (e.g., B-splines). | High geometric precision and controllability; widely used in industrial CAD. | Works well for static, parameterizable objects, but limited for dynamic 4D or large-scale scenes. |
| Subdivision Surfaces | Generate smooth limit surfaces by repeatedly subdividing a coarse mesh (e.g., Catmull–Clark subdivision). | Mathematically well-defined limit surfaces; handles certain topological changes. | Effective for deformable shapes such as humans or cloth, but less flexible for time-varying or large-scale 4D tracking. |
| Implicit Continuous Fields (Neural or Analytical) | Represent geometry implicitly as a function $f(x, y, z) = 0$ or $f(x, y, z, t) = 0$, e.g., DeepSDF (signed distance function) or OctField (adaptive implicit representation). | Extremely flexible; topology-free; can model dynamics; suitable for learning-based frameworks. | Highly promising for 4D reconstruction and tracking, especially when aiming for generalization across scenes. |
| Partial Differential Equations (PDEs) and Variational Methods | Enforce continuity (e.g., $C^2$ smoothness) of surfaces or volumes via PDEs, or use variational formulations to model continuous media. | Strong geometric continuity and clear mathematical properties. | Valuable when physical constraints or deformation laws are known, but computationally expensive for large-scale neural or self-supervised learning setups. |
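
As a minimal illustration of the implicit-field row, the sketch below uses an analytic signed distance function rather than a neural one (function names are assumptions): geometry is the zero level set $f(x, y, z) = 0$, and a 4D field simply adds $t$ as an input.

```python
import numpy as np

def sphere_sdf(p: np.ndarray, center: np.ndarray, radius: float) -> float:
    """Signed distance to a sphere: negative inside, zero on the surface, positive outside."""
    return float(np.linalg.norm(p - center) - radius)

# The surface is the zero level set f(x, y, z) = 0; a time-varying field
# f(x, y, z, t) = 0 follows the same pattern with t as an extra input.
center, radius = np.zeros(3), 1.0
print(sphere_sdf(np.array([2.0, 0.0, 0.0]), center, radius))  #  1.0 (outside)
print(sphere_sdf(np.array([0.0, 0.0, 0.0]), center, radius))  # -1.0 (inside)
```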


DL after Classic ML

| Component | Description |
|---|---|
| Origin | Training of multi-layer networks was popularized in 1986 by Rumelhart, Hinton, and Williams through the backpropagation algorithm. |
| Key Idea | Learn hierarchical representations, from low-level edges to high-level concepts, through multiple neural layers. |
| Representation | Automatically extracts features from raw data (images, audio, text) instead of relying on manual feature engineering. |
| Optimization | Trains large neural networks using gradient descent and backpropagation to minimize a defined loss. |
| Architecture | Stacks multiple nonlinear transformations (e.g., CNNs, RNNs, Transformers) to form deep computational graphs. |
| Generalization | Learns robust patterns that transfer to unseen data, aided by large datasets, GPUs, and regularization methods. |
| Impact / Use | Powers modern AI systems in vision (CNNs), language (Transformers), speech (RNNs), and generative models (Diffusion, GANs). |
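
A minimal sketch of the Optimization row, assuming a toy regression task: a two-layer network trained with backpropagation and gradient descent in plain NumPy (layer sizes, learning rate, and data are illustrative choices).

```python
import numpy as np

# Two-layer network fit to y = sin(x) with manual backpropagation + gradient descent.
rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=(256, 1))
y = np.sin(x)

W1, b1 = rng.normal(0, 0.5, (1, 32)), np.zeros(32)
W2, b2 = rng.normal(0, 0.5, (32, 1)), np.zeros(1)
lr = 1e-2

for step in range(2000):
    h = np.tanh(x @ W1 + b1)              # layer 1: learned nonlinear features
    pred = h @ W2 + b2                    # layer 2: output
    loss = np.mean((pred - y) ** 2)       # task loss

    # Backpropagation: chain rule from the loss back to every parameter.
    g_pred = 2 * (pred - y) / len(x)
    g_W2, g_b2 = h.T @ g_pred, g_pred.sum(0)
    g_h = (g_pred @ W2.T) * (1 - h ** 2)
    g_W1, g_b1 = x.T @ g_h, g_h.sum(0)

    # Gradient descent update.
    W1 -= lr * g_W1; b1 -= lr * g_b1
    W2 -= lr * g_W2; b2 -= lr * g_b2

print(f"final MSE: {loss:.4f}")
```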


Deep Learning, Training, and Knowledge Distillation

| Dimension | Deep Learning | Training | Knowledge Distillation |
|---|---|---|---|
| Objective | Learn a multi-layer nonlinear function $f(x; \theta)$ to represent complex patterns. | Optimize a loss function from data. | Make the student model mimic the teacher's output distribution and internal representations. |
| Input Information | Raw data $x$ | $(x, y)$ | $(x, y, T(x))$ |
| Loss Function | Any differentiable objective. | Task loss $\mathcal{L}(f(x), y)$ | $\alpha\, \mathcal{L}(f(x), y) + (1-\alpha)\, \mathrm{KL}(T(x) \parallel f(x))$ |
| Supervision Source | Data itself. | Hard labels $y$. | Teacher outputs $T(x)$ + true labels $y$. |
| Entropy Characteristic | May be high or low depending on task. | Low-entropy one-hot supervision. | High-entropy soft targets (temperature-smoothed teacher outputs). |
| Optimization Process | Backpropagation + gradient descent. | Backpropagation + gradient descent. | Backpropagation + gradient descent with temperature scaling $\tau$. |
| Application Goal | General representation learning. | Task-specific model fitting. | Model compression, knowledge transfer, or performance enhancement. |
| Output Features | Deep hierarchical representations. | Task predictions. | Balanced task accuracy and teacher–student alignment. |
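
A hedged PyTorch-style sketch of the distillation loss column: the hard-label task loss plus a temperature-scaled KL term between teacher and student distributions. The values of alpha and tau and the random tensors are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, alpha=0.5, tau=2.0):
    """Soft-target knowledge distillation; alpha and tau are illustrative."""
    # Hard-label task loss L(f(x), y).
    task = F.cross_entropy(student_logits, labels)
    # Soft-target term: KL between temperature-smoothed teacher and student.
    soft_teacher = F.softmax(teacher_logits / tau, dim=-1)
    log_soft_student = F.log_softmax(student_logits / tau, dim=-1)
    distill = F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * tau ** 2
    return alpha * task + (1 - alpha) * distill

# Random tensors standing in for real student/teacher outputs over 10 classes.
student = torch.randn(8, 10)
teacher = torch.randn(8, 10)
labels = torch.randint(0, 10, (8,))
print(distillation_loss(student, teacher, labels))
```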


Deep Learning World                             Classical ML World
═══════════════════════════════════             ════════════════════════════════════
Raw Data → Multi-layer Network →                Handcrafted Features → Shallow Model →
Learn Representations → Optimize by Gradient    Manual Design → Limited Adaptability
     ↓                     ↓                           ↓                     ↓
┌───────────────┐   ┌─────────────────┐           ┌─────────────────┐   ┌───────────────────────┐
│ Raw Inputs    │ → │ Neural Layers   │    vs.    │ Engineered Feats│ → │ Classifier (SVM/Tree) │
│ (Image/Text)  │   │ (CNN/RNN/Trans.)│           │ (HOG/SIFT/MFCC) │   │ (Fixed Decision Rules)│
└───────────────┘   └─────────────────┘           └─────────────────┘   └───────────────────────┘
     ↓                     ↓                           ↓                     ↓
End-to-End Learning     Automatic Feature Hierarchy   Manual Tuning Needed   Poor Transferability
(Backprop + Gradient)   (Low→Mid→High Abstractions)   (Domain-Specific)      (Retrain for New Task)

Hybrid approaches:
1. Use pretrained deep features + classical models for fast adaptation  
2. Fine-tune deep backbones with task-specific heads for efficiency  

Deep Learning = Student who learns concepts from examples (automatic understanding)  
Classical ML  = Student who uses fixed formulas (must be told what features matter)
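
A sketch of hybrid approach 1 above: a frozen pretrained backbone supplies deep features and a classical SVM performs classification. The torchvision weights identifier and the placeholder tensors are assumptions and may differ across library versions.

```python
import torch
import torchvision.models as models
from sklearn.svm import SVC

# Frozen pretrained backbone as a feature extractor (downloads weights on first
# use; older torchvision versions use pretrained=True instead of weights=...).
backbone = models.resnet18(weights="IMAGENET1K_V1")
backbone.fc = torch.nn.Identity()                 # drop the classification head
backbone.eval()

@torch.no_grad()
def extract_features(images: torch.Tensor) -> torch.Tensor:
    """images: (N, 3, 224, 224) normalized tensor -> (N, 512) deep features."""
    return backbone(images)

# Placeholder tensors standing in for a real labeled image dataset.
images = torch.randn(32, 3, 224, 224)
labels = torch.randint(0, 2, (32,)).numpy()

features = extract_features(images).numpy()
clf = SVC(kernel="rbf").fit(features, labels)     # classical model on deep features
print(clf.score(features, labels))
```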


Generalization Ability

Unsupervised World                              Zero-Shot World
═══════════════════════════════════             ════════════════════════════════════
No Labels → Discover Patterns →                 Pretrained Knowledge → New Task →
Cluster / Reduce Dim → Build Representations    Direct Prediction → Works Instantly
     ↓                     ↓                           ↓                     ↓
┌───────────────┐   ┌─────────────────┐           ┌─────────────────┐   ┌───────────────────────┐
│ Data Patterns │ → │ Learned Embeds  │    vs.    │ Language / Text │ → │ Recognize Unseen Task │
│ (Raw Inputs)  │   │ (Structure Only)│           │ Semantic Priors │   │ (Zero Examples)       │
└───────────────┘   └─────────────────┘           └─────────────────┘   └───────────────────────┘
     ↓                     ↓                           ↓                     ↓
Unclear for Tasks      Needs Extra Step            Direct Generalization   Immediate Usability
(PCA/K-means/SimCLR)   (Downstream Fine-tune)      (CLIP, GPT)             (Zero-shot QA/CLS)

Hybrid approaches:
1. Learn unsupervised embeddings β†’ map to semantic space for zero-shot transfer
2. Combine raw pattern discovery with pretrained knowledge for stronger generalization

Unsupervised = Tourist wandering a city with no map (discover zones by yourself)  
Zero-Shot   = Tourist with a guidebook (instantly spot city hall & cathedral)
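
A sketch of the zero-shot idea on the right (CLIP-style): classify by cosine similarity between an image embedding and text embeddings of the class names. The encoders below are random placeholders standing in for real pretrained models.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_image(image) -> np.ndarray:        # placeholder for a pretrained image encoder
    return rng.normal(size=512)

def encode_text(prompt: str) -> np.ndarray:   # placeholder for a pretrained text encoder
    return rng.normal(size=512)

def zero_shot_classify(image, class_names):
    """Pick the class whose text embedding is most similar to the image embedding."""
    img = encode_image(image)
    texts = np.stack([encode_text(f"a photo of a {c}") for c in class_names])
    sims = texts @ img / (np.linalg.norm(texts, axis=1) * np.linalg.norm(img))
    return class_names[int(np.argmax(sims))]

print(zero_shot_classify(image=None, class_names=["city hall", "cathedral", "park"]))
```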


Why Deep Structure

  • Compared with classical machine learning models (Linear Regression / SVM / Shallow Decision Trees), deep structures refer to neural networks with Multiple Layers of Nonlinear Transformations.
  • How can we find other temporal modeling approaches for the nonlinear transformation?

Content

  • Transformer
  • Mamba
  • GPT
  • Tokenization
  • ARIMA
  • RNN, LSTM, GRU
  • Diffusion Models
  • Flow Matching
  • Quantization / Adapter Guided - LoRA + QLoRA


  • These deep models are capable of Learning Hierarchical Features, where each layer captures increasingly abstract representations of the data.


Local Minimum vs. Saddle Point


Knowledge Map


In practice, β€œDeep” means:

  • More than 3–4 layers in fully connected networks
  • 10+ layers in convolutional networks
  • Or even hundreds of layers in modern transformers like GPT


Key Structures

  • MLP
    • Multilayer Perceptron
    • Feedforward fully connected networks
    • Used in classification, regression, or small-scale tabular/audio tasks
    • 1989 - Universal Approximation Theorem / Still used as light head in multimodal systems
  • RNN -> LSTM
  • Some Other Temporal Modeling
    • GRU
    • ConvGRU
    • DynamicLSTM
    • GatedGRU
  • CNN
    • Convolutional Neural Networks
    • When inputs are images or grid-like data
    • Extracts spatial features, widely used in image/audio tasks
    • Fully Connected Layer -> Receptive Field -> Parameter Sharing -> Convolutional Layer
    • 1998 - LeNet / 2012 - AlexNet: ImageNet Classification with Deep Convolutional Neural Networks
  • Transformer
  • Mamba
  • Conformer
  • GAN
  • Diffusion Based
    • Gradual denoising process to generate samples from noise
    • Currently SoTA in image and speech generation
    • Training is stable, generation is slow
    • In Diffusion
      • The model learns to reverse noise through a pre-defined noise schedule
      • It does not evaluate or penalize each intermediate step
      • There is no β€œfitness score” like in genetic algorithms
    • In genetic algorithms (GA)
      • Every candidate (individual) is evaluated using a fitness function
      • Poor candidates are penalized or discarded
    • 2020 - Denoising Diffusion Probabilistic Models
    • A minimal training-step sketch is shown after this list
  • SSL (Self-Supervised Learning)
    • Learns from unlabeled data by solving pretext tasks
    • Strong performance in low-resource and zero-shot setups
  • Memory - Transformers vs. RNN / LSTM
    • Add Reflection - 2024 - You Only Cache Once: Decoder-Decoder Architectures for Language Models
  • Flow Matching
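
As referenced under Diffusion Based above, here is a minimal DDPM-style training-step sketch under assumed hyperparameters: a fixed noise schedule corrupts the clean sample and the model is trained to predict the added noise.

```python
import torch

# Fixed (pre-defined) noise schedule; T and the beta range are illustrative.
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)    # cumulative product \bar{alpha}_t

def diffusion_loss(model, x0):
    """One DDPM-style training step: noise x0, then predict that noise."""
    t = torch.randint(0, T, (x0.shape[0],))
    noise = torch.randn_like(x0)
    a = alphas_bar[t].view(-1, *([1] * (x0.dim() - 1)))
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * noise  # forward (noising) process
    return torch.nn.functional.mse_loss(model(x_t, t), noise)

# Usage with a dummy "model" that ignores t, just to show the shapes involved.
dummy_model = lambda x_t, t: torch.zeros_like(x_t)
print(diffusion_loss(dummy_model, torch.randn(4, 3, 8, 8)))
```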


1. Gradient Noise

Knowledge Map


2. What is Gradient Noise

| Source | Explanation |
|---|---|
| Sampling noise | Each batch samples only part of the data, so the gradient is an approximation of the true mean. |
| Reward noise (RL-specific) | Rewards from the environment vary greatly across trajectories. |
| Numerical noise (hardware) | Floating-point rounding errors, limited bfloat16 precision, or non-deterministic accumulation order. |
| Communication noise (multi-GPU) | Random ordering of all-reduce operations causes slight variations in summed gradients. |
| Regularization noise | Dropout and mixed-precision scaling introduce artificial randomness. |


\[\nabla L(\theta) = \frac{\partial L}{\partial \theta}\]
  • Compute the exact gradient of the loss function (ideal case)
\[\tilde{\nabla} L(\theta) = \nabla L(\theta) + \varepsilon\]
  • Represent the noisy gradient observed in practice, with noise term $\varepsilon$
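
A small NumPy sketch of the sampling-noise component, assuming a toy linear-regression loss: mini-batch gradients scatter around the full-batch gradient, and their difference plays the role of $\varepsilon$.

```python
import numpy as np

# Sketch of sampling noise: mini-batch gradients scatter around the full-batch
# ("true") gradient. Toy model: linear regression with loss L = mean((Xw - y)^2).
rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 5))
w_true = rng.normal(size=5)
y = X @ w_true + 0.1 * rng.normal(size=10_000)
w = np.zeros(5)                                   # current parameters theta

def grad(Xb, yb, w):
    """Gradient of the mean-squared-error loss on a (mini-)batch."""
    return 2 * Xb.T @ (Xb @ w - yb) / len(yb)

full = grad(X, y, w)                              # exact gradient (ideal case)
batch_grads = []
for _ in range(500):
    idx = rng.choice(len(X), size=32, replace=False)
    batch_grads.append(grad(X[idx], y[idx], w))   # noisy gradient observed in practice
noise = np.stack(batch_grads) - full              # epsilon for each mini-batch

print("mean |epsilon|  :", np.linalg.norm(noise, axis=1).mean())
print("|full gradient| :", np.linalg.norm(full))
```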


3. Why Gradient Noise is Especially Large in RL

\[J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)]\]
  • Expected future reward objective in reinforcement learning
\[\nabla_\theta J(\theta) = \mathbb{E}_{\tau} \left[ \nabla_\theta \log \pi_\theta(a_t \mid s_t) \cdot R(\tau) \right]\]
  • Policy gradient estimating how parameters affect expected reward
\[Var(\nabla_\theta J) = Var\left(R \cdot \nabla_\theta \log \pi_\theta \right)\]
  • High variance of rewards and log-probabilities amplifies gradient noise
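
A toy REINFORCE sketch (two-armed bandit with a softmax policy; all numbers are illustrative) showing how the variance of the single-sample policy gradient grows with the spread of rewards.

```python
import numpy as np

# Score-function estimator g = grad log pi(a) * R: unbiased, but its variance
# scales with the spread of the rewards R.
rng = np.random.default_rng(0)
theta = np.array([0.0, 0.0])                       # logits of a 2-armed softmax policy

def sample_grad(reward_noise_std):
    probs = np.exp(theta) / np.exp(theta).sum()    # pi_theta(a)
    a = rng.choice(2, p=probs)
    score = np.eye(2)[a] - probs                   # grad_theta log pi_theta(a)
    reward = (1.0 if a == 0 else 0.5) + rng.normal(0, reward_noise_std)
    return score * reward                          # single-sample policy gradient

for noise in (0.0, 1.0, 5.0):
    grads = np.stack([sample_grad(noise) for _ in range(5000)])
    print(f"reward noise {noise}: gradient variance ~ {grads.var(axis=0).sum():.3f}")
```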


4. Learning Rates - Theoretically


5. Historical Development of Kernel Functions

| Year | Person / School | Contribution |
|---|---|---|
| 1909 | James Mercer | Proposed Mercer's Theorem, the mathematical relationship between symmetric positive-definite kernel functions and high-dimensional inner-product spaces. |
| 1930–1950 | Integral-equation school | The term kernel originally referred to the weighting function $K(x, y)$ of an integral operator. |
| 1960s | Vapnik & Chervonenkis (USSR) | Developed statistical learning theory and introduced the idea of implicit feature mapping. |
| 1992–1995 | Vapnik, Boser, Guyon, Cortes | Formally applied kernel functions in Support Vector Machines (SVMs), using the kernel trick to avoid explicit high-dimensional mapping. |


6. Common Kernel Functions

| Kernel Name | Formula | Feature |
|---|---|---|
| Linear kernel | $K(x, y) = x^{T}y$ | Original linear inner product |
| Polynomial kernel | $K(x, y) = (x^{T}y + c)^{d}$ | Polynomial nonlinear mapping |
| RBF / Gaussian kernel | $K(x, y) = \exp\!\big(-\|x - y\|^{2} / (2\sigma^{2})\big)$ | Infinite-dimensional mapping; most commonly used |
| Sigmoid kernel | $K(x, y) = \tanh(\alpha\, x^{T}y + c)$ | Similar to a neural-network activation function |
| Laplacian / exponential kernel | $K(x, y) = \exp\!\big(-\|x - y\| / \sigma\big)$ | More sensitive to sparse features |
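
A short sketch of the kernel trick under assumed parameter values: a kernel evaluates an inner product in a high-dimensional feature space without building the mapping, and for a degree-2 polynomial kernel the explicit feature map can be checked directly.

```python
import numpy as np

def rbf_kernel(x, y, sigma=1.0):
    """K(x, y) = exp(-||x - y||^2 / (2 sigma^2)); sigma is an illustrative bandwidth."""
    return np.exp(-np.linalg.norm(x - y) ** 2 / (2 * sigma ** 2))

def polynomial_kernel(x, y, c=1.0, d=3):
    """K(x, y) = (x^T y + c)^d."""
    return (x @ y + c) ** d

x, y = np.array([1.0, 2.0]), np.array([2.0, 0.5])
print(rbf_kernel(x, y), polynomial_kernel(x, y))

# Kernel trick check: the degree-2 polynomial kernel with c = 0 equals the inner
# product of the explicit feature maps phi(v) = (v1^2, sqrt(2) v1 v2, v2^2).
phi = lambda v: np.array([v[0] ** 2, np.sqrt(2) * v[0] * v[1], v[1] ** 2])
print((x @ y) ** 2, phi(x) @ phi(y))  # both print 9.0
```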





References





