- Let's take a look at the history of the Model Structures we're using today.
Evolution of 3D Scene Representations
| Period | Method | Representation | Advantages / Limitations |
| --- | --- | --- | --- |
| 1980–2010s | SfM / MVS / Mesh | Explicit point clouds or polygonal meshes | Accurate but discrete; cannot represent complex appearance or soft surfaces |
| 1990–2010s | IBR / Light Field | Sampled light rays or image interpolation | Highly photorealistic but lacks true 3D geometry; strong view dependence |
| 2015–2019 | Deep Implicit Fields | Implicit functions (Occupancy / Signed Distance Field) | Continuous and smooth geometry; no explicit color or reflectance modeling |
| 2020–2022 | NeRF family | Neural radiance fields (density + color) | Unified geometry and appearance; high fidelity but slow to train and render |
| 2023–Now | 3D Gaussian Splatting (3DGS) | Explicit point-based volumetric primitives (Gaussian ellipsoids with color, opacity, and anisotropy) | Extremely fast rendering and editing; preserves view consistency; but lacks strong geometry regularization and semantic understanding |
Core Continuous Modeling Methods in Computational Geometry
| Method | Core Idea | Advantages | Applicability to Your Scenario |
| --- | --- | --- | --- |
| Parametric Splines / NURBS / B-Spline / T-Spline | Define surfaces or volumes using parameters $(u, v)$ (or higher-dimensional equivalents). Geometry is represented via control points and basis functions (e.g., B-Splines). | High geometric precision and controllability; widely used in industrial CAD. | Works well for static, parameterizable objects, but limited for dynamic 4D or large-scale scenes. |
| Subdivision Surfaces | Generate smooth limit surfaces by repeatedly subdividing a coarse mesh (e.g., Catmull–Clark subdivision). | Provides mathematically well-defined limit surfaces; handles certain topological changes. | Effective for deformable shapes such as humans or cloth, but less flexible for time-varying or large-scale 4D tracking. |
| Implicit Continuous Fields (Neural or Analytical) | Represent geometry implicitly as a function $f(x, y, z) = 0$ or $f(x, y, z, t) = 0$, such as DeepSDF (Signed Distance Function) or OctField (adaptive implicit representation). | Extremely flexible; topology-free; can model dynamics; suitable for learning-based frameworks. | Highly promising for 4D reconstruction and tracking, especially when aiming for generalization across scenes. |
| Partial Differential Equations (PDEs) and Variational Methods | Enforce continuity (e.g., $C^2$ smoothness) of surfaces or volumes via PDEs, or use variational formulations to model continuous media. | Strong geometric continuity and clear mathematical properties. | Valuable when physical constraints or deformation laws are known, but computationally expensive for large-scale neural or self-supervised learning setups. |
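As a toy illustration of the implicit-field row, the sketch below evaluates an analytic signed distance function $f(x, y, z) = \lVert p - c\rVert - r$ on a grid and picks out points near its zero level set; the sphere, grid resolution, and threshold are arbitrary assumptions, not taken from any specific method. A time-varying field $f(x, y, z, t) = 0$ would simply add $t$ as an extra input to the same function.

```python
import numpy as np

def sphere_sdf(points, center=(0.0, 0.0, 0.0), radius=1.0):
    """Signed distance to a sphere: negative inside, zero on the surface, positive outside."""
    return np.linalg.norm(points - np.asarray(center), axis=-1) - radius

# Sample a small 3D grid and locate points near the implicit surface f(x, y, z) = 0.
axis = np.linspace(-1.5, 1.5, 32)
grid = np.stack(np.meshgrid(axis, axis, axis, indexing="ij"), axis=-1)   # (32, 32, 32, 3)
sdf = sphere_sdf(grid.reshape(-1, 3)).reshape(32, 32, 32)

surface_mask = np.abs(sdf) < 0.05            # thin shell around the zero level set
print("points near surface:", surface_mask.sum())
```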
Deep Learning after Classical ML
| Component | Description |
| --- | --- |
| Origin | Modern deep learning traces back to 1986, when Rumelhart, Hinton, and Williams popularized Backpropagation for training multi-layer networks. |
| Key Idea | Learn hierarchical representations, from low-level edges to high-level concepts, through multiple neural layers. |
| Representation | Automatically extracts features from raw data (images, audio, text) instead of relying on manual feature engineering. |
| Optimization | Trains large neural networks using gradient descent and backpropagation to minimize a defined loss. |
| Architecture | Stacks multiple nonlinear transformations (e.g., CNNs, RNNs, Transformers) to form deep computational graphs. |
| Generalization | Learns robust patterns that transfer to unseen data, aided by large datasets, GPUs, and regularization methods. |
| Impact / Use | Powers modern AI systems in vision (CNNs), language (Transformers), speech (RNNs), and generative models (Diffusion, GANs). |
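To make the Optimization row concrete, here is a minimal NumPy sketch of backpropagation plus gradient descent on a tiny two-layer network; the toy data, layer sizes, and learning rate are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 3))                           # toy inputs
y = (X.sum(axis=1, keepdims=True) > 0).astype(float)   # toy binary targets

W1, b1 = rng.normal(scale=0.5, size=(3, 8)), np.zeros(8)
W2, b2 = rng.normal(scale=0.5, size=(8, 1)), np.zeros(1)
lr = 0.1

for step in range(200):
    # Forward pass: two nonlinear layers ending in a sigmoid.
    h = np.tanh(X @ W1 + b1)
    p = 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))
    loss = -np.mean(y * np.log(p + 1e-9) + (1 - y) * np.log(1 - p + 1e-9))

    # Backward pass: chain rule from the loss back to each parameter.
    dlogits = (p - y) / len(X)                         # d(loss)/d(pre-sigmoid logits)
    dW2, db2 = h.T @ dlogits, dlogits.sum(axis=0)
    dh = dlogits @ W2.T * (1 - h**2)                   # tanh'(z) = 1 - tanh(z)^2
    dW1, db1 = X.T @ dh, dh.sum(axis=0)

    # Gradient descent update.
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

print("final loss:", round(loss, 4))
```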
Deep Learning, Training, and Knowledge Distillation
| Dimension | Deep Learning | Training | Knowledge Distillation |
| --- | --- | --- | --- |
| Objective | Learn a multi-layer nonlinear function $f(x; \theta)$ to represent complex patterns. | Optimize a loss function from data. | Make the student model mimic the teacher's output distribution and internal representations. |
| Input Information | Raw data $x$ | $(x, y)$ | $(x, y, T(x))$ |
| Loss Function | Any differentiable objective. | Task loss $\ell(f(x), y)$ | $\alpha\,\ell(f(x), y) + (1-\alpha)\,\mathrm{KL}\big(f(x) \parallel T(x)\big)$ |
| Supervision Source | Data itself. | Hard labels $y$. | Teacher outputs $T(x)$ + true labels $y$. |
| Entropy Characteristic | May be high or low depending on the task. | Low-entropy one-hot supervision. | High-entropy soft targets (smoothed teacher outputs). |
| Optimization Process | BP + GD (Backpropagation + Gradient Descent). | BP + GD. | BP + GD with temperature scaling $\tau$. |
| Application Goal | General representation learning. | Task-specific model fitting. | Model compression, knowledge transfer, or performance enhancement. |
| Output Features | Deep hierarchical representations. | Task predictions. | Balanced task accuracy and teacher–student alignment. |
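A minimal PyTorch sketch of the distillation objective in the last column, $\alpha\,\ell(f(x), y) + (1-\alpha)\,\mathrm{KL}\big(f(x) \parallel T(x)\big)$ with temperature $\tau$; the tiny teacher/student networks and the $\alpha$, $\tau$ values are placeholder assumptions.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
teacher = torch.nn.Sequential(torch.nn.Linear(16, 64), torch.nn.ReLU(), torch.nn.Linear(64, 10))
student = torch.nn.Sequential(torch.nn.Linear(16, 32), torch.nn.ReLU(), torch.nn.Linear(32, 10))

def distillation_loss(x, y, alpha=0.5, tau=2.0):
    with torch.no_grad():
        teacher_logits = teacher(x)                        # T(x): frozen soft targets
    student_logits = student(x)                            # f(x)
    hard = F.cross_entropy(student_logits, y)              # task loss against true labels
    soft = F.kl_div(                                       # KL between temperature-softened distributions
        F.log_softmax(student_logits / tau, dim=-1),
        F.softmax(teacher_logits / tau, dim=-1),
        reduction="batchmean",
    ) * tau**2                                             # usual tau^2 factor keeps gradient magnitudes comparable
    return alpha * hard + (1 - alpha) * soft

x = torch.randn(8, 16)
y = torch.randint(0, 10, (8,))
loss = distillation_loss(x, y)
loss.backward()                                            # BP + GD as in the table; optimizer step omitted
print(float(loss))
```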
```
           Deep Learning World                                Classical ML World
───────────────────────────────────────────        ─────────────────────────────────────────────
Raw Data → Multi-layer Network →                    Handcrafted Features → Shallow Model →
Learn Representations → Optimize by Gradient        Manual Design → Limited Adaptability
        │                     │                              │                      │
┌────────────────┐    ┌──────────────────┐      ┌──────────────────┐    ┌───────────────────────┐
│ Raw Inputs     │ →  │ Neural Layers    │ vs.  │ Engineered Feats │ →  │ Classifier (SVM/Tree) │
│ (Image/Text)   │    │ (CNN/RNN/Trans.) │      │ (HOG/SIFT/MFCC)  │    │ (Fixed Decision Rules)│
└────────────────┘    └──────────────────┘      └──────────────────┘    └───────────────────────┘
        │                     │                              │                      │
End-to-End Learning    Automatic Feature Hierarchy    Manual Tuning Needed    Poor Transferability
(Backprop + Gradient)  (Low→Mid→High Abstractions)    (Domain-Specific)       (Retrain for New Task)
```
Hybrid approaches:
1. Use pretrained deep features + classical models for fast adaptation (see the sketch below)
2. Fine-tune deep backbones with task-specific heads for efficiency
Deep Learning = Student who learns concepts from examples (automatic understanding)
Classical ML = Student who uses fixed formulas (must be told what features matter)
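Hybrid approach 1 above can be sketched as follows: freeze a deep backbone, extract features once, and fit a classical scikit-learn classifier on top. The randomly initialized backbone and random labels are stand-ins; in practice the backbone would be a pretrained network (e.g., a torchvision ResNet with its classification head removed).

```python
import numpy as np
import torch
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split

torch.manual_seed(0)

# Stand-in "frozen backbone": in practice this would be a pretrained CNN.
backbone = torch.nn.Sequential(
    torch.nn.Conv2d(3, 16, kernel_size=3, stride=2), torch.nn.ReLU(),
    torch.nn.AdaptiveAvgPool2d(1), torch.nn.Flatten(),
).eval()

images = torch.randn(200, 3, 32, 32)                       # placeholder dataset
labels = np.random.default_rng(0).integers(0, 2, size=200) # placeholder labels

with torch.no_grad():                                      # deep features, no fine-tuning
    feats = backbone(images).numpy()

# Classical model on top of deep features: fast to train, easy to swap.
X_tr, X_te, y_tr, y_te = train_test_split(feats, labels, random_state=0)
clf = LinearSVC().fit(X_tr, y_tr)
print("held-out accuracy:", clf.score(X_te, y_te))
```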
Generalization Ability
```
           Unsupervised World                                  Zero-Shot World
───────────────────────────────────────────        ─────────────────────────────────────────────
No Labels → Discover Patterns →                     Pretrained Knowledge → New Task →
Cluster / Reduce Dim → Build Representations        Direct Prediction → Works Instantly
        │                     │                              │                      │
┌────────────────┐    ┌──────────────────┐      ┌──────────────────┐    ┌───────────────────────┐
│ Data Patterns  │ →  │ Learned Embeds   │ vs.  │ Language / Text  │ →  │ Recognize Unseen Task │
│ (Raw Inputs)   │    │ (Structure Only) │      │ Semantic Priors  │    │ (Zero Examples)       │
└────────────────┘    └──────────────────┘      └──────────────────┘    └───────────────────────┘
        │                     │                              │                      │
Unclear for Tasks      Needs Extra Step              Direct Generalization    Immediate Usability
(PCA/K-means/SimCLR)   (Downstream Fine-tune)        (CLIP, GPT)              (Zero-shot QA/CLS)
```
Hybrid approaches:
1. Learn unsupervised embeddings → map to semantic space for zero-shot transfer (see the sketch below)
2. Combine raw pattern discovery with pretrained knowledge for stronger generalization
Unsupervised = Tourist wandering a city with no map (discover zones by yourself)
Zero-Shot = Tourist with a guidebook (instantly spot city hall & cathedral)
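A minimal sketch of the zero-shot side: classify by cosine similarity between an input embedding and class/text embeddings in a shared semantic space. The embeddings below are random placeholders; a real system would take them from a pretrained vision-language model such as CLIP.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 64

# Placeholder embeddings; a real system would obtain these from pretrained
# image/text encoders (e.g. CLIP).
class_names = ["city hall", "cathedral", "park"]
text_embeds = rng.normal(size=(len(class_names), dim))
image_embed = text_embeds[1] + 0.3 * rng.normal(size=dim)   # an image "near" the cathedral prompt

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

scores = [cosine(image_embed, t) for t in text_embeds]
print("zero-shot prediction:", class_names[int(np.argmax(scores))])  # no labeled examples used
```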
Why Deep Structure
- Compared with classical machine learning models:
- Linear Regression / SVM / Shallow Decision Trees
- Deep structures refer to neural networks with Multiple Layers of Nonlinear Transformations
- How can we find other temporal modeling approaches for the Non-linear Transformation?
Content
- Transformer
- Mamba
- GPT
- Tokenization
- ARIMA
- RNN, LSTM, GRU
- Diffusion Models
- Flow Matching
- Quantization / Adapter-Guided Fine-Tuning - LoRA + QLoRA
- These deep models are capable of Learning Hierarchical Features, where each layer captures increasingly abstract representations of the data.
Local Minima vs. Saddle Points
In practice, βDeepβ means:
- More than 3β4 layers in fully connected networks
- 10+ layers in convolutional networks
- Or even hundreds of layers in modern transformers like GPT
Key Structures
- MLP
- Multilayer Perceptron
- Feedforward fully connected networks
- Used in classification, regression, or small-scale tabular/audio tasks
- 1989 - Universal Approximation Theorem / Still used as a lightweight head in multimodal systems
- RNN -> LSTM
- Some Other Temporal Modeling
- GRU
- ConvGRU
- DynamicLSTM
- GatedGRU
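A minimal PyTorch sketch of recurrent temporal modeling with the built-in LSTM and GRU modules; batch size, sequence length, and feature sizes are arbitrary.

```python
import torch

torch.manual_seed(0)
x = torch.randn(8, 20, 32)                      # (batch, time steps, features)

lstm = torch.nn.LSTM(input_size=32, hidden_size=64, batch_first=True)
gru = torch.nn.GRU(input_size=32, hidden_size=64, batch_first=True)

lstm_out, (h_n, c_n) = lstm(x)                  # LSTM keeps a hidden state and a cell state
gru_out, h_gru = gru(x)                         # GRU keeps only a hidden state (fewer gates/params)

print(lstm_out.shape, gru_out.shape)            # both (8, 20, 64): one hidden vector per time step
```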
- CNN
- Convolutional Neural Networks
- When inputs are images or grid-like data
- Extracts spatial features, widely used in image/audio tasks
- Fully Connected Layer -> Receptive Field -> Parameter Sharing -> Convolutional Layer
- 1998 - LeNet / 2012 - AlexNet: ImageNet Classification with Deep Convolutional Neural Networks
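A small sketch of the Fully Connected Layer -> Receptive Field -> Parameter Sharing -> Convolutional Layer chain: on the same 3×32×32 input (an arbitrary example size), a convolution reuses one small kernel across all spatial positions, so it needs orders of magnitude fewer parameters than a dense layer producing the same number of output values.

```python
import torch

def n_params(m):
    return sum(p.numel() for p in m.parameters())

# Dense layer mapping a flattened 3x32x32 image to 16 feature maps of size 32x32.
fc = torch.nn.Linear(3 * 32 * 32, 16 * 32 * 32)

# Convolution producing the same 16 output channels with a shared 3x3 kernel
# (local receptive field + parameter sharing).
conv = torch.nn.Conv2d(3, 16, kernel_size=3, padding=1)

print("fully connected params:", n_params(fc))    # ~50 million
print("convolutional params:  ", n_params(conv))  # 3*16*9 + 16 = 448
```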
- Transformer
- Mamba
- Conformer
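For the Transformer entry, a minimal sketch of scaled dot-product self-attention, the core operation the architecture stacks (single head, no masking, toy tensor sizes):

```python
import math
import torch

torch.manual_seed(0)
x = torch.randn(2, 5, 16)                            # (batch, tokens, model dim)

d_k = 16
w_q, w_k, w_v = (torch.nn.Linear(16, d_k) for _ in range(3))

q, k, v = w_q(x), w_k(x), w_v(x)
scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)    # pairwise token similarities
attn = scores.softmax(dim=-1)                        # each token attends to all tokens
out = attn @ v                                       # weighted mix of value vectors

print(out.shape)                                     # (2, 5, 16): same length, contextualized features
```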
- GAN
- Diffusion Based
- Gradual denoising process to generate samples from noise
- Currently SoTA in image and speech generation
- Training is stable, generation is slow
- In Diffusion
- The model learns to reverse noise through a pre-defined noise schedule
- It does not evaluate or penalize each intermediate step
- There is no βfitness scoreβ like in genetic algorithms
- In genetic algorithms - GA
- Every candidate (individual) is evaluated using a fitness function
- Poor candidates are penalized or discarded
- 2020 - Denoising Diffusion Probabilistic Models
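A minimal sketch of the DDPM-style training objective described above: noise a clean sample with a pre-defined schedule, then train a network to predict the added noise. The tiny MLP, 2D toy data, and linear beta schedule are simplifying assumptions; note there is no per-step fitness score, just a single regression loss.

```python
import torch

torch.manual_seed(0)
T = 100
betas = torch.linspace(1e-4, 0.02, T)                 # pre-defined noise schedule
alpha_bar = torch.cumprod(1.0 - betas, dim=0)         # cumulative signal-retention factor

eps_model = torch.nn.Sequential(torch.nn.Linear(2 + 1, 64), torch.nn.ReLU(), torch.nn.Linear(64, 2))
opt = torch.optim.Adam(eps_model.parameters(), lr=1e-3)

x0 = torch.randn(256, 2) * 0.5 + 1.0                  # toy "data" distribution

for step in range(200):
    t = torch.randint(0, T, (x0.size(0),))
    eps = torch.randn_like(x0)
    a = alpha_bar[t].unsqueeze(-1)
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * eps        # closed-form forward noising q(x_t | x_0)
    t_feat = (t.float() / T).unsqueeze(-1)            # crude timestep conditioning
    pred = eps_model(torch.cat([x_t, t_feat], dim=-1))
    loss = ((pred - eps) ** 2).mean()                 # predict the noise; no intermediate-step penalty
    opt.zero_grad(); loss.backward(); opt.step()

print("final denoising loss:", float(loss))
```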
- SSL (Self-Supervised Learning)
- Learns from unlabeled data by solving pretext tasks
- Strong performance in low-resource and zero-shot setups
- Memory - Transformers vs. RNN / LSTM
- Add Reflection - 2024 - You Only Cache Once: Decoder-Decoder Architectures for Language Models
- Flow Matching
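Flow Matching, in its simplest conditional / rectified-flow form, regresses a velocity field along a straight path between noise and data instead of reversing a discrete noise schedule. The sketch below uses a toy 2D distribution and a small MLP, both placeholder assumptions.

```python
import torch

torch.manual_seed(0)
v_model = torch.nn.Sequential(torch.nn.Linear(2 + 1, 64), torch.nn.ReLU(), torch.nn.Linear(64, 2))
opt = torch.optim.Adam(v_model.parameters(), lr=1e-3)

x1 = torch.randn(256, 2) * 0.5 + 1.0                 # toy "data" samples

for step in range(200):
    x0 = torch.randn_like(x1)                        # noise endpoint
    t = torch.rand(x1.size(0), 1)                    # random time in [0, 1]
    x_t = (1 - t) * x0 + t * x1                      # straight-line interpolation path
    target_v = x1 - x0                               # velocity of that path, constant in t
    pred_v = v_model(torch.cat([x_t, t], dim=-1))
    loss = ((pred_v - target_v) ** 2).mean()         # plain regression; no likelihoods or discrete noise steps
    opt.zero_grad(); loss.backward(); opt.step()

print("final flow-matching loss:", float(loss))
```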
1. Gradient Noise
2. What Is Gradient Noise?
| Source | Explanation |
| --- | --- |
| Sampling noise | Each batch only samples part of the data, so the gradient is an approximation of the true mean. |
| Reward noise (RL-specific) | Rewards from the environment vary greatly across trajectories. |
| Numerical noise (hardware) | Floating-point rounding errors, limited bfloat16 precision, or non-deterministic accumulation order. |
| Communication noise (multi-GPU) | Random order of all-reduce operations causes slight variations in summed gradients. |
| Regularization noise | Dropout and mixed-precision scaling introduce artificial randomness. |
\[\nabla L(\theta) = \frac{\partial L}{\partial \theta}\]
- Compute the exact gradient of the loss function (ideal case)
\[\tilde{\nabla} L(\theta) = \nabla L(\theta) + \varepsilon\]
- Represent the noisy gradient observed in practice (with noise term $\varepsilon$)
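A small NumPy sketch of the first noise source (sampling noise): minibatch gradients scatter around the full-data gradient, matching $\tilde{\nabla} L(\theta) = \nabla L(\theta) + \varepsilon$. The linear-regression loss and batch size are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 5))
w_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ w_true + 0.1 * rng.normal(size=len(X))

def grad(theta, Xb, yb):
    """Gradient of the mean squared error 0.5 * mean((X theta - y)^2)."""
    return Xb.T @ (Xb @ theta - yb) / len(Xb)

theta = np.zeros(5)
full_grad = grad(theta, X, y)                              # "true" gradient over all data

batch_grads = []
for _ in range(500):
    idx = rng.integers(0, len(X), size=32)                 # one minibatch
    batch_grads.append(grad(theta, X[idx], y[idx]))
noise = np.stack(batch_grads) - full_grad                  # epsilon = noisy gradient - true gradient

print("mean |epsilon| per coordinate:", np.abs(noise).mean(axis=0).round(3))
```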
3. Why Gradient Noise is Especially Large in RL
\[J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)]\]
- Expected future reward objective in reinforcement learning
\[\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \left[ \sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t) \cdot R(\tau) \right]\]
- Policy gradient estimating how parameters affect expected reward
\[\operatorname{Var}(\nabla_\theta J) = \operatorname{Var}\!\left(R \cdot \nabla_\theta \log \pi_\theta \right)\]
- High variance of rewards and log-probabilities amplifies gradient noise
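To see why this variance is large in practice, the sketch below draws many single-sample REINFORCE estimates $R \cdot \nabla_\theta \log \pi_\theta$ for a softmax policy on a two-armed bandit and measures their spread; the reward values and sample count are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = np.zeros(2)                                  # logits of a 2-armed softmax policy
rewards = np.array([1.0, 10.0])                      # arm 1 pays much more: wide reward spread

def policy(theta):
    e = np.exp(theta - theta.max())
    return e / e.sum()

def grad_log_pi(theta, a):
    onehot = np.eye(2)[a]
    return onehot - policy(theta)                    # d/dtheta of log softmax(theta)[a]

# Single-sample REINFORCE estimates: g = R(a) * grad log pi(a)
estimates = []
for _ in range(5_000):
    a = rng.choice(2, p=policy(theta))
    estimates.append(rewards[a] * grad_log_pi(theta, a))
estimates = np.stack(estimates)

print("mean gradient:", estimates.mean(axis=0).round(3))
print("per-coordinate variance:", estimates.var(axis=0).round(3))   # large relative to the mean
```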
4. Learning Rates - Theoretically
5. Historical Development of Kernel Functions
| Year | Person / School | Contribution |
| 1909 | James Mercer | Proposed Mercerβs Theorem β the mathematical relationship between symmetric positive-definite kernel functions and inner-product spaces of high dimension. |
| 1930 β 1950 | Integral-Equation School | The term kernel originally referred to the weighting function of an integral operator, $K(x, y)$. |
| 1960s | Vapnik & Chervonenkis (USSR) | Developed statistical learning theory and introduced the idea of implicit feature mapping. |
| 1992 β 1995 | Vapnik, Boser, Guyon, Cortes | Formally applied kernel functions in Support Vector Machines (SVM) using the kernel trick to avoid explicit high-dimensional mapping. |
6. Common Kernel Functions
| Kernel Name | Formula | Feature |
| Linear Kernel | $K(x, y) = x^{T}y$ | Original linear inner product |
| Polynomial Kernel | $K(x, y) = (x^{T}y + c)^{d}$ | Polynomial non-linear mapping |
| RBF / Gaussian Kernel | $K(x, y) = \exp!\big(-|x - y|^{2} / (2\sigma^{2})\big)$ | Infinite-dimensional mapping; most commonly used |
| Sigmoid Kernel | $K(x, y) = \tanh(\alpha\, x^{T}y + c)$ | Similar to a neural-network activation function |
| Laplacian / Exponential Kernel | $K(x, y) = \exp!\big(-|x - y| / \sigma\big)$ | More sensitive to sparse features |
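A small sketch that computes the RBF kernel from the table and feeds the resulting Gram matrix to scikit-learn's SVC via the kernel trick; the toy ring-shaped dataset and $\sigma$ value are arbitrary.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Toy data that is not linearly separable: inner cluster vs. outer ring.
X = rng.normal(size=(200, 2))
y = (np.linalg.norm(X, axis=1) > 1.0).astype(int)

def rbf_kernel(A, B, sigma=1.0):
    """K(x, y) = exp(-||x - y||^2 / (2 sigma^2)), computed for all pairs."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq_dists / (2 * sigma**2))

# Kernel trick: the SVM only ever sees the Gram matrix K,
# never an explicit high-dimensional feature map.
K = rbf_kernel(X, X)
clf = SVC(kernel="precomputed").fit(K, y)
print("training accuracy with RBF Gram matrix:", clf.score(K, y))
```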
References