2025 - Thesis - Deep Learning
Latent Space, UZH AI, AI Center
Diffusion Models
- How Diffusions Work
- Workflow with automated research-paper generation tools
- Model Structures
- 2025 - Video models are zero-shot Learners and Reasoners
- 2025 - Towards End-to-End Generative Modeling
- 2025 - Back to Basics: Let Denoising Generative Models Denoise
import torch

def compute_distillation_loss(s, t, ce_loss, kl_loss, lambda_kl, lambda_geo, eps=1e-7):
    # Riemannian geodesic alignment between normalized student (s) and teacher (t) hidden states
    cos_sim = (s * t).sum(dim=-1).clamp(-1 + eps, 1 - eps)
    geo_loss = torch.acos(cos_sim).mean()
    # Weighted multi-objective total: CE + KL distillation + geodesic alignment
    total_loss = ce_loss + lambda_kl * kl_loss + lambda_geo * geo_loss
    return total_loss, ce_loss.item(), kl_loss.item(), geo_loss.item()
- [1/3] Stabilizing Training in Latent Space
- [2/3] Training Loss with Different Training-Set Sizes
- [3/3] Codebase - 2025 - Pre-Training Pipeline
A neural network is essentially a differentiable function approximator. Its difference from traditional linear regression lies not in the objective function, but in the optimization method and the complexity of the functions it can represent.
Backpropagation
| Stage | Operation | Expression | Meaning |
|---|---|---|---|
| Forward Pass | Compute layer outputs | \(z^{(l)} = W^{(l)} a^{(l-1)} + b^{(l)}, \quad a^{(l)} = f(z^{(l)})\) | Obtain network predictions |
| Compute Loss | Compute error | \(L = \tfrac{1}{2}\|\hat{y} - y\|^2\) | Measure output error |
| Backward Pass | Backpropagate from output layer | \(\delta^{(L)} = (\hat{y} - y) \odot f'(z^{(L)})\) | Compute output-layer gradient |
| ย | Propagate to previous layers | \(\delta^{(l)} = (W^{(l+1)})^T \delta^{(l+1)} \odot f'(z^{(l)})\) | Compute hidden-layer gradients |
| Gradient Computation | Compute parameter gradients | \(\frac{\partial L}{\partial W^{(l)}} = \delta^{(l)} (a^{(l-1)})^T\) | Obtain weight gradients |
| Update | Update parameters | \(W^{(l)} \leftarrow W^{(l)} - \eta \frac{\partial L}{\partial W^{(l)}}\) | Optimize via gradient descent |
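The table above maps one-to-one onto a few lines of NumPy. Below is a minimal sketch of a single forward/backward pass for a two-layer sigmoid MLP with squared-error loss; the layer sizes, random data, and learning rate eta are illustrative assumptions, not values from the thesis.
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 1))            # input a^(0)
y = rng.standard_normal((2, 1))            # target
W1, b1 = rng.standard_normal((8, 4)), np.zeros((8, 1))
W2, b2 = rng.standard_normal((2, 8)), np.zeros((2, 1))
f = lambda z: 1.0 / (1.0 + np.exp(-z))     # sigmoid activation
df = lambda z: f(z) * (1.0 - f(z))         # its derivative

# Forward pass
z1 = W1 @ x + b1; a1 = f(z1)
z2 = W2 @ a1 + b2; y_hat = f(z2)
loss = 0.5 * np.sum((y_hat - y) ** 2)

# Backward pass (the deltas from the table)
d2 = (y_hat - y) * df(z2)                  # output-layer delta
d1 = (W2.T @ d2) * df(z1)                  # hidden-layer delta

# Gradients and one gradient-descent update
eta = 0.1
W2 -= eta * (d2 @ a1.T); b2 -= eta * d2
W1 -= eta * (d1 @ x.T);  b1 -= eta * d1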
Optimal Methods with Hash Values
| Problem | Original Complexity | Optimal Complexity | Optimal Method | Further Optimization |
|---|---|---|---|---|
| Check Anagram | O(n) | O(n) | Counter / Hash Map | Cannot Be Improved |
| Dictionary Anagram Lookup | O(M ร N log N) | O(M ร N) | Hash Value + Character Count Key | Significantly Optimizable |
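As a sketch of the "Hash Value + Character Count Key" row, the snippet below groups a word list by a 26-letter count tuple so each anagram lookup avoids the O(N log N) sort; the function and variable names are illustrative.
from collections import defaultdict

def count_key(word):
    counts = [0] * 26                      # letter histogram, assumes lowercase a-z
    for ch in word:
        counts[ord(ch) - ord('a')] += 1
    return tuple(counts)                   # hashable key, O(N) per word

def build_anagram_index(words):
    index = defaultdict(list)
    for w in words:
        index[count_key(w)].append(w)      # no sorting needed
    return index

index = build_anagram_index(["listen", "silent", "enlist", "google"])
print(index[count_key("tinsel")])          # ['listen', 'silent', 'enlist']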
Hash Map and Graph for Optimization
| Analogy | Hash Map in Data Structures | Dynamic Programming / Graph in Algorithms |
|---|---|---|
| Essence | Trade space for time → achieve O(1) lookup. | Trade state-graph computation for the optimal solution → typically O(N × M). |
| Advantage | Globally optimal method for key lookup. | Globally optimal framework for decision and optimization. |
| Limitation | Only applicable to keyโvalue lookup problems. | Only applicable to decomposable problems with optimal substructure. |
| Conclusion | The most efficient in the lookup domain. | The most general but not universal in the optimization/decision domain. |
Languages
| Dimension | Rust | Go (Golang) | C++ | Python |
|---|---|---|---|---|
| Essentially OOP | No (OOP-like, but primarily functional) | No (has OOP features, but fundamentally procedural and concurrent) | Yes (classic, strongly object-oriented) | Yes (dynamic, fully object-oriented) |
| Programming Paradigm | Multi-paradigm: Primarily functional + systems, supports OOP traits | Procedural + concurrent, limited OOP | Multi-paradigm: Strongly object-oriented + generic | Multi-paradigm: Object-oriented + scripting |
| Type System | Static, compiled | Static, compiled | Static, compiled | Dynamic, interpreted |
| Memory Management | No GC; uses ownership + borrow checker | Automatic GC | Manual (new/delete) or smart pointers | Automatic GC |
| Concurrency Model | Lock-free, type-safe ("fearless concurrency") | Goroutines + channels (CSP model) | Multithreading with manual locks | GIL limits true multithreading |
| Performance | Nearly equal to C++ | Close to C++, slightly slower (GC overhead) | Fastest native performance | Slowest (interpreted) |
| Safety | Compile-time memory safety; prevents data races | Memory-safe but not thread-safe | Very fast but error-prone (dangling pointers, overflows) | Safe but slow |
| Learning Curve | Steep (requires ownership understanding) | Easy (simple syntax) | Steep (complex syntax and templates) | Easiest (beginner-friendly) |
| Compile Speed | Slow | Fast | Slow (especially for large projects) | None (interpreted) |
| Ecosystem | Young but growing fast (systems, embedded, backend) | Mature (cloud, DevOps, microservices) | Broadest (systems, games, embedded) | Broadest (AI, data science, web) |
| Applications | System programming, secure backend, embedded, WebAssembly | Cloud-native systems, microservices, networking | OS, game engines, graphics | AI/ML, scripting, automation, data analysis |
| Philosophy | "Zero-cost abstraction" – safety + performance | "Pragmatic simplicity" – simplicity + efficiency | "Total control" – performance + flexibility | "Ease of use" – simplicity + rapid prototyping |
| Key Projects | Firefox, Tokio, AWS Firecracker | Docker, Kubernetes, Terraform | Unreal Engine, Chrome, TensorRT | PyTorch, TensorFlow, YouTube |
Latent Space Structure
| Space | Core Definition | Difference from Others | Application Domains |
|---|---|---|---|
| Hilbert Space | A complete inner product space where lengths, angles, and projections are well-defined | Serves as the foundational "perfect" geometric space; all others are generalizations or relaxations | Quantum mechanics, signal processing, optimization, machine learning |
| Banach Space | A complete normed vector space, not necessarily with an inner product | Has length but no angles | Non-Euclidean optimization, functional analysis |
| Riemannian Manifold | Each point has a local inner-product space (tangent space) | Locally Hilbert, globally curved | General relativity, geometric deep learning |
| Symplectic Space | Equipped with an area-preserving bilinear form | No distance, only conserved quantities | Classical mechanics, Hamiltonian systems |
| Topological Space | Defined only by neighborhood relationships, no metric required | No notion of length or angle | Generalized geometry, continuity, homotopy theory |
| Metric Space | A set with a defined distance function d(x, y) | Hilbert space is a special case | Clustering, manifold learning, distance-metric learning |
| Probability Space | A measurable space (Ω, F, P) defining random events | Describes the geometry of events | Probability theory, information geometry, Bayesian inference |
| Information Manifold | A Riemannian manifold on probability distributions | Uses Fisher information metric | Statistical inference, information geometry, variational inference |
| Kähler / Complex Space | Complex structure + symmetric geometry + metric | Conformal generalization of Hilbert space | Quantum geometry, string theory, complex optimization |
Algorithms
├── I. Data Structures
│     └── Stack, Queue, <HashMap>, LinkedList
│
├── II. Algorithmic Patterns
│     ├── Two Pointers
│     ├── Sliding Window
│     ├── Prefix Sum
│     ├── Monotonic Stack / Queue
│     └── Binary Search Patterns
│
├── III. Complex Algorithms
│     ├── <Dynamic Programming (DP)>
│     ├── <Graph Theory (DFS/BFS/Dijkstra)>
│     ├── Recursion / Backtracking
│     ├── Greedy Algorithms
│     └── Divide & Conquer
│
└── IV. Problem Integration
      ├── Hard composite problems
      └── Algorithm design questions
Diffusion, Stable Diffusion, Rectified Flow
| Dimension | Vanilla Diffusion Model (DDPM / DDIM) | Stable Diffusion (Latent Diffusion Model, LDM) | Rectified Flow (Flow Matching) |
|---|---|---|---|
| Start Distribution | Starts from pure Gaussian noise N(0, I) | Starts from latent-space noise (compressed through an encoder) | Starts from any distribution point (usually N(0, I), but customizable) |
| Generative Process | Multi-step denoising: reverses the noise diffusion process (x_{t−1} = f_θ(x_t, t)) | Multi-step denoising in latent space (computationally cheaper) (z_{t−1} = f_θ(z_t, t)) | Continuous one-step flow: learns an ODE (dx_t/dt = v_θ(x_t, t)) |
| Mathematical Formulation | Discrete Markov chain (reverse SDE) | Discrete SDE in latent space | Continuous ODE or flow field |
| Computational Complexity | Multi-step sampling (20–1000 steps) | Multi-step but faster in latent space (20–50 steps) | Single continuous integration step |
| Advantages | High generation quality; theoretically grounded | High resolution, lightweight, and controllable (supports text prompts) | Fast convergence, continuous generation, minimal mode collapse |
| Limitations | Slow sampling; many denoising steps required | Strong dependence on encoder design and latent structure | Sensitive training stability; harder conditional control |
| Representative Papers / Applications | DDPM (Ho et al., 2020); DDIM (Song et al., 2021) | LDM / Stable Diffusion (Rombach et al., CVPR 2022) | Flow Matching / Rectified Flow (Liu et al., ICLR 2023) |
Optimization
| Component / Technique | Description | Implementation |
|---|---|---|
| Optimizer | Gradient-based weight updates with decoupled weight decay to improve stability on large models. | AdamW optimizer with lr=2.6e-4 and default β=(0.9, 0.999); stable for transformer-like models. |
| Learning-Rate Schedule | Smooth cosine decay to avoid abrupt gradient shocks after warm-up. | get_cosine_schedule_with_warmup(opt, 1000, 10000) → warm-up = 1k steps, total = 10k steps. |
| Warm-Up Phase | Gradually increases learning rate and KL weight to prevent early divergence in distillation. | Linear warm-up for both learning rate and λ_KL (0 → 0.020584 during first 1000 steps). |
| Mixed-Precision Training | Uses half precision (torch.amp.autocast) to reduce GPU memory and improve throughput. | Forward/backward passes wrapped in autocast, scaled by GradScaler() for numerical stability. |
| Gradient Clipping | Prevents exploding gradients in long sequences. | torch.nn.utils.clip_grad_norm_(params, 1.0) each iteration. |
| Loss Function (Multi-Objective) | Balances semantic accuracy, distribution matching, and geometric alignment. | Total loss: L = L_CE + λ_KL · L_KL + λ_Geo · L_Geo, with λ_Geo = 0.969909 constant. |
| CE Loss | Supervised label alignment ensuring correct transcription semantics. | Cross-entropy between student predictions and true tokens. |
| KL Divergence | Soft-target distillation to transfer probability distributions from teacher logits. | F.kl_div(log_softmax(student/T), softmax(teacher/T)) · T², T = 2.0. |
| Riemannian Geodesic Loss | Aligns feature geometry on a curved manifold instead of flat Euclidean MSE. | Geodesic distance = acos(cos_sim) between normalized hidden states. |
| Model Architecture (Student) | Lightweight CNN + Transformer hybrid for speech sequence modeling. | Two 1-D Conv layers → 6 Transformer encoder blocks → linear output head. |
| Teacher Model | Provides target logits and hidden features for distillation. | Frozen Whisper-large-v2 (FP16) encoder-decoder model. |
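A minimal sketch of how the optimizer, warm-up cosine schedule, mixed precision, and gradient clipping from the table could be wired together, assuming a CUDA device and a placeholder model and loss; everything except the listed hyperparameters (lr=2.6e-4, 1k warm-up, 10k total steps, clip 1.0) is an assumption.
import torch
from transformers import get_cosine_schedule_with_warmup

model = torch.nn.Linear(10, 10).cuda()                       # placeholder model
opt = torch.optim.AdamW(model.parameters(), lr=2.6e-4, betas=(0.9, 0.999))
sched = get_cosine_schedule_with_warmup(opt, num_warmup_steps=1000,
                                        num_training_steps=10000)
scaler = torch.cuda.amp.GradScaler()

for step in range(100):                                      # shortened demo loop
    x = torch.randn(8, 10, device="cuda")
    with torch.autocast("cuda", dtype=torch.float16):        # mixed-precision forward
        loss = model(x).pow(2).mean()                        # placeholder loss
    opt.zero_grad()
    scaler.scale(loss).backward()
    scaler.unscale_(opt)                                     # so clipping sees true gradients
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    scaler.step(opt)
    scaler.update()
    sched.step()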
distil_run_cell2.7.2/
│
├── tb/                        ← TensorBoard log files
│     └── events.out.tfevents...
│
├── adapter_final/             ← Final trained student model
│     └── student_model.pt
│
├── checkpoint.pt              ← Intermediate checkpoint (used if training was interrupted)
├── training_history.json      ← Recorded training and validation loss curves
├── best_params.json           ← Best hyperparameter record (e.g., kl_weight, geo_weight)
└── training_config.json       ← Training configuration and setup details
Structure
Machine Learning Fundamentals
โ
โโโ Data โ Representation โ Optimization โ Generalization
โ โโ Focus: Data quality, bias mitigation, and representation learning
โ โโ Link to Gemini: multimodal data fusion (text, audio, vision, code)
โ โโ Goal: Learn unified latent spaces that enable reasoning across modalities
โ
โโโ Deep Learning (CNN / RNN / Transformer)
โ โโ Forward & backward propagation as differentiable computation graphs
โ โโ Initialization, normalization, regularization โ stability & convergence
โ โโ Loss design + learning rate scheduling โ control of optimization dynamics
โ โโ Transformer family as universal sequence learners (foundation for Gemini)
โ
โโโ Optimization & Geometry
โ โโ Gradient-based optimization viewed as navigating the loss landscape
โ โโ Flat vs. sharp minima โ generalization and robustness trade-offs
โ โโ Riemannian geometry in embedding space โ alignment on curved manifolds
โ โโ Connection: Geminiโs embedding consistency and representation curvature
โ
โโโ Model Compression & Distillation
โ โโ Knowledge transfer from large to small models (teacher โ student)
โ โโ Soft vs. hard labels โ probabilistic vs. symbolic supervision
โ โโ LoRA / Adapter-based fine-tuning โ parameter-efficient adaptation
โ โโ Trade-offs: accuracy โ latency โ memory footprint โ energy efficiency
โ โโ Relevance: LearnLM and Gemini use adapter-tuned submodels for learning tasks
โ
โโโ ML Engineering & Responsible AI
โโ Data pipelines, reproducibility, evaluation, and continuous integration
โโ Monitoring, checkpointing, scalable deployment on distributed accelerators
โโ Safety alignment and interpretability โ understanding model decisions
โโ Evaluation beyond accuracy: robustness, fairness, value alignment
โโ Ethical ML engineering: accountability and transparency in large systems
Time
| Big-O | Name | Typical Example |
|---|---|---|
| O(1) | Constant time | Accessing array element |
| O(log n) | Logarithmic time | Binary search |
| O(n) | Linear time | Single loop through array |
| O(n log n) | Linearithmic time | Merge sort, Quick sort |
| O(n²) | Quadratic time | Nested loops, Bubble sort |
| O(n³) | Cubic time | Triple nested loops |
| O(2ⁿ) | Exponential time | Subset / permutation generation |
| O(n!) | Factorial time | Traveling Salesman, N-Queens |
| O(bᵈ) | Branching search | DFS in state tree with branching b and depth d |
O(n): single loop → one pass → linear scan
O(n log n): divide → sort → merge → log n layers × n work each → merge / quick sort
O(n²): double loop → every pair compared → bubble / selection sort
Space
| Big-O | Name | Typical Example |
|---|---|---|
| O(1) | Constant space | In-place swap, variable assignment |
| O(log n) | Logarithmic space | Recursive binary search |
| O(n) | Linear space | Storing array, dynamic programming 1-D |
| O(n²) | Quadratic space | 2-D matrix, Floyd-Warshall DP |
| O(n³) | Cubic space | 3-D DP table |
| O(2ⁿ) | Exponential space | Memoization of all subsets |
Data Loader
| Stage | Code Section | Padding Applied | Explanation |
|---|---|---|---|
| ① Dataset structure check | os.walk() file scan | No | Only scans file names, counts, and sizes. |
| ② Load audio–text pairs | pairs = load_audio_text_pairs(DATA_DIR) | No | Generates file paths; no tensors involved. |
| ③ Build Dataset | dataset = LibriSpeechLocalDataset(pairs, processor) | Not yet | Each sample is returned separately, with no unified length. |
| ④ Build DataLoader | train_loader = DataLoader(...) | Yes (here) | Padding is applied when combining samples into a batch. |
| ⑤ Train model | for step, batch in enumerate(train_loader): | Already padded | Batch tensors have equal dimensions for training. |
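To make stage ④ concrete, here is a hedged sketch of a DataLoader collate_fn that applies padding only when a batch is assembled; the field names (input_features, labels) and the pad id are assumptions, not the thesis code.
import torch
from torch.nn.utils.rnn import pad_sequence

PAD_ID = -100                                            # ignored by cross-entropy

def collate_fn(samples):
    # Mel features are fixed-size per sample, so they can simply be stacked
    feats = torch.stack([s["input_features"] for s in samples])
    # Token labels vary in length and are padded to the longest sequence in the batch
    labels = pad_sequence([s["labels"] for s in samples],
                          batch_first=True, padding_value=PAD_ID)
    return {"input_features": feats, "labels": labels}

# train_loader = DataLoader(dataset, batch_size=8, shuffle=True,
#                           collate_fn=collate_fn, num_workers=4)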
Protocol and Ports
[You: MacBook] ←── [Encrypted Tunnel (AES)] ──→ [Mac Studio Server]
       │                                               │
  ssh user@ip_address                     sshd (daemon listening on port 22)
| Protocol | Port | Purpose |
|---|---|---|
| HTTP | 80 | Web traffic |
| HTTPS | 443 | Secure web traffic |
| FTP | 21 | File transfer |
| SSH | 22 | Secure remote shell |
| Function | Command Example | Description |
|---|---|---|
| Remote Login | ssh user@192.xxx.x.xx | Open a command-line session on a remote computer |
| File Transfer | scp file.txt user@host:/path/ | Securely copy a file to a remote machine |
| Port Forwarding | ssh -L 8080:localhost:80 user@host | Map a remote port to a local port through an encrypted tunnel |
| Passwordless Login | Public key authentication (~/.ssh/id_rsa.pub) | Automatically authenticate using key pairs |
| Automation Control | Use SSH to execute commands or sync data in bulk | Common in DevOps or HPC environments |
Optimizers
| Era | Optimizer | Year | Core Innovation | Key Equation / Concept | Limitation Solved | Remarks |
|---|---|---|---|---|---|---|
| Classical GD | Gradient Descent (GD) | 1951 | Update weights along the negative gradient direction | \(w_{t+1} = w_t - \eta \nabla L(w_t)\) | None (too simple) | Foundation of all optimizers |
| ย | Stochastic Gradient Descent (SGD) | 1983 | Uses random mini-batches to improve efficiency | \(\nabla L(w_t) \approx \frac{1}{\lvert B \rvert} \sum_{i \in B} \nabla L_i(w_t)\) | High variance and slow convergence | Enables online / large-scale learning |
| Momentum Era | SGD + Momentum | 1989 | Adds velocity term to accumulate past gradients | \(v_t = \beta v_{t-1} + (1-\beta)\nabla L(w_t), \quad w_{t+1} = w_t - \eta v_t\) | Oscillations in narrow valleys | Faster convergence, physics-inspired |
| ย | Nesterov Accelerated Gradient (NAG) | 1991 | Looks ahead using gradient of estimated future position | \(v_t = \beta v_{t-1} + (1-\beta)\nabla L(w_t - \eta \beta v_{t-1})\) | Overshooting in Momentum | Smoother convergence and stability |
| Adaptive Learning | Adagrad | 2011 | Per-parameter adaptive learning rate | \(G_t = \sum_{\tau=1}^{t} g_\tau^2, \quad \eta_{t,i} = \frac{\eta}{\sqrt{G_{t,i}}+\epsilon}\) | Manual learning rate tuning | Excellent for sparse features (NLP) |
| ย | RMSProp | 2012 | Exponentially weighted moving average of squared gradients | \(v_t = \rho v_{t-1} + (1-\rho)g_t^2, \quad w_{t+1} = w_t - \frac{\eta}{\sqrt{v_t+\epsilon}}g_t\) | Adagradโs decaying rate problem | Stable for non-stationary objectives |
| Modern Standard | Adam | 2014 | Combines Momentum and RMSProp | \(m_t = \beta_1 m_{t-1} + (1-\beta_1)g_t, \quad v_t = \beta_2 v_{t-1} + (1-\beta_2)g_t^2\) \(\hat{m}_t = \frac{m_t}{1-\beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1-\beta_2^t}, \quad w_{t+1} = w_t - \eta \frac{\hat{m}_t}{\sqrt{\hat{v}_t}+\epsilon}\) | Gradient noise and curvature imbalance | Default optimizer for most deep networks |
| ย | AdamW | 2017 | Decouples weight decay from gradient update | \(w_{t+1} = w_t(1-\eta\lambda) - \eta \frac{m_t}{\sqrt{v_t}+\epsilon}\) | L2 regularization bias in Adam | Default for Transformer / LLM training |
| Geometry-Aware & Large Batch | LARS (Layer-wise Adaptive Rate Scaling) | 2018 | Layer-wise adaptive learning rate | \(\eta_l = \eta \frac{|w_l|}{|g_l|+\epsilon}\) | Scale mismatch in large-batch training | Used in ResNet / ImageNet large-batch setups |
| ย | LAMB (Layer-wise Adaptive Moments) | 2019 | Extends LARS with Adam-style moments | \(r_t = \frac{|w_t|}{|\hat{m}_t / (\sqrt{\hat{v}_t}+\epsilon)|}, \quad w_{t+1}=w_t - \eta r_t \frac{\hat{m}_t}{\sqrt{\hat{v}_t}+\epsilon}\) | Poor scaling of Adam for huge batches | Core optimizer for BERT, GPT |
| Variance Rectification & Belief Models | RAdam | 2019 | Rectifies variance of adaptive learning rate | \(\eta_t = \eta \frac{\sqrt{(1-\beta_2^t)/(1-\beta_2)}}{\sqrt{v_t}+\epsilon}\) | Instability in early training | More robust warm-up-free Adam |
| ย | AdaBelief | 2020 | Tracks belief in gradient direction | \(v_t = \beta_2 v_{t-1} + (1-\beta_2)(g_t - m_t)^2\) | Gradient over-smoothing | Better generalization for small datasets |
| Second-Order & Natural Gradient | K-FAC (Kronecker-Factored Approximate Curvature) | 2015โ2023 | Approximates curvature via blockwise Kronecker products | \(F^{-1} \approx A^{-1} \otimes B^{-1}, \quad w_{t+1} = w_t - \eta F^{-1}\nabla L\) | Ignores curvature in SGD/Adam | Faster convergence, heavy memory use |
| ย | Shampoo | 2021 | Matrix preconditioning per layer | \(G_t = \sum_{\tau=1}^{t} g_\tau g_\tau^\top, \quad W_{t+1}=W_t - \eta G_t^{-1/2}\nabla L\) | Slow convergence on ill-conditioned loss | Improves conditioning for large models |
| Modern LLM Optimizers | Lion | 2023 | Momentum with sign-based updates | \(w_{t+1} = w_t - \eta \, \text{sign}(\beta_1 m_t + (1-\beta_1)g_t)\) | Over-adaptation of Adam | Efficient and strong generalization for LLMs |
| ย | Sophia | 2023 | Second-order curvature-aware optimizer | \(w_{t+1} = w_t - \eta \frac{g_t}{\sqrt{h_t+\epsilon}}, \quad h_t \approx \text{diag}(H_t)\) | Slow convergence in large-scale Adam | State-of-the-art for Transformer training |
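The AdamW row above can be checked with a small from-scratch NumPy implementation of the decoupled-weight-decay update; the β/ε/decay values below are the usual defaults, chosen here as illustrative assumptions.
import numpy as np

def adamw_step(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8, wd=0.01):
    m = b1 * m + (1 - b1) * g                    # first moment
    v = b2 * v + (1 - b2) * g * g                # second moment
    m_hat = m / (1 - b1 ** t)                    # bias correction
    v_hat = v / (1 - b2 ** t)
    w = w * (1 - lr * wd) - lr * m_hat / (np.sqrt(v_hat) + eps)   # decoupled decay
    return w, m, v

w, m, v = np.ones(3), np.zeros(3), np.zeros(3)
for t in range(1, 101):
    g = 2 * w                                    # gradient of ||w||^2
    w, m, v = adamw_step(w, g, m, v, t)
print(w)                                         # weights shrink toward zero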
Riemannian Projector, Geodesic Loss
import torch
from torch import nn
from torch.nn import functional as F

class RiemannianProjector(nn.Module):
    # Projects 768-d student hidden states onto the teacher's 1280-d unit hypersphere
    def __init__(self, in_dim=768, out_dim=1280):
        super().__init__()
        self.map = nn.Linear(in_dim, out_dim)
    def forward(self, x):
        x = self.map(x)
        return F.normalize(x, dim=-1)

def geodesic_loss(x, y, eps=1e-7):
    # Geodesic (arc-length) distance between points on the unit hypersphere
    cos_sim = (x * y).sum(-1).clamp(-1 + eps, 1 - eps)
    return torch.acos(cos_sim).mean()
Teacher (Whisper-large-v2, frozen)
        │
        ▼
Student (<Structure-free> Student from the teacher + LoRA adapters)
        │
        ├── CE loss (labels supervision)
        │     │
        │     └── Hard labels = ground truth text
        │           (e.g. "Hello world" from dataset)
        │
        ├── KL loss (soft logits distillation)
        │     │
        │     └── Soft labels = teacher's predicted probabilities
        │           (e.g. P("hello")=0.62, P("hey")=0.31, P("halo")=0.07)
        │
        └── Geo loss (Riemannian alignment)
              │
              └── Aligns latent embeddings on a curved manifold
                    (ensures student follows teacher's geometry)
        │
Optimizer (AdamW + Cosine LR)
        │
LoRA Adapter Checkpoint
        │
Evaluation (WER / RTF / Memory)
s_hid = student_proj(s_out.encoder_last_hidden_state)
t_hid = normalize(t_out.encoder_last_hidden_state)
geo = geodesic_distance_on_sphere(s_hid, t_hid)
TAID
Initial training (step=0): λ=0.1, intermediate = 0.9 * student_probs + 0.1 * teacher_probs
  → Mainly learn the student's own distribution
Mid-training (step=400): λ≈0.5, intermediate = 0.5 * student_probs + 0.5 * teacher_probs
  → Balanced learning
Late training (step=800): λ=0.9, intermediate = 0.1 * student_probs + 0.9 * teacher_probs
  → Mainly learn the teacher's distribution
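A hedged sketch of this TAID-style schedule: the distillation target is a time-dependent mixture of student and teacher distributions whose weight λ ramps from 0.1 to 0.9. The linear ramp and tensor names are assumptions for illustration, not the paper's exact schedule.
import torch
import torch.nn.functional as F

def taid_kl_loss(student_logits, teacher_logits, step, total_steps, T=2.0):
    lam = 0.1 + 0.8 * min(step / total_steps, 1.0)              # λ ramps 0.1 → 0.9
    p_student = F.softmax(student_logits / T, dim=-1)
    p_teacher = F.softmax(teacher_logits / T, dim=-1)
    target = (1 - lam) * p_student.detach() + lam * p_teacher   # intermediate target
    log_q = F.log_softmax(student_logits / T, dim=-1)
    return F.kl_div(log_q, target, reduction="batchmean") * (T ** 2)

# e.g. loss = taid_kl_loss(s_logits, t_logits, step=400, total_steps=800)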
Background Knowledge
RL On Diffusion
I. Base Diffusion Backbone (Generative Prior)
Input (x₀ = real data sample: image, trajectory, audio, 3D scene)
        ↓
Forward Diffusion Process (adds Gaussian noise)
        ↓
x₁ ← √α₁·x₀ + √(1−α₁)·ε₁
x₂ ← √α₂·x₁ + √(1−α₂)·ε₂
        ⋮
x_T ≈ pure Gaussian noise N(0, I)
        ↓
Reverse Denoising Process (parameterized by neural network ε_θ)
        ↓
x_{t−1} = (x_t − √(1−α_t)·ε_θ(x_t, t, cond)) / √α_t + η·σ_t
        ↓
UNet / Transformer backbone → learns to reconstruct x₀
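The forward chain above collapses into the closed form x_t = √ᾱ_t·x₀ + √(1−ᾱ_t)·ε. Below is a minimal sketch of that sampling step, with an assumed linear β schedule and toy tensor shapes.
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)            # assumed linear beta schedule
alphas_bar = torch.cumprod(1.0 - betas, dim=0)   # cumulative product of (1 - beta_t)

def q_sample(x0, t):
    # Sample x_t directly from clean data x0 at integer timestep t
    eps = torch.randn_like(x0)
    a_bar = alphas_bar[t]
    return a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * eps, eps

x0 = torch.randn(4, 3, 32, 32)                   # a batch of "clean" samples
xt, eps = q_sample(x0, t=500)                    # noised halfway through the chain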
II. Policy Representation via Diffusion
Environment State s_t
        ↓
Noise z_t ~ N(0, I)
        ↓
Diffusion Policy Network ε_θ(s_t, z_t, t)
        ↓
Sample Action a_t = Denoise(z_t | s_t)
        ↓
Execute Action in Environment → Receive Reward r_t
        ↓
Collect Trajectory τ = {s_t, a_t, r_t}
IV. Reward-Guided Diffusion Training (Diffusion Policy Optimization)
For each episode:
  1. Sample noise x_T ~ N(0, I)
  2. Run reverse diffusion (ε_θ) conditioned on state s_t
  3. Generate predicted action trajectory x₀
  4. Execute in environment → collect reward R
  5. Compute loss:
       L_total = L_diffusion + λ·L_RL
       L_RL = −E[R(τ)]
  6. Backpropagate through the ε_θ network
Diffusion Policy, Decision Diffuser
Random Noise in Action Space
        ↓
Diffusion or Flow Process
        ↓
Denoising Steps / Continuous Flow
        ↓
Policy Network predicts ε_θ(x_t, t)
        ↓
Clean Action Sequence (Optimal Trajectory)
        ↓
Execute in Environment (Robotics / Control)
| Function | Formula | Derivative | Core Idea | Usage / Notes |
|---|---|---|---|---|
| Sigmoid | \(f(x) = \frac{1}{1 + e^{-x}}\) | \(f'(x) = f(x)\,[1 - f(x)]\) | Smooth bounded mapping (0, 1) | Common in probabilistic outputs |
| Tanh | \(f(x) = \tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}\) | \(f'(x) = 1 - f(x)^2\) | Zero-centered output | Improves symmetry over Sigmoid |
| ReLU | \(f(x) = \max(0,\,x)\) | \(f'(x)=\begin{cases}1,&x>0\\0,&x\le0\end{cases}\) | Sparse and efficient | Fast convergence, stable training |
| Leaky ReLU | \(f(x)=\max(\alpha x,\,x)\) | piecewise constant | Avoids dead neurons | Small negative slope for x < 0 |
| Swish / SiLU | \(f(x)=x\,\sigma(x),\ \sigma(x)=\frac{1}{1+e^{-x}}\) | \(f'(x)=\sigma(x)+x\,\sigma(x)[1-\sigma(x)]\) | Smooth, self-gated ReLU | Used in Google EfficientNet |
| Mish | \(f(x)=x\,\tanh(\ln(1+e^x))\) | smooth | Non-monotonic, better gradient flow | Used in YOLOv4, ResNet variants |
| GELU | \(f(x)=x\,\Phi(x),\ \Phi(x)\text{: Gaussian CDF}\) | smooth | Probabilistic gating | Default in Transformers (BERT, GPT) |
| JumpReLU (DeepMind) | \(f(x)=\max(0,\,x-j),\ j\text{ learned}\) | piecewise constant | Learnable sparsity threshold | Used in Sparse Autoencoders for interpretability |
| Softmax | \(f_i(x)=\frac{e^{x_i}}{\sum_j e^{x_j}}\) | – | Converts logits → probabilities | Standard output for classification |
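A quick numerical check of the Swish/SiLU and GELU rows (the other entries follow the same pattern); the input grid below is arbitrary.
import torch
import torch.nn.functional as F

x = torch.linspace(-3, 3, 7)
swish = x * torch.sigmoid(x)       # Swish / SiLU: x * sigma(x)
gelu = F.gelu(x)                   # GELU: x * Phi(x)
print(swish)
print(gelu)                        # nearly identical for large |x|, differ near 0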
Learning Rates
| Trend | Description | Representative Systems |
|---|---|---|
| Cosine + Warmup → Standard Default | Most stable across architectures. | ViT, GPT-J, Whisper, Stable Diffusion |
| Adaptive + Restart Hybrids | Combine SGDR + ReduceLROnPlateau. | DeepSpeed, Megatron-LM, PaLM 2 |
| Optimizer-Integrated Scheduling | Scheduler coupled with optimizer (AdamW, LAMB). | GPT-4, Gemini 1.5, Claude 3 |
| Noisy / Stochastic Schedules | Inject noise to encourage flat minima. | Google Brain NAS, RL-based training |
| Dynamic Data-Aware LR Control | LR adapted by validation loss or gradient norm. | Reinforcement fine-tuning (RLHF, PPO) |
Scaling Law
| Year | Model | Number of Layers | Parameter Count | FLOPs (per inference) | Activations (per forward pass) | Typical Memory Footprint |
|---|---|---|---|---|---|---|
| 1998 | LeNet | 5 | ~0.1 M | ~0.001 GFLOPs | < 1 MB | < 10 MB |
| 2012 | AlexNet | 8 | 60 M | ~1.5 GFLOPs | ~100 MB | ~1 GB |
| 2015 | VGG-16 | 16 | 138 M | ~15 GFLOPs | ~200 MB | ~2–4 GB |
| 2016 | ResNet-152 | 152 | 60 M | ~11 GFLOPs | ~250 MB | ~4–6 GB |
| 2018 | BERT-Large | 24 | 340 M | ~180 GFLOPs | ~1 GB | ~10–12 GB |
| 2020 | GPT-3 | 96 | 175 B | ~3.1 × 10¹² FLOPs | ~20 GB | ~350 GB (weights) / > 1 TB (training) |
| 2024 | GPT-4 / Gemini 1.5 / Claude 3 | ~120–200 | > 1 T (trillion) | ~10¹³–10¹⁴ FLOPs | > 50 GB (activations) | Multiple TB (large-scale training) |
Generalization and Regularization
Underfitting: classes scattered and intermixed; no clear cluster structure
Overfitting: clusters extremely tight; validation points fall outside them
Good Embedding: clusters clear but smooth; validation & training points overlap
| Principle | Intuition |
|---|---|
| Regularization = adding controlled noise or constraints to prevent memorization. | Introduces noise or limits (e.g., dropout, weight decay, data augmentation) so the model learns general patterns instead of memorizing the training set. |
| Overfitting = perfect fit on training data, poor generalization. | The model minimizes training loss too well, capturing noise instead of true structure → leads to poor performance on unseen data. |
| Goal = flatter minima + smoother decision boundaries. | Seek regions in the loss landscape where small parameter changes do not greatly affect loss → resulting in more stable, generalizable models. |
CNN
[Input DรE (image or signal)]
โ
โผ
[Convolution UรV (kernel/filter)]
โ learns local spatial patterns
โ parameters โช fully-connected layers
โผ
[Zero-Padding / Stride Control]
โ
โโ Padding โ keeps size (same)
โโ Stride โ downsamples (DโU)/S+1
โผ
[Feature Map KรM (activation before nonlinearity)]
โ
โผ
[Activation g(a) โ ReLU / Sigmoid / Tanh]
โ
โผ
[Pooling RรR window (Avg / Max / Global)]
โ
โโ replaces stride for down-sampling
โโ reduces spatial size, increases receptive field
โโ enhances translation invariance
โผ
[Stacked Conv + Pooling Layers]
โ
โโ small kernels (3ร3) + pooling โ large receptive field
โโ more layers > larger kernels (prefer depth)
โโ weights grow linearly w/ layers
โผ
[Flatten or Global Pooling]
โ
โโ flatten: A โ โ^{QรKรM} โ a โ โ^{QยทKยทM}
โโ global pooling: spatial avg โ a โ โ^{Q}
โผ
[Fully-Connected Layer + Loss]
โ
โโ Regression โ J_L2
โโ Binary โ J_BCE
โโ Categorical โ Softmax + J_CCE
โผ
[Output Prediction y / Class Probabilities]
Forward Pass
Input (32×32×3)
   ↓
Conv (3×3 kernel, 16 filters)
   ↓
ReLU activation
   ↓
Max Pooling (2×2)
   ↓
Conv (3×3 kernel, 32 filters)
   ↓
ReLU
   ↓
Global Avg Pooling
   ↓
Flatten → Dense (Fully-connected)
   ↓
Softmax → [Cat, Dog, Car, …]
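A hedged PyTorch sketch of this forward pass (two 3×3 conv blocks with 16 and 32 filters, global average pooling, then a dense head); the number of classes is an assumption, and the softmax is left to the loss function as usual.
import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                           # 32x32 -> 16x16
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),                   # global average pooling
        )
        self.head = nn.Linear(32, num_classes)

    def forward(self, x):
        x = self.features(x).flatten(1)
        return self.head(x)                            # logits; softmax lives in the loss

logits = TinyCNN()(torch.randn(2, 3, 32, 32))
print(logits.shape)                                    # torch.Size([2, 10])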
Optimization for Training
| Stage | Method | Purpose / Effect |
|---|---|---|
| Initialization Stage | Xavier / He initialization | Avoid falling into poor regions at the start |
| Early Exploration Stage | Large learning rate + Momentum | Maintain global exploration ability |
| Mid Convergence Stage | Adam / RMSProp + Cosine Annealing | Ensure smooth descent and curvature adaptation |
| Late Fine-tuning Stage | SAM / Entropy-SGD / Weight Decay | Locate flat minima and enhance generalization |
| During Training | Mini-batch noise + Dropout | Prevent getting stuck at saddle points |
| Architectural Level | Residual connections / Normalization layers | Improve gradient flow and smooth the optimization landscape |
Normalization and Regularization in different Model Structures
| Item | L1 Regularization | L2 Regularization |
|---|---|---|
| Shape | Diamond-shaped constraint | Circular constraint |
| Optimum Point | Usually lies on the coordinate axes (sparse solution) | Usually lies on the circle (continuous shrinkage) |
| Result | Some weights are "cut" to exactly 0 | All weights are smoothly reduced but remain non-zero |
| Model Example | Normalization | Regularization | Essence & How It Works |
|---|---|---|---|
| CNN (e.g., ResNet) | Batch Normalization – normalizes activations within a mini-batch to stabilize gradients and speed up convergence. | Weight Decay + Dropout – penalizes large weights and randomly drops neurons to reduce overfitting. | Normalization equalizes feature scales during training, while regularization constrains model capacity to improve generalization. |
| Transformer / LLM | Layer Normalization – normalizes hidden states across features to maintain stable activations in deep attention layers. | Attention Dropout + L2 Regularization – randomly masks attention links and adds weight penalties to prevent overfitting. | Normalization stabilizes internal representations; regularization prevents memorization of training data. |
| MLP | Input Standardization – rescales each input feature to zero mean and unit variance. | L2 Regularization (Ridge) – discourages large parameter magnitudes for smoother mappings. | Normalization improves numerical stability; regularization enforces simpler models with better generalization. |
Optimized Decoding
Classical Decoding (Without KV Cache) Optimized Decoding (With KV Cache)
โโโโโโโโโโโโโโโโโ โโโโโโโโโโโโโโโโโโโโโโโโโโ
โ Decoder โ โ Decoder + KV Cache โ
โ (Self-Attn) โ โ (Self-Attn + Storage) โ
โโโโโโโโโฌโโโโโโโโ โโโโโโโโโโโโฌโโโโโโโโโโโโโโ
โ โ
โผ โผ
โโโโโโโโโโโโโโโโโ โโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ Recompute all โ O(nยฒ) per step โ Reuse stored K/V โ
โ past tokens โ -----------------------------> โ Only new Q calculated โ
โ at every step โ โ O(n) per step โ
โโโโโโโโโโโโโโโโโ โโโโโโโโโโโโโโโโโโโโโโโโโโโ
โ โ
โผ โผ
โโโโโโโโโโโ โโโโโโโโโโโโโโโโโ
โ Latency โ โ Low Latency โ
โ High โ โ On-Device OK โ
โโโโโโโโโโโ โโโโโโโโโโโโโโโโโ
- Redundant computation - No recomputation
- High memory bandwidth - Lower memory & power
- Slow inference - Faster inference
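A toy single-head sketch of the cached path: keys and values of past tokens are stored, so each decoding step computes attention only for the newest query. The shapes and random weights are simplifications, not Whisper's actual implementation.
import torch

d = 16
Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))
K_cache, V_cache = [], []

def decode_step(x_new):
    q = x_new @ Wq                                     # query for the new token only
    K_cache.append(x_new @ Wk)                         # append new key/value to the cache
    V_cache.append(x_new @ Wv)
    K, V = torch.stack(K_cache), torch.stack(V_cache)
    attn = torch.softmax(q @ K.T / d ** 0.5, dim=-1)   # O(n) work per step, not O(n^2)
    return attn @ V

for _ in range(5):
    out = decode_step(torch.randn(d))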
Training Loop
   ↓
[ Forward pass ]
   ↓
[ Compute loss ]
   ↓
[ Backward pass: compute gradients ]
   ↓
[ **Gradient Clipping** ] ← `clip_grad_norm_(model.params, max_norm)`
   ↓
[ **AdamW Update** ] ← `optimizer = AdamW(lr=…, weight_decay=…)`
   ↓
[ Zero Gradients ] ← `optimizer.zero_grad()`
   ↓
[ **Cosine LR Annealing** ] ← `scheduler = CosineAnnealingLR(optimizer, T_max, eta_min)`
   ↓
[ Next batch ]
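A minimal sketch of this loop with a placeholder model and random batches; the hyperparameters are assumptions, and the clip → step → zero_grad → scheduler order follows the diagram.
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

model = torch.nn.Linear(16, 4)
optimizer = AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)
scheduler = CosineAnnealingLR(optimizer, T_max=1000, eta_min=1e-6)

for step in range(1000):
    x, y = torch.randn(8, 16), torch.randint(0, 4, (8,))        # stand-in batch
    loss = torch.nn.functional.cross_entropy(model(x), y)        # forward + loss
    loss.backward()                                              # backward pass
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()                                             # AdamW update
    optimizer.zero_grad()                                        # clear gradients
    scheduler.step()                                             # cosine LR annealing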
Single-card peak ≈ 13–15 GB VRAM → start your experiments with FP16 + AMP
Whisper large-v3 has the same architecture as the previous large and large-v2 models, except for the following minor differences:
- The spectrogram input uses 128 Mel frequency bins instead of 80
- A new language token for Cantonese
- Each token output by
Attention carries global context information, while the FFN applies "fine-tuning" or "feature combination" to each token to improve the feature quality at each position.
Transformer Assembly Line
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
Raw Donuts โ Community Check โ Solo Decoration โ Finished Donuts
(Input) (Attention) (FFN) (Output)
โ โ โ โ
โโโโโโโโโโโ โโโโโโโโโโโโโโโโ โโโโโโโโโโโโโโโโ โโโโโโโโโโโ
โ Plain โโ โ Community โโ โ Solo โ โ โGourmet โ
โ Donuts โ โ Analysis โ โ Decoration โ โDonuts โ
โโโโโโโโโโโ โโโโโโโโโโโโโโโโ โโโโโโโโโโโโโโโโ โโโโโโโโโโโ
โ โ โ โ
X₀ (input)        X₁ = Attention        X₂ = FFN Output
1. X₁ᵢ = Σⱼ αᵢⱼ · Vⱼ                    (Global Linear)
2. X₂ᵢ = W₂·ReLU(W₁·X₁ᵢ + b₁) + b₂       (Local Nonlinear)
Attention: convex combination → stays within input space
FFN: nonlinear transformation → can transcend input space
Activation Function Characteristics Comparison
| Function | Smoothness | Computational Complexity | Gradient Properties | Performance |
|---|---|---|---|---|
| ReLU | Non-smooth | Minimal | Sparse | Baseline |
| GELU | Smooth | Moderate | Dense | Better |
| SwiGLU | Smooth | High | Gated | Best |
| Mish | Very smooth | High | Adaptive | Very good |
| Swish | Smooth | Moderate | Self-gated | Good |
| ELU | Smooth | Moderate | Negative-safe | Good |
Input Features From Whisper
Fourier Transform
Original Domain:
  - Pixels (Images)
  - Samples (Audio, Signals)
  - Tokens (Text)
        ↓ Fourier Transform
Frequency Domain:
  - Low frequencies → smooth regions
  - High frequencies → edges
  - Harmonics → fine details
        ↓
CNNs: edges = high frequencies, smooth regions = low frequencies; convolutions are simplified in frequency space
Transformers: sinusoidal positional encoding acts as a frequency basis
Speech/Image: STFT / spectrograms highlight textures, recognize phonemes, detect fine image details
Teacher (Whisper-large-v2) Student
โโโโโโโโโโโโโโโโโโโโโโโโโโโโโ โโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโโ
Audio Input Audio Input
1 ร T samples 1 ร T samples
โ โ
โผ โผ
Whisper Encoder Whisper Encoder
1280-d hidden, T~1500 frames 768-d hidden, T~499 frames
(32 layers, FROZEN) (12 layers, FROZEN)
โ โ
โ โ
โโโโโโโโ Hidden States โโโโโโโโโโโโโโโโโโโโโโโโ โโโโ Projection Layer โโโโ
โ (B,1500,1280) โ (768โ1280) โ
โ โ โ
โผ โผ โผ
Whisper Decoder Whisper Decoder Aligned Hidden
(32 layers, FROZEN) (4 layers +/ LoRA) (B,1500,1280)
โ โ โ
โ โ โ
โผ โผ โ
Teacher Logits โโโโโโ Soft Targets โโโโโโโโโโถ Student Logits โ
(B,seq,vocab) (KL Loss) (B,seq,vocab) โ
โ T=temperature โ โ
โ โ โ
โ โผ โ
โ Hard Labels โโโ Ground Truth โ
โ (CTC Loss) โ
โ โ โ
โ โ โ
โ โผ โ
โ Student Loss โโโโ MSE Loss โโโโโโโ
โ โ (Hidden Align)
โ โ
โผ โผ
No parameter updates (LoRA) + Projection parameters
(Inference only) ONLY these are trained
Why T ≈ 499
- Whisper feature extractor
- The original audio (30 s) generates about 3000 frames of 80-dimensional log-Mel features at a granularity of 10 ms per frame
- Whisper first divides the original mono audio (30 seconds, 16 kHz) into several short segments
- Generate an 80-dimensional log-Mel feature every 10 ms
- 30 s / 0.01 s = 3000 frames
- These 3000 frames are still very dense. If Transformer processes them directly, the computational workload and memory requirements will be too high
- Before being fed into the Transformer encoder, these 3000 frames are first downsampled through a convolutional layer (stride=2), and then continuously merged or downsampled in the multi-layer Transformer block
- The final output length is about 3000 / 2 / 3 = 500 frames (actually 499 frames)
30 s audio
  ↓ (extract 80-dim log-Mel features every 10 ms)
3000 frames
  ↓ (convolutional layer with stride=2)
1500 frames
  ↓ (further down-sampling/merging inside the Transformer encoder, ≈×3)
  ↓ (Pooling or Conv1d: kernel_size=3, stride=3)
≈500 frames (actually 499 frames)
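The arithmetic behind this chain, written out as a quick check (the exact 499 depends on the kernel/padding of the merge step):
mel_frames = 30_000 // 10       # 30 s, one log-Mel frame every 10 ms -> 3000
after_conv = mel_frames // 2    # stride-2 convolution -> 1500
after_merge = after_conv // 3   # 3:1 merge -> ~500 (499 once kernel/padding edges are counted)
print(mel_frames, after_conv, after_merge)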
- Audio signal characteristics - redundancy → why the sequence can be compressed to T ≈ 499 frames
1. Audio frame rate is typically high
sample_rate = 16000 # 16 kHz sampling rate
frame_rate = 100 # 100 frames per second
frame_duration = 10 # 10 ms per frame
2. 30 seconds of audio
total_frames = 30 * frame_rate # 3000 frames
3. Adjacent frames are highly correlated
correlation_coefficient ≈ 0.9 # typical inter-frame correlation
- Always enable automatic checkpoint saving
- !pip install -U bitsandbytes>=0.41.0
- Put your teacher model on the CPU
- MIN_DURATION = 1.0
- MAX_DURATION = 30.0 # same as Whisper's maximum input length
Original LoRA Paper
ΔW = A · B → only low-rank increments are applied to W_q and W_v in the attention layers
Choices of LoRA Injection
decoder.layers.*.encoder_attn.q_proj
decoder.layers.*.encoder_attn.v_proj
decoder.layers.*.self_attn.q_proj
decoder.layers.*.self_attn.v_proj
decoder.layers.*.encoder_attn.q_proj, encoder_attn.k_proj, encoder_attn.v_proj
decoder.layers.*.self_attn.q_proj, self_attn.k_proj, self_attn.v_proj
decoder.layers.*.fc2
decoder.layers.*.encoder_attn.q_proj, encoder_attn.k_proj, encoder_attn.v_proj, encoder_attn.out_proj
decoder.layers.*.self_attn.q_proj, self_attn.k_proj, self_attn.v_proj, self_attn.out_proj
decoder.layers.*.fc1, fc2
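A hedged sketch of wiring one of these choices with Hugging Face PEFT; passing bare module names matches every q_proj/v_proj by suffix, so restricting to decoder layers only would require the full dotted paths listed above. The rank/alpha/dropout values and the whisper-small base are assumptions.
from peft import LoraConfig, get_peft_model
from transformers import WhisperForConditionalGeneration

base = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
lora_cfg = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # suffix match: both self_attn and encoder_attn projections
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()         # only the low-rank A·B increments are trainable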
Temperature
- Initial pilot temperature: T =
- Search range: [ ]
- Optuna hyperparameter: include temp as a tunable parameter
- Guidance: prevent over-smoothing (i.e. avoid T > 5)
Hard vs. Soft Labels in Knowledge Distillation
- Hard Labels: one-hot vectors from ground truth
  y = [0, …, 1, …, 0]
  • Strong supervision → binary certainty
  • Forces correct classification
- Soft Labels: teacher's softmax outputs
  p_teacher = [0.6, 0.3, 0.1]
  • Confidence & uncertainty
  • Encodes inter-class similarity
Why num_workers Affects GPU Performance
The num_workers parameter in PyTorch DataLoader controls the number of CPU processes responsible for data loading and preprocessing. This directly impacts GPU utilization through data-pipeline optimization.
Performance Comparison
Single-threaded (num_workers=0)
- CPU: Load → Preprocess → Transfer, GPU idle, Load → Preprocess → Transfer
- GPU: Idle, Compute, Idle
Multi-threaded (num_workers=4)
- CPU: Continuous data preparation (4 parallel threads)
- GPU: Continuous computation (minimal idle time)
Key Insight
- Increasing num_workers enhances "CUDA kernel parallelism" not by adding GPU parallelism, but by eliminating GPU starvation. Multiple CPU workers ensure the GPU receives a steady stream of preprocessed data, maximizing hardware utilization and reducing training time.
- The optimal num_workers typically ranges from 2-4 per GPU, depending on CPU core count and I/O bottlenecks
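A minimal illustration of the point above: only the CPU side of the pipeline changes. The stand-in dataset and batch size are placeholders.
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(256, 80, 300))     # stand-in audio features
loader = DataLoader(dataset, batch_size=8, shuffle=True,
                    num_workers=4,      # 4 CPU worker processes prefetch batches
                    pin_memory=True)    # faster host-to-GPU copies
for (batch,) in loader:                 # on macOS/Windows, wrap this loop in `if __name__ == "__main__":`
    pass                                # GPU forward/backward work would go here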
CTC Loss - Hard Supervision - Cross-Entropy (CE) Loss Is Used Here since Whisper Is Seq2Seq with a Decoder
Since Whisper is a Seq2Seq model with Decoder, cross-entropy loss is employed here.
The decoder generates hidden state sequences at step $u$: \(\{\mathbf{d}_u\}_{u=1}^U\)
mapping to the target text sequence: \(\{y_u\}_{u=1}^U\)
using token-by-token one-to-one supervision:
- Token-to-Token Alignment: each step has a clear "correct" next token, requiring no implicit alignment
- One-Step Supervision: cross-entropy is directly applied to the prediction distribution at each position $u$
- Direct Gradient: backpropagated from the output layer, enabling stable convergence
Cross-Entropy Loss Formula \(\mathcal{L}_{\mathrm{CE}} = -\sum_{u=1}^U \log P_\theta\bigl(y_u \mid y_{<u}, \mathbf{h}_{1:T}\bigr)\)
where:
- $\mathbf{h}_{1:T}$ represents the audio representation output by the encoder
- $y_{<u}=(y_1,\dots,y_{u-1})$ are the previously generated tokens
- $U$ is the target sequence length
Following the encoderโs output audio frame sequence: \(\{\mathbf{h}_t\}_{t=1}^T\)
mapping to transcript tokens: \(\{y_u\}_{u=1}^U\)
without explicit frame-level labels:
- Frame-to-Token Alignment: automatic alignment from audio frames to text tokens
- Marginalizing Paths: marginalizing over all possible alignment paths
- Gradient Signal: gradient signals propagate to all relevant audio frames through attention mechanisms
KL Distillation Loss - Soft Supervision
KL Distillation Loss compares the teacherโs and studentโs posterior distributions over labels at each time-step in latent space
- Soft Distribution Matching
- Preference Transfer
- Capturing Uncertainty
Since the softmax outputs retain probabilities for all tokens, the KL term transfers the teacher's uncertainty patterns: e.g., when the teacher is unsure between two phonemes, the student learns to mirror that ambiguity
Total Loss
\[L_{\text{total}} = L_{\mathrm{CE}} + 0.xx\,T^{2}\,L_{\mathrm{KD}} + \alpha\,L_{\mathrm{hidden\_align}}\]
where
\[\begin{aligned} & L_{\mathrm{CE}} &&\text{is the hard CE loss}\\ & L_{\mathrm{KD}} = \mathrm{KL}\bigl(p_{\rm teacher}^{T}\;\|\;p_{\rm student}^{T}\bigr) &&\text{is the softened KL-divergence loss with temperature }T\text{ and weight }0.8\text{ (the same as the student backbone)}\\ & L_{\mathrm{hidden\_align}} &&\text{is the projected hidden-state MSE loss with weight }\alpha \end{aligned}\]
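A hedged sketch of this total loss with placeholder tensors, assuming temperature T = 2.0, a KD weight of 0.8, and an unspecified α:
import torch
import torch.nn.functional as F

def total_loss(s_logits, t_logits, labels, s_hidden, t_hidden,
               T=2.0, lambda_kd=0.8, alpha=1.0):
    # Hard CE on (B, U, vocab) logits vs. (B, U) token ids; -100 marks padding
    ce = F.cross_entropy(s_logits.transpose(1, 2), labels, ignore_index=-100)
    # Softened KL distillation, scaled by T^2 as usual
    kd = F.kl_div(F.log_softmax(s_logits / T, dim=-1),
                  F.softmax(t_logits / T, dim=-1),
                  reduction="batchmean") * (T ** 2)
    # Projected hidden-state alignment
    hidden = F.mse_loss(s_hidden, t_hidden)
    return ce + lambda_kd * kd + alpha * hidden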
Hyperparameter Optimization
In the 15-hour dataset experiment, 50 trials were sufficient as a warm-up search without problems. For large-scale tuning in a production environment, it is recommended to increase n_trials to 50-100.
import optuna
from optuna.pruners import MedianPruner
from optuna.samplers import TPESampler
def objective(trial):
    # Distillation loss weights
    alpha = trial.suggest_loguniform("alpha", 1e-3, 1e1)
    beta = trial.suggest_loguniform("beta", 1e-3, 1e1)
    # Optimization hyperparameters
    lr = trial.suggest_loguniform("lr", 1e-5, 1e-3)
    batch_size = trial.suggest_categorical("batch_size", [4, 8, 16, 32])
    dropout = trial.suggest_float("dropout", 0.0, 0.5)
    # Train & evaluate with these settings (implement train_and_evaluate accordingly)
    wer = train_and_evaluate(
        alpha=alpha,
        beta=beta,
        learning_rate=lr,
        batch_size=batch_size,
        dropout=dropout,
        pruner=trial,  # for early stopping / pruning
    )
    return wer
# Pruner to stop unpromising trials early
pruner = MedianPruner(n_startup_trials=5, n_warmup_steps=100)
sampler = TPESampler()
study = optuna.create_study(
direction="minimize",
sampler=sampler,
pruner=pruner
)
study.optimize(objective, n_trials=100)
print("Best hyperparameters:", study.best_params)
PCA vs. t-SNE vs. UMAP vs. DTW
- Local weights
  w_ij = exp(−(d(x_i, x_j) − ρ_i) / σ_i)
  w_ji = exp(−(d(x_j, x_i) − ρ_j) / σ_j)
- Fuse into a single "strength" score
  μ_ij = w_ij + w_ji − w_ij * w_ji
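A small NumPy sketch of these two formulas (directed local weights fused by the probabilistic union); the shared σ and the toy data are simplifications of UMAP's per-point calibration.
import numpy as np

def local_weights(X, sigma=1.0):
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)   # pairwise distances
    np.fill_diagonal(d, np.inf)                                  # ignore self-distances
    rho = d.min(axis=1, keepdims=True)                           # distance to nearest neighbour
    return np.exp(-(d - rho) / sigma)                            # directed membership strengths

X = np.random.default_rng(0).normal(size=(5, 3))
w = local_weights(X)
mu = w + w.T - w * w.T                                           # symmetric fuzzy-union strength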
Background Knowledge 2
[Training Neural Network]
โ
โผ
[Problem: Overfitting]
โ model performs well on train set
โโ poor generalization on unseen data
โผ
[Regularization Strategies]
โ
โโ L1 Regularization โ add |w| penalty
โ encourages sparsity, feature selection
โ
โโ L2 Regularization (Weight Decay)
โ adds wยฒ penalty, smooths weights
โ reduces variance, stabilizes gradients
โ
โโ Early Stopping
โ monitor validation loss โ stop early
โ
โโ Data Augmentation
โ enlarge dataset (flip, crop, color jitter)
โ improves robustness & invariance
โ
โโ Dropout
randomly deactivate neurons (mask m)
prevents co-adaptation
during inference: scale activations by p
โผ
[Normalization Layers]
โ
โโ Batch Normalization (BN)
โ normalize activations per mini-batch
โ ฮผ_B, ฯ_B computed over batch samples
โ then apply ฮณ (scale) + ฮฒ (shift)
โ allows larger learning rate & faster training
โ
โโ Layer Normalization (LN)
โ normalize across features, not batch
โ used in Transformers (batch-size independent)
โ
โโ Effect:
stabilizes gradient flow
reduces internal covariate shift
improves convergence speed
โผ
[Residual Connections]
โ
โโ skip connection y = F(x) + x
eases gradient propagation
enables very deep CNNs (ResNet)
โผ
[Combined Strategy]
โ
โโ Regularization (L1/L2)
โโ Dropout
โโ Batch Normalization
โโ Data Augmentation
โผ
[Result]
โ
โโ High generalization, stable training,
smoother optimization landscape,
reduced overfitting risk
[Closed-Set Classification]
โ
โโ assumes all test classes are known
model outputs one of O fixed labels
โผ
[Open-Set Problem]
โ
โโ real-world contains unknown categories
โโ standard SoftMax โ overconfident wrong predictions
โโ need to reject unseen (unknown) samples
โผ
[Goal: Open-Set Recognition]
โ
โโ recognize known classes correctly
โโ detect / reject unknown classes (OOD)
โผ
[Two Main Paradigms]
โ
โโ Two-Stage OSR
โ Stage 1: detect unknowns (OOD)
โ Stage 2: classify known samples
โ
โโ Integrated OSR
single model learns known + reject class
adds โunknownโ logits or rejection threshold
โผ
[Core Approaches]
โ
โโ OSDN (Open-Set Deep Network)
โ compute Mean Activation Vector (MAV)
โ distance D_o = ||ฯ - ฮผ_o||
โ fit EVT (Extreme Value Theory) model to tails
โ
โโ GHOST (Gaussian Hypothesis OSR)
โ per-class Gaussian modeling in feature space
โ normalize logits by (ฮผ_o, ฯ_o)
โ provides calibrated confidence
โ
โโ Garbage / Background Class
โ add class yโ for โnone of the aboveโ
โ weighted loss: ฮป_ฯ = N / ((O+1)N_ฯ)
โ
โโ Entropic Open-Set Loss
โ for unknowns, enforce uniform SoftMax
โ target: t_o = 1/O for all o
โ equalizes logits โ high entropy
โ
โโ Confidence Thresholding
use ฮถ threshold on SoftMax
accept if max(yฬ_o) > ฮถ, else reject
โผ
[Training]
โ
โโ Known samples: one-hot targets
โโ Unknown samples: uniform targets
โโ Loss combines CE + Entropic term
โผ
[Evaluation Metrics]
โ
โโ CCR (Correct Classification Rate)
โ true positives among known samples
โ
โโ FPR (False Positive Rate)
โ unknowns misclassified as knowns
โ
โโ OSCR Curve (CCR vs FPR)
area under curve (AUOSCR) = performance
โผ
[Modern Implementations]
โ
โโ ImageNet-based OSR protocols (P1โP3)
โโ Feature-space Gaussian models (GHOST)
โโ Entropic loss + background class hybrid
โโ Evaluation by AIML UZH / WACV 2023
โผ
[Outcome]
โ
โโ OSR enables reliable recognition under uncertainty:
โI know what I know โ and I know what I donโt.โ
| Stage | Process | Mathematical Meaning | Intuitive Explanation |
|---|---|---|---|
| Forward Process | Add Gaussian noise to clean trajectories \((x_0 \rightarrow x_T)\). | \(q(x_t \mid x_{t-1}) = \mathcal{N}(\sqrt{1 - \beta_t} \, x_{t-1}, \, \beta_t I)\) | Gradually "scrambles" a human driving path – this step is fixed and not learned. |
| Reverse Process | Learn to denoise noisy trajectories \((x_T \rightarrow x_0)\) conditioned on perception \(c\). | \(p_\theta(x_{t-1} \mid x_t, c) = \mathcal{N}(\mu_\theta(x_t, t, c), \Sigma_\theta)\) | The model learns to "restore order from noise," reconstructing human-like trajectories that fit the scene. |
| Prior-Guided Learning | Add an Anchored Gaussian prior for realistic initialization. | \(x_T \sim \mathcal{N}(\mu_{anchor}, \sigma^2 I)\) | The model doesn't predict trajectories directly; it learns to move toward the probability distribution of human driving behaviors. |
Temporal Alignment Leakage
Temporal Downsampling Effect:
  Teacher sequence (1500 frames)
        ↓ 3:1 compression
  Student sequence (499 frames)
  → Information "leaks" to adjacent windows
| Method | Memory Usage | Training Speed |
|---|---|---|
| Normal Training | High (store all activations) | Fast (no recomputation needed) |
| Checkpointing | Low (store partial activations) | Slow (extra recomputation needed) |
Gradient Checkpointing
Forward Pass:
Input → [Layer1: store] → [Layer2: recompute later] → [Layer3: recompute later] → Output
Backward Pass:
Recompute Layer2 & Layer3 forward
Use recomputed activations → compute gradient
Use Layer1 activation → compute gradient
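A hedged PyTorch sketch of the same idea using torch.utils.checkpoint (available in recent PyTorch versions): Layer2/Layer3 activations are not stored during the forward pass and are recomputed in backward. Layer sizes are illustrative.
import torch
from torch.utils.checkpoint import checkpoint

layer1 = torch.nn.Linear(64, 64)
layer2 = torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.ReLU())
layer3 = torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.ReLU())

x = torch.randn(8, 64, requires_grad=True)
h1 = layer1(x)                                    # activation stored as usual
h2 = checkpoint(layer2, h1, use_reentrant=False)  # not stored; recomputed in backward
h3 = checkpoint(layer3, h2, use_reentrant=False)
h3.sum().backward()                               # triggers recomputation of layer2/layer3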
References
- [FitNets]
- 1991 - Adaptive Mixtures of Local Experts
- 2022 - Knowledge Distillation via Hypersphere Features Distribution Transfer
- 2025 - An Intuitive Overview of Few-Step Diffusion Distillation
- [2025 - TAID]
- [Polyscope - Toolkit for demos]
- 2025 - Efficient Distillation of Classifier-Free Guidance using Adapters
- 2025 - AXLearn: Modular Large Model Training on Heterogeneous Infrastructure
- 2013 - Efficient Estimation of Word Representations in Vector Space
- 2014 - Adam: A Method for Stochastic Optimization
- 2016 - Information Geometry and Its Applications
- 2015 - Matrix Backpropagation for Deep Networks With Structured Layers
- 2019 - Auxiliary teacher - Improved Knowledge Distillation via Teacher Assistant
- 2023 - Sub-sentence encoder: Contrastive learning of propositional semantic representations
- 2023 - Accelerating Large Language Model Decoding with Speculative Sampling
- ASR WER + Latency
- 2021 - 1-bit Adam: Communication Efficient Large-Scale Training with Adam's Convergence Speed
- 2024 - SqueezeAttention: 2D Management of KV-Cache in LLM Inference via Layer-wise Optimal Budget
- 2025 - Qwen/Qwen3-235B-A22B-Instruct-2507