2025 - Thesis - Deep Learning

Latent Space, UZH AI, AI Center


Diffusion Models

def compute_distillation_loss(s, t, ce_loss, kl_loss, kl_weight, geo_weight, eps=1e-7):
    # Geodesic term: angle between normalized student (s) and teacher (t) hidden states
    cos_sim = (s * t).sum(dim=-1).clamp(-1 + eps, 1 - eps)
    geo_loss = torch.acos(cos_sim).mean()
    ...
    # Weighted sum of the three objectives: CE + lambda_kl * KL + lambda_geo * Geo
    total_loss = ce_loss + kl_weight * kl_loss + geo_weight * geo_loss
    return total_loss, ce_loss.item(), kl_loss.item(), geo_loss.item()


Backpropagation

Stage Operation Expression Meaning
Forward Pass Compute layer outputs \(z^{(l)} = W^{(l)} a^{(l-1)} + b^{(l)}, \quad a^{(l)} = f(z^{(l)})\) Obtain network predictions
Compute Loss Compute error \(L = \tfrac{1}{2}\lVert\hat{y} - y\rVert^2\) Measure output error
Backward Pass Backpropagate from output layer \(\delta^{(L)} = (\hat{y} - y) \odot f'(z^{(L)})\) Compute output-layer gradient
Propagate to previous layers \(\delta^{(l)} = (W^{(l+1)})^T \delta^{(l+1)} \odot f'(z^{(l)})\) Compute hidden-layer gradients
Gradient Computation Compute parameter gradients \(\frac{\partial L}{\partial W^{(l)}} = \delta^{(l)} (a^{(l-1)})^T\) Obtain weight gradients
Update Update parameters \(W^{(l)} \leftarrow W^{(l)} - \eta \frac{\partial L}{\partial W^{(l)}}\) Optimize via gradient descent
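
A minimal NumPy sketch of the table above for one hidden layer; the shapes, the tanh activation, and the linear output layer are illustrative choices, not part of the original notes.

import numpy as np

# One-hidden-layer version of the table: forward pass, loss, backward pass, update.
def f(z):  return np.tanh(z)             # activation
def df(z): return 1.0 - np.tanh(z) ** 2  # its derivative

rng = np.random.default_rng(0)
x, y = rng.normal(size=(3, 1)), rng.normal(size=(2, 1))
W1, b1 = rng.normal(size=(4, 3)), np.zeros((4, 1))
W2, b2 = rng.normal(size=(2, 4)), np.zeros((2, 1))

# Forward pass: z = W a + b, a = f(z); the output layer is linear here
z1 = W1 @ x + b1
a1 = f(z1)
y_hat = W2 @ a1 + b2

# Loss L = 1/2 ||y_hat - y||^2 and backward pass
delta2 = y_hat - y                       # output-layer error (f' = 1 for the linear output)
delta1 = (W2.T @ delta2) * df(z1)        # propagate to the hidden layer
dW2, dW1 = delta2 @ a1.T, delta1 @ x.T   # parameter gradients

# Gradient-descent update with learning rate eta
eta = 0.1
W2, W1 = W2 - eta * dW2, W1 - eta * dW1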


Optimal Methods Using Hash Values

Problem Original Complexity Optimal Complexity Optimal Method Further Optimization
Check Anagram O(n) O(n) Counter / Hash Map Cannot Be Improved
Dictionary Anagram Lookup O(M × N log N) O(M × N) Hash Value + Character Count Key Significantly Optimizable
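
A short sketch of both rows (function and variable names are illustrative): a Counter comparison for the O(n) anagram check, and a character-count signature as the dictionary key for the O(M × N) lookup.

from collections import Counter

# Anagram check in O(n) via a hash map of character counts, and a dictionary
# index keyed by the character-count signature for fast anagram lookups.
def is_anagram(a: str, b: str) -> bool:
    return Counter(a) == Counter(b)

def build_anagram_index(words):
    index = {}
    for w in words:
        key = tuple(sorted(Counter(w).items()))   # signature instead of sorting the whole word
        index.setdefault(key, []).append(w)
    return index

index = build_anagram_index(["listen", "silent", "enlist", "google"])
print(is_anagram("listen", "silent"))                    # True
print(index[tuple(sorted(Counter("tinsel").items()))])   # ['listen', 'silent', 'enlist']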


Hash Map and Graph for Optimization

Analogy Hash Map in Data Structures Dynamic Programming / Graph in Algorithms
Essence Trade space for time: achieve O(1) lookup. Trade state-graph computation for optimal solution: typically O(N × M).
Advantage Globally optimal method for key lookup. Globally optimal framework for decision and optimization.
Limitation Only applicable to keyโ€“value lookup problems. Only applicable to decomposable problems with optimal substructure.
Conclusion The most efficient in the lookup domain. The most general but not universal in the optimization/decision domain.


Languages

Dimension Rust Go (Golang) C++ Python
Essentially OOP ✗ (OOP-like, but primarily functional) ✗ (Has OOP features, but fundamentally procedural and concurrent) ✓ (Classic, strongly object-oriented) ✓ (Dynamic, fully object-oriented)
Programming Paradigm Multi-paradigm: Primarily functional + systems, supports OOP traits Procedural + concurrent, limited OOP Multi-paradigm: Strongly object-oriented + generic Multi-paradigm: Object-oriented + scripting
Type System Static, compiled Static, compiled Static, compiled Dynamic, interpreted
Memory Management No GC; uses ownership + borrow checker Automatic GC Manual (new/delete) or smart pointers Automatic GC
Concurrency Model Ownership-based, data-race-free ("fearless concurrency") Goroutines + channels (CSP model) Multithreading with manual locks GIL limits true multithreading
Performance Nearly equal to C++ Close to C++, slightly slower (GC overhead) Fastest native performance Slowest (interpreted)
Safety Compile-time memory safety; prevents data races Memory-safe but not thread-safe Very fast but error-prone (dangling pointers, overflows) Safe but slow
Learning Curve Steep (requires ownership understanding) Easy (simple syntax) Steep (complex syntax and templates) Easiest (beginner-friendly)
Compile Speed Slow Fast Slow (especially for large projects) None (interpreted)
Ecosystem Young but growing fast (systems, embedded, backend) Mature (cloud, DevOps, microservices) Broadest (systems, games, embedded) Broadest (AI, data science, web)
Applications System programming, secure backend, embedded, WebAssembly Cloud-native systems, microservices, networking OS, game engines, graphics AI/ML, scripting, automation, data analysis
Philosophy "Zero-cost abstraction": safety + performance "Pragmatic simplicity": simplicity + efficiency "Total control": performance + flexibility "Ease of use": simplicity + rapid prototyping
Key Projects Firefox, Tokio, AWS Firecracker Docker, Kubernetes, Terraform Unreal Engine, Chrome, TensorRT PyTorch, TensorFlow, YouTube


Latent Space Structure

Space Core Definition Difference from Others Application Domains
Hilbert Space A complete inner product space where lengths, angles, and projections are well-defined Serves as the foundational "perfect" geometric space; all others are generalizations or relaxations Quantum mechanics, signal processing, optimization, machine learning
Banach Space A complete normed vector space, not necessarily with an inner product Has length but no angles Non-Euclidean optimization, functional analysis
Riemannian Manifold Each point has a local inner-product space (tangent space) Locally Hilbert, globally curved General relativity, geometric deep learning
Symplectic Space Equipped with an area-preserving bilinear form No distance, only conserved quantities Classical mechanics, Hamiltonian systems
Topological Space Defined only by neighborhood relationships, no metric required No notion of length or angle Generalized geometry, continuity, homotopy theory
Metric Space A set with a defined distance function d(x, y) Hilbert space is a special case Clustering, manifold learning, distance-metric learning
Probability Space A measurable space (Ω, F, P) defining random events Describes the geometry of events Probability theory, information geometry, Bayesian inference
Information Manifold A Riemannian manifold on probability distributions Uses Fisher information metric Statistical inference, information geometry, variational inference
Kähler / Complex Space Complex structure + symmetric geometry + metric Conformal generalization of Hilbert space Quantum geometry, string theory, complex optimization


Algorithms

├── I. Data Structures
│   ├── Stack, Queue, <HashMap>, LinkedList
│
├── II. Algorithmic Patterns
│   ├── Two Pointers
│   ├── Sliding Window
│   ├── Prefix Sum
│   ├── Monotonic Stack / Queue
│   ├── Binary Search Patterns
│
├── III. Complex Algorithms
│   ├── <Dynamic Programming (DP)>
│   ├── <Graph Theory (DFS/BFS/Dijkstra)>
│   ├── Recursion / Backtracking
│   ├── Greedy Algorithms
│   ├── Divide & Conquer
│
└── IV. Problem Integration
    ├── Hard composite problems
    ├── Algorithm design questions

Diffusion, Stable Diffusion, Rectified Flow

Dimension Vanilla Diffusion Model (DDPM / DDIM) Stable Diffusion (Latent Diffusion Model, LDM) Rectified Flow (Flow Matching)
Start Distribution Starts from pure Gaussian noise N(0, I) Starts from latent-space noise (compressed through an encoder) Starts from any distribution point (usually N(0, I), but customizable)
Generative Process Multi-step denoising: reverses the noise diffusion process (x_{t−1} = f_θ(x_t, t)) Multi-step denoising in latent space (computationally cheaper) (z_{t−1} = f_θ(z_t, t)) Continuous one-step flow: learns an ODE (dx_t/dt = v_θ(x_t, t))
Mathematical Formulation Discrete Markov chain (reverse SDE) Discrete SDE in latent space Continuous ODE or flow field
Computational Complexity Multi-step sampling (20–1000 steps) Multi-step but faster in latent space (20–50 steps) Single continuous integration step
Advantages High generation quality; theoretically grounded High resolution, lightweight, and controllable (supports text prompts) Fast convergence, continuous generation, minimal mode collapse
Limitations Slow sampling; many denoising steps required Strong dependence on encoder design and latent structure Sensitive training stability; harder conditional control
Representative Papers / Applications DDPM (Ho et al., 2020); DDIM (Song et al., 2021) LDM / Stable Diffusion (Rombach et al., CVPR 2022) Flow Matching / Rectified Flow (Liu et al., ICLR 2023)


Optimization

Component / Technique Description Implementation
Optimizer Gradient-based weight updates with decoupled weight decay to improve stability on large models. AdamW optimizer with lr=2.6e-4 and default β=(0.9, 0.999); stable for transformer-like models.
Learning-Rate Schedule Smooth cosine decay to avoid abrupt gradient shocks after warm-up. get_cosine_schedule_with_warmup(opt, 1000, 10000): warm-up = 1k steps, total = 10k steps.
Warm-Up Phase Gradually increases learning rate and KL weight to prevent early divergence in distillation. Linear warm-up for both learning rate and λ_kl (0 → 0.020584 during the first 1000 steps).
Mixed-Precision Training Uses half precision (torch.amp.autocast) to reduce GPU memory and improve throughput. Forward/backward passes wrapped in autocast, scaled by GradScaler() for numerical stability.
Gradient Clipping Prevents exploding gradients in long sequences. torch.nn.utils.clip_grad_norm_(params, 1.0) each iteration.
Loss Function (Multi-Objective) Balances semantic accuracy, distribution matching, and geometric alignment. Total loss: L = L_CE + λ_kl · L_KL + λ_Geo · L_Geo, with λ_Geo = 0.969909 constant.
CE Loss Supervised label alignment ensuring correct transcription semantics. Cross-entropy between student predictions and true tokens.
KL Divergence Soft-target distillation to transfer probability distributions from teacher logits. F.kl_div(log_softmax(student/T), softmax(teacher/T)) · T², T = 2.0.
Riemannian Geodesic Loss Aligns feature geometry on curved manifold instead of flat Euclidean MSE. Geodesic distance = acos(cos_sim) between normalized hidden states.
Model Architecture (Student) Lightweight CNN + Transformer hybrid for speech sequence modeling. Two 1-D Conv layers → 6 Transformer encoder blocks → linear output head.
Teacher Model Provides target logits and hidden features for distillation. Frozen Whisper-large-v2 (FP16) encoder-decoder model.
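
A hedged sketch wiring the components in the table together (AdamW, cosine warm-up schedule, autocast + GradScaler, gradient clipping). The tiny linear model and random batches are stand-ins for the actual student and data pipeline; only the hyperparameters named in the table are taken from the notes.

import torch
from torch import nn
from transformers import get_cosine_schedule_with_warmup

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(80, 512).to(device)                                     # stand-in for the student
opt = torch.optim.AdamW(model.parameters(), lr=2.6e-4, betas=(0.9, 0.999))
sched = get_cosine_schedule_with_warmup(opt, num_warmup_steps=1000, num_training_steps=10000)
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))
batches = [(torch.randn(8, 80), torch.randn(8, 512)) for _ in range(3)]   # stand-in data

for x, y in batches:
    opt.zero_grad(set_to_none=True)
    with torch.amp.autocast(device_type=device, enabled=(device == "cuda")):
        loss = (model(x.to(device)) - y.to(device)).pow(2).mean()         # stand-in for CE + KL + Geo
    scaler.scale(loss).backward()
    scaler.unscale_(opt)                                                  # unscale before clipping
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    scaler.step(opt)
    scaler.update()
    sched.step()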


distil_run_cell2.7.2/
│
├── tb/                           ← TensorBoard log files
│   ├── events.out.tfevents...
│
├── adapter_final/                ← Final trained student model
│   └── student_model.pt
│
├── checkpoint.pt                 ← Intermediate checkpoint (used if training was interrupted)
├── training_history.json         ← Recorded training and validation loss curves
├── best_params.json              ← Best hyperparameter record (e.g., kl_weight, geo_weight)
└── training_config.json          ← Training configuration and setup details


Structure

Machine Learning Fundamentals
      │
      ├── Data → Representation → Optimization → Generalization
      │       ├─ Focus: Data quality, bias mitigation, and representation learning
      │       ├─ Link to Gemini: multimodal data fusion (text, audio, vision, code)
      │       └─ Goal: Learn unified latent spaces that enable reasoning across modalities
      │
      ├── Deep Learning (CNN / RNN / Transformer)
      │       ├─ Forward & backward propagation as differentiable computation graphs
      │       ├─ Initialization, normalization, regularization → stability & convergence
      │       ├─ Loss design + learning rate scheduling → control of optimization dynamics
      │       └─ Transformer family as universal sequence learners (foundation for Gemini)
      │
      ├── Optimization & Geometry
      │       ├─ Gradient-based optimization viewed as navigating the loss landscape
      │       ├─ Flat vs. sharp minima → generalization and robustness trade-offs
      │       ├─ Riemannian geometry in embedding space → alignment on curved manifolds
      │       └─ Connection: Gemini's embedding consistency and representation curvature
      │
      ├── Model Compression & Distillation
      │       ├─ Knowledge transfer from large to small models (teacher → student)
      │       ├─ Soft vs. hard labels → probabilistic vs. symbolic supervision
      │       ├─ LoRA / Adapter-based fine-tuning → parameter-efficient adaptation
      │       ├─ Trade-offs: accuracy ↔ latency ↔ memory footprint ↔ energy efficiency
      │       └─ Relevance: LearnLM and Gemini use adapter-tuned submodels for learning tasks
      │
      └── ML Engineering & Responsible AI
              ├─ Data pipelines, reproducibility, evaluation, and continuous integration
              ├─ Monitoring, checkpointing, scalable deployment on distributed accelerators
              ├─ Safety alignment and interpretability: understanding model decisions
              ├─ Evaluation beyond accuracy: robustness, fairness, value alignment
              └─ Ethical ML engineering: accountability and transparency in large systems

Time

Big-O Name Typical Example
O(1) Constant time Accessing array element
O(log n) Logarithmic time Binary search
O(n) Linear time Single loop through array
O(n log n) Linearithmic time Merge sort, Quick sort
O(n²) Quadratic time Nested loops, Bubble sort
O(n³) Cubic time Triple nested loops
O(2ⁿ) Exponential time Subset / permutation generation
O(n!) Factorial time Traveling Salesman, N-Queens
O(b^d) Branching search DFS in state tree with branching b and depth d


O(n):                 O(n log n):                   O(n²):
loop → → →            divide → sort → merge         double loop → compare all
(one pass)            (log layers × n work)         (each pair compared)
linear scan           merge / quick sort            bubble / selection sort

Space

Big-O Name Typical Example
O(1) Constant space In-place swap, variable assignment
O(log n) Logarithmic space Recursive binary search
O(n) Linear space Storing array, dynamic programming 1-D
O(n²) Quadratic space 2-D matrix, Floyd-Warshall DP
O(n³) Cubic space 3-D DP table
O(2ⁿ) Exponential space Memoization of all subsets


Data Loader

Stage Code Section Padding Applied Explanation
① Dataset structure check os.walk() file scan No Only scans file names, counts, and sizes.
② Load audio–text pairs pairs = load_audio_text_pairs(DATA_DIR) No Generates file paths, no tensor involved.
③ Build Dataset dataset = LibriSpeechLocalDataset(pairs, processor) Not yet Each sample is returned separately, no unified length.
④ Build DataLoader train_loader = DataLoader(...) Yes (here) Padding is applied when combining samples into a batch.
⑤ Train model for step, batch in enumerate(train_loader): Already padded Batch tensors have equal dimensions for training.
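
A minimal sketch of stage ④: padding happens only in the collate_fn when single samples are combined into a batch. The toy dataset below stands in for LibriSpeechLocalDataset(pairs, processor); shapes and names are assumptions.

import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

# Toy variable-length samples: (feature tensor [T_i, 80], label tensor [U_i])
dataset = [(torch.randn(t, 80), torch.randint(0, 100, (t // 40,))) for t in (120, 200, 160)]

def collate_fn(batch):
    feats, labels = zip(*batch)                      # per-sample tensors of different lengths
    feats = pad_sequence(feats, batch_first=True)    # pad the time dimension to the longest sample
    labels = pad_sequence(labels, batch_first=True, padding_value=-100)  # -100 is ignored by CE loss
    return feats, labels

train_loader = DataLoader(dataset, batch_size=3, shuffle=True, num_workers=0, collate_fn=collate_fn)
feats, labels = next(iter(train_loader))
print(feats.shape, labels.shape)                     # equal-length batch tensors, e.g. [3, 200, 80]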

Protocol and Ports

[You: MacBook]  ←→  [Encrypted Tunnel (AES)]  ←→  [Mac Studio Server]
         ↕                             ↕
   ssh user@ip_address         sshd (daemon listening on port 22)
Protocol Port Purpose
HTTP 80 Web traffic
HTTPS 443 Secure web traffic
FTP 21 File transfer
SSH 22 Secure remote shell


Function Command Example Description
Remote Login ssh user@192.xxx.x.xx Open a command-line session on a remote computer
File Transfer scp file.txt user@host:/path/ Securely copy a file to a remote machine
Port Forwarding ssh -L 8080:localhost:80 user@host Forward a local port to a remote service through an encrypted tunnel
Passwordless Login Public key authentication (~/.ssh/id_rsa.pub) Automatically authenticate using key pairs
Automation Control Use SSH to execute commands or sync data in bulk Common in DevOps or HPC environments


Optimizers

Era Optimizer Year Core Innovation Key Equation / Concept Limitation Solved Remarks
Classical GD Gradient Descent (GD) 1951 Update weights along the negative gradient direction \(w_{t+1} = w_t - \eta \nabla L(w_t)\) None (too simple) Foundation of all optimizers
Stochastic Gradient Descent (SGD) 1983 Uses random mini-batches to improve efficiency \(\nabla L(w_t) \approx \frac{1}{\lvert B \rvert} \sum_{i \in B} \nabla L_i(w_t)\) High variance and slow convergence Enables online / large-scale learning
Momentum Era SGD + Momentum 1989 Adds velocity term to accumulate past gradients \(v_t = \beta v_{t-1} + (1-\beta)\nabla L(w_t), \quad w_{t+1} = w_t - \eta v_t\) Oscillations in narrow valleys Faster convergence, physics-inspired
Nesterov Accelerated Gradient (NAG) 1991 Looks ahead using gradient of estimated future position \(v_t = \beta v_{t-1} + (1-\beta)\nabla L(w_t - \eta \beta v_{t-1})\) Overshooting in Momentum Smoother convergence and stability
Adaptive Learning Adagrad 2011 Per-parameter adaptive learning rate \(G_t = \sum_{\tau=1}^{t} g_\tau^2, \quad \eta_{t,i} = \frac{\eta}{\sqrt{G_{t,i}}+\epsilon}\) Manual learning rate tuning Excellent for sparse features (NLP)
RMSProp 2012 Exponentially weighted moving average of squared gradients \(v_t = \rho v_{t-1} + (1-\rho)g_t^2, \quad w_{t+1} = w_t - \frac{\eta}{\sqrt{v_t+\epsilon}}g_t\) Adagrad's decaying rate problem Stable for non-stationary objectives
Modern Standard Adam 2014 Combines Momentum and RMSProp \(m_t = \beta_1 m_{t-1} + (1-\beta_1)g_t, \quad v_t = \beta_2 v_{t-1} + (1-\beta_2)g_t^2\) \(\hat{m}_t = \frac{m_t}{1-\beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1-\beta_2^t}, \quad w_{t+1} = w_t - \eta \frac{\hat{m}_t}{\sqrt{\hat{v}_t}+\epsilon}\) Gradient noise and curvature imbalance Default optimizer for most deep networks
AdamW 2017 Decouples weight decay from gradient update \(w_{t+1} = w_t(1-\eta\lambda) - \eta \frac{m_t}{\sqrt{v_t}+\epsilon}\) L2 regularization bias in Adam Default for Transformer / LLM training
Geometry-Aware & Large Batch LARS (Layer-wise Adaptive Rate Scaling) 2018 Layer-wise adaptive learning rate \(\eta_l = \eta \frac{|w_l|}{|g_l|+\epsilon}\) Scale mismatch in large-batch training Used in ResNet / ImageNet large-batch setups
LAMB (Layer-wise Adaptive Moments) 2019 Extends LARS with Adam-style moments \(r_t = \frac{|w_t|}{|\hat{m}_t / (\sqrt{\hat{v}_t}+\epsilon)|}, \quad w_{t+1}=w_t - \eta r_t \frac{\hat{m}_t}{\sqrt{\hat{v}_t}+\epsilon}\) Poor scaling of Adam for huge batches Core optimizer for BERT, GPT
Variance Rectification & Belief Models RAdam 2019 Rectifies variance of adaptive learning rate \(\eta_t = \eta \frac{\sqrt{(1-\beta_2^t)/(1-\beta_2)}}{\sqrt{v_t}+\epsilon}\) Instability in early training More robust warm-up-free Adam
AdaBelief 2020 Tracks belief in gradient direction \(v_t = \beta_2 v_{t-1} + (1-\beta_2)(g_t - m_t)^2\) Gradient over-smoothing Better generalization for small datasets
Second-Order & Natural Gradient K-FAC (Kronecker-Factored Approximate Curvature) 2015–2023 Approximates curvature via blockwise Kronecker products \(F^{-1} \approx A^{-1} \otimes B^{-1}, \quad w_{t+1} = w_t - \eta F^{-1}\nabla L\) Ignores curvature in SGD/Adam Faster convergence, heavy memory use
Shampoo 2021 Matrix preconditioning per layer \(G_t = \sum_{\tau=1}^{t} g_\tau g_\tau^\top, \quad W_{t+1}=W_t - \eta G_t^{-1/2}\nabla L\) Slow convergence on ill-conditioned loss Improves conditioning for large models
Modern LLM Optimizers Lion 2023 Momentum with sign-based updates \(w_{t+1} = w_t - \eta \, \text{sign}(\beta_1 m_t + (1-\beta_1)g_t)\) Over-adaptation of Adam Efficient and strong generalization for LLMs
Sophia 2023 Second-order curvature-aware optimizer \(w_{t+1} = w_t - \eta \frac{g_t}{\sqrt{h_t+\epsilon}}, \quad h_t \approx \text{diag}(H_t)\) Slow convergence in large-scale Adam State-of-the-art for Transformer training
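
A small sketch of the Adam/AdamW rows written as plain tensor updates; the toy gradient and the default-style hyperparameters are illustrative only.

import torch

def adamw_step(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8, wd=0.01):
    m = b1 * m + (1 - b1) * g                  # first moment (Momentum)
    v = b2 * v + (1 - b2) * g * g              # second moment (RMSProp)
    m_hat = m / (1 - b1 ** t)                  # bias correction
    v_hat = v / (1 - b2 ** t)
    w = w * (1 - lr * wd)                      # decoupled weight decay (the AdamW change)
    w = w - lr * m_hat / (v_hat.sqrt() + eps)  # Adam update
    return w, m, v

w = torch.randn(5)
m, v = torch.zeros_like(w), torch.zeros_like(w)
for t in range(1, 4):
    g = 2 * w                                  # gradient of the toy loss ||w||^2
    w, m, v = adamw_step(w, g, m, v, t)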


Riemannian Projector, Geodesic Loss

import torch
import torch.nn as nn
import torch.nn.functional as F

class RiemannianProjector(nn.Module):
    def __init__(self, in_dim=768, out_dim=1280):
        ...
    def forward(self, x):
        x = self.map(x)                   # linear projection defined in __init__ (elided above)
        return F.normalize(x, dim=-1)     # map onto the unit hypersphere

# Geodesic distance between the unit vectors x and y
cos_sim = (x * y).sum(-1).clamp(-1 + 1e-7, 1 - 1e-7)
loss = torch.acos(cos_sim).mean()


Teacher (Whisper-large-v2, frozen)
        │
        ▼
Student (<Structure-free> Student from the teacher + LoRA adapters)
        │
        ├── CE loss (labels supervision)
        │       ↑
        │       └── Hard labels = ground truth text
        │           (e.g. "Hello world" from dataset)
        │
        ├── KL loss (soft logits distillation)
        │       ↑
        │       └── Soft labels = teacher's predicted probabilities
        │           (e.g. P("hello")=0.62, P("hey")=0.31, P("halo")=0.07)
        │
        └── Geo loss (Riemannian alignment)
                ↑
                └── Aligns latent embeddings on a curved manifold
                    (ensures student follows teacher's geometry)
        ↓
   Optimizer (AdamW + Cosine LR)
        ↓
  LoRA Adapter Checkpoint
        ↓
Evaluation (WER / RTF / Memory)


s_hid = student_proj(s_out.encoder_last_hidden_state)
t_hid = normalize(t_out.encoder_last_hidden_state)
geo = geodesic_distance_on_sphere(s_hid, t_hid)


TAID

Initial training (step=0): λ=0.1, intermediate = 0.9 * student_probs + 0.1 * teacher_probs

→ Mainly learn the student's own distribution

Mid-training (step=400): λ≈0.5, intermediate = 0.5 * student_probs + 0.5 * teacher_probs

→ Balanced learning

Late training (step=800): λ=0.9, intermediate = 0.1 * student_probs + 0.9 * teacher_probs

→ Mainly learn the teacher's distribution
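
A short sketch of the TAID schedule above: the distillation target is an interpolation between the student's and teacher's distributions with a λ that grows over training. The linear 0.1 → 0.9 schedule over 800 steps and T = 2.0 are assumptions matching the notes, not the exact training code.

import torch
import torch.nn.functional as F

def taid_target(student_logits, teacher_logits, step, total_steps, T=2.0):
    lam = min(0.9, 0.1 + 0.8 * step / total_steps)         # 0.1 -> 0.9 over training
    p_student = F.softmax(student_logits.detach() / T, dim=-1)
    p_teacher = F.softmax(teacher_logits / T, dim=-1)
    return (1 - lam) * p_student + lam * p_teacher          # intermediate distribution

def taid_kl_loss(student_logits, teacher_logits, step, total_steps, T=2.0):
    target = taid_target(student_logits, teacher_logits, step, total_steps, T)
    log_q = F.log_softmax(student_logits / T, dim=-1)
    return F.kl_div(log_q, target, reduction="batchmean") * (T ** 2)

s, t = torch.randn(2, 10), torch.randn(2, 10)
print(taid_kl_loss(s, t, step=400, total_steps=800).item())  # λ ≈ 0.5 at mid-training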


Background Knowledge

RL On Diffusion

I. Base Diffusion Backbone (Generative Prior)

Input (x₀ = real data sample: image, trajectory, audio, 3D scene)
      ↓
Forward Diffusion Process (adds Gaussian noise)
      ↓
x₁ ← √α₁·x₀ + √(1−α₁)·ε₁
x₂ ← √α₂·x₁ + √(1−α₂)·ε₂
⋮
x_T ≈ pure Gaussian noise N(0, I)
      ↓
Reverse Denoising Process (parameterized by neural network ε_θ)
      ↓
x_{t−1} = (x_t − √(1−α_t)·ε_θ(x_t, t, cond)) / √α_t + η·σ_t
      ↓
UNet / Transformer backbone → learns to reconstruct x₀
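
A toy sketch that mirrors the simplified updates in the diagram above; the noise schedule and the zero-prediction "network" are placeholders, and a real DDPM sampler uses the exact posterior coefficients rather than this shorthand.

import torch

T = 10
alphas = torch.linspace(0.99, 0.90, T)        # assumed noise schedule
sigmas = (1 - alphas).sqrt()

def forward_noise(x0):
    x = x0
    for t in range(T):                        # x_t = sqrt(a_t)*x_{t-1} + sqrt(1-a_t)*eps_t
        x = alphas[t].sqrt() * x + (1 - alphas[t]).sqrt() * torch.randn_like(x)
    return x

def reverse_denoise(xT, eps_theta, cond=None, eta=0.0):
    x = xT
    for t in reversed(range(T)):              # x_{t-1} = (x_t - sqrt(1-a_t)*eps_theta)/sqrt(a_t) + eta*sigma_t*z
        eps = eps_theta(x, t, cond)
        x = (x - (1 - alphas[t]).sqrt() * eps) / alphas[t].sqrt() + eta * sigmas[t] * torch.randn_like(x)
    return x

x0 = torch.randn(4, 16)                                                           # toy "clean" data
x0_hat = reverse_denoise(forward_noise(x0), lambda x, t, c: torch.zeros_like(x))  # placeholder network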

II. Policy Representation via Diffusion

Environment State s_t
      ↓
Noise z_t ~ N(0, I)
      ↓
Diffusion Policy Network ε_θ(s_t, z_t, t)
      ↓
Sample Action a_t = Denoise(z_t | s_t)
      ↓
Execute Action in Environment → Receive Reward r_t
      ↓
Collect Trajectory τ = {s_t, a_t, r_t}

Project 1 Visualization

IV. Reward-Guided Diffusion Training (Diffusion Policy Optimization)

For each episode:
  1. Sample noise x_T ~ N(0, I)
  2. Run reverse diffusion (ε_θ) conditioned on state s_t
  3. Generate predicted action trajectory x₀
  4. Execute in environment → collect reward R
  5. Compute loss:
         L_total = L_diffusion + λ·L_RL
         L_RL = − E[R(τ)]
  6. Backpropagate through ε_θ network


Diffusion Policy, Decision Diffuser

Random Noise in Action Space
      ↓
Diffusion or Flow Process
      ↓
Denoising Steps / Continuous Flow
      ↓
Policy Network predicts ε_θ(x_t, t)
      ↓
Clean Action Sequence (Optimal Trajectory)
      ↓
Execute in Environment (Robotics / Control)

Function Formula Derivative Core Idea Usage / Notes
Sigmoid \(f(x) = \frac{1}{1 + e^{-x}}\) \(f'(x) = f(x)\,[1 - f(x)]\) Smooth bounded mapping (0, 1) Common in probabilistic outputs
Tanh \(f(x) = \tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}\) \(f'(x) = 1 - f(x)^2\) Zero-centered output Improves symmetry over Sigmoid
ReLU \(f(x) = \max(0,\,x)\) \(f'(x)=\begin{cases}1,&x>0\\0,&x\le0\end{cases}\) Sparse and efficient Fast convergence, stable training
Leaky ReLU \(f(x)=\max(\alpha x,\,x)\) piecewise constant Avoids dead neurons Small negative slope for x < 0
Swish / SiLU \(f(x)=x\,\sigma(x),\ \sigma(x)=\frac{1}{1+e^{-x}}\) \(f'(x)=\sigma(x)+x\,\sigma(x)[1-\sigma(x)]\) Smooth, self-gated ReLU Used in Google EfficientNet
Mish \(f(x)=x\,\tanh(\ln(1+e^x))\) smooth Non-monotonic, better gradient flow Used in YOLOv4, ResNet variants
GELU \(f(x)=x\,\Phi(x),\ \Phi(x)\text{: Gaussian CDF}\) smooth Probabilistic gating Default in Transformers (BERT, GPT)
JumpReLU (DeepMind) \(f(x)=\max(0,\,x-j),\ j\text{ learned}\) piecewise constant Learnable sparsity threshold Used in Sparse Autoencoders for interpretability
Softmax \(f_i(x)=\frac{e^{x_i}}{\sum_j e^{x_j}}\) \(\frac{\partial f_i}{\partial x_j} = f_i(\delta_{ij} - f_j)\) Converts logits → probabilities Standard output for classification
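
The formulas from the table written out directly as a reference sketch (not optimized kernels). JumpReLU follows the max(0, x − j) form given above with a fixed j; in practice j would be learned.

import torch
import torch.nn.functional as F

def sigmoid(x):  return 1.0 / (1.0 + torch.exp(-x))
def swish(x):    return x * sigmoid(x)                              # SiLU
def mish(x):     return x * torch.tanh(torch.log1p(torch.exp(x)))
def gelu(x):     return x * 0.5 * (1.0 + torch.erf(x / 2 ** 0.5))   # exact Gaussian-CDF form
def jump_relu(x, j=0.1):
    return torch.clamp(x - j, min=0.0)                              # threshold j is learned in practice

x = torch.linspace(-3, 3, 7)
print(torch.allclose(swish(x), F.silu(x)))   # True
print(torch.allclose(gelu(x), F.gelu(x)))    # True (exact, non-approximate variant)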


Learning Rates

Trend Description Representative Systems
Cosine + Warmup → Standard Default Most stable across architectures. ViT, GPT-J, Whisper, Stable Diffusion
Adaptive + Restart Hybrids Combine SGDR + ReduceLROnPlateau. DeepSpeed, Megatron-LM, PaLM 2
Optimizer-Integrated Scheduling Scheduler coupled with optimizer (AdamW, LAMB). GPT-4, Gemini 1.5, Claude 3
Noisy / Stochastic Schedules Inject noise to encourage flat minima. Google Brain NAS, RL-based training
Dynamic Data-Aware LR Control LR adapted by validation loss or gradient norm. Reinforcement fine-tuning (RLHF, PPO)


Scaling Law

Year Model Number of Layers Parameter Count FLOPs (per inference) Activations (per forward pass) Typical Memory Footprint
1998 LeNet 5 ~0.1 M ~0.001 GFLOPs < 1 MB < 10 MB
2012 AlexNet 8 60 M ~1.5 GFLOPs ~100 MB ~1 GB
2015 VGG-16 16 138 M ~15 GFLOPs ~200 MB ~2–4 GB
2016 ResNet-152 152 60 M ~11 GFLOPs ~250 MB ~4–6 GB
2018 BERT-Large 24 340 M ~180 GFLOPs ~1 GB ~10–12 GB
2020 GPT-3 96 175 B ~3.1 × 10¹² FLOPs ~20 GB ~350 GB (weights) / > 1 TB (training)
2024 GPT-4 / Gemini 1.5 / Claude 3 ~120–200 > 1 T (trillion) ~10¹³–10¹⁴ FLOPs > 50 GB (activations) Multiple TB (large-scale training)


Generalization and Regularization

Underfitting:     Overfitting:        Good Embedding:
 • • • • •        ●●●  ○○○  ▲▲▲       ● ●   ○ ○   ▲ ▲
 ○ ○ ○ ○ ○        (tight) (tight)      (clear but smooth)
 ▲ ▲ ▲ ▲ ▲        val points outside   val & train overlap
Principle Intuition
Regularization = adding controlled noise or constraints to prevent memorization. Introduces noise or limits (e.g., dropout, weight decay, data augmentation) so the model learns general patterns instead of memorizing the training set.
Overfitting = perfect fit on training data, poor generalization. The model minimizes training loss too well, capturing noise instead of true structure, which leads to poor performance on unseen data.
Goal = flatter minima + smoother decision boundaries. Seek regions in the loss landscape where small parameter changes do not greatly affect loss, resulting in more stable, generalizable models.


CNN

[Input  D×E  (image or signal)]
      │
      ▼
[Convolution  U×V  (kernel/filter)]
      │  learns local spatial patterns
      │  parameters ≪ fully-connected layers
      ▼
[Zero-Padding / Stride Control]
      │
      ├─ Padding → keeps size (same)
      └─ Stride  → downsamples (D−U)/S+1
      ▼
[Feature Map  K×M  (activation before nonlinearity)]
      │
      ▼
[Activation  g(a)  → ReLU / Sigmoid / Tanh]
      │
      ▼
[Pooling  R×R  window (Avg / Max / Global)]
      │
      ├─ replaces stride for down-sampling
      ├─ reduces spatial size, increases receptive field
      └─ enhances translation invariance
      ▼
[Stacked Conv + Pooling Layers]
      │
      ├─ small kernels (3×3) + pooling ⇒ large receptive field
      ├─ more layers > larger kernels (prefer depth)
      └─ weights grow linearly w/ layers
      ▼
[Flatten or Global Pooling]
      │
      ├─ flatten:  A ∈ ℝ^{Q×K×M} → a ∈ ℝ^{Q·K·M}
      └─ global pooling:  spatial avg → a ∈ ℝ^{Q}
      ▼
[Fully-Connected Layer + Loss]
      │
      ├─ Regression → J_L2
      ├─ Binary → J_BCE
      └─ Categorical → Softmax + J_CCE
      ▼
[Output Prediction y  / Class Probabilities]

Forward Pass

Input (32×32×3)
      ↓
Conv (3×3 kernel, 16 filters)
      ↓
ReLU activation
      ↓
Max Pooling (2×2)
      ↓
Conv (3×3 kernel, 32 filters)
      ↓
ReLU
      ↓
Global Avg Pooling
      ↓
Flatten → Dense (Fully-connected)
      ↓
Softmax → [Cat, Dog, Car, …]
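
The forward pass above as a small PyTorch model; filter counts follow the diagram, while the 10-class head and batch size are illustrative.

import torch
from torch import nn

model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),   # 32x32x3  -> 32x32x16
    nn.ReLU(),
    nn.MaxPool2d(2),                              #          -> 16x16x16
    nn.Conv2d(16, 32, kernel_size=3, padding=1),  #          -> 16x16x32
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),                      # global average pooling -> 1x1x32
    nn.Flatten(),                                 #          -> 32
    nn.Linear(32, 10),                            # dense head, e.g. 10 classes
)

x = torch.randn(4, 3, 32, 32)                     # batch of 4 RGB images
probs = torch.softmax(model(x), dim=-1)           # class probabilities
print(probs.shape)                                # torch.Size([4, 10])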


Optimization for Training

Stage Method Purpose / Effect
Initialization Stage Xavier / He initialization Avoid falling into poor regions at the start
Early Exploration Stage Large learning rate + Momentum Maintain global exploration ability
Mid Convergence Stage Adam / RMSProp + Cosine Annealing Ensure smooth descent and curvature adaptation
Late Fine-tuning Stage SAM / Entropy-SGD / Weight Decay Locate flat minima and enhance generalization
During Training Mini-batch noise + Dropout Prevent getting stuck at saddle points
Architectural Level Residual connections / Normalization layers Improve gradient flow and smooth the optimization landscape


Normalization and Regularization in different Model Structures

Item L1 Regularization L2 Regularization
Shape Diamond-shaped constraint Circular constraint
Optimum Point Usually lies on the coordinate axes (sparse solution) Usually lies on the circle (continuous shrinkage)
Result Some weights are "cut" to exactly 0 All weights are smoothly reduced but remain non-zero
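
A small sketch showing how the two penalties enter the loss; the λ values are illustrative, and in practice the L2 term is usually applied as weight_decay in the optimizer instead.

import torch
from torch import nn
import torch.nn.functional as F

model = nn.Linear(20, 1)
x, y = torch.randn(8, 20), torch.randn(8, 1)
task_loss = F.mse_loss(model(x), y)

l1 = sum(p.abs().sum() for p in model.parameters())    # diamond constraint -> some weights cut to 0
l2 = sum(p.pow(2).sum() for p in model.parameters())   # circular constraint -> smooth shrinkage
loss = task_loss + 1e-4 * l1 + 1e-4 * l2
loss.backward()

# Equivalent L2 behaviour via decoupled weight decay in the optimizer:
opt = torch.optim.SGD(model.parameters(), lr=1e-2, weight_decay=1e-4)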

Project 1 Visualization

Model Example Normalization Regularization Essence & How It Works
CNN (e.g., ResNet) Batch Normalization: normalizes activations within a mini-batch to stabilize gradients and speed up convergence. Weight Decay + Dropout: penalizes large weights and randomly drops neurons to reduce overfitting. Normalization equalizes feature scales during training, while Regularization constrains model capacity to improve generalization.
Transformer / LLM Layer Normalization: normalizes hidden states across features to maintain stable activations in deep attention layers. Attention Dropout + L2 Regularization: randomly masks attention links and adds weight penalties to prevent overfitting. Normalization stabilizes internal representations; regularization prevents memorization of training data.
MLP Input Standardization: rescales each input feature to zero mean and unit variance. L2 Regularization (Ridge): discourages large parameter magnitudes for smoother mappings. Normalization improves numerical stability; regularization enforces simpler models with better generalization.


Optimized Decoding

Classical Decoding (Without KV Cache)             Optimized Decoding (With KV Cache)

 ┌───────────────┐                                ┌────────────────────────┐
 │   Decoder     │                                │   Decoder + KV Cache   │
 │  (Self-Attn)  │                                │  (Self-Attn + Storage) │
 └───────┬───────┘                                └──────────┬─────────────┘
         │                                                   │
         ▼                                                   ▼
 ┌───────────────┐                                ┌─────────────────────────┐
 │ Recompute all │   O(n²) per step               │  Reuse stored K/V       │
 │ past tokens   │ -----------------------------> │  Only new Q calculated  │
 │ at every step │                                │  O(n) per step          │
 └───────────────┘                                └─────────────────────────┘
         │                                                     │
         ▼                                                     ▼
     ┌─────────┐                                       ┌───────────────┐
     │ Latency │                                       │  Low Latency  │
     │  High   │                                       │  On-Device OK │
     └─────────┘                                       └───────────────┘

   - Redundant computation                           - No recomputation
   - High memory bandwidth                           - Lower memory & power
   - Slow inference                                  - Faster inference
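
A toy single-head sketch of the KV-cache idea above; real decoders add multi-head projections, masking, and batched shapes, so this is only meant to show why each step becomes O(n).

import torch
import torch.nn.functional as F

d = 16
Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))
K_cache, V_cache = [], []

def decode_step(x_new):
    # Only the new token's Q/K/V are computed; past K/V come from the cache -> O(n) per step
    q = x_new @ Wq
    K_cache.append(x_new @ Wk)
    V_cache.append(x_new @ Wv)
    K, V = torch.stack(K_cache), torch.stack(V_cache)   # [n, d], grows by one row per step
    attn = F.softmax(q @ K.T / d ** 0.5, dim=-1)        # attend over all cached positions
    return attn @ V

for _ in range(5):                                      # autoregressive loop over 5 toy tokens
    out = decode_step(torch.randn(d))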


Training Loop
    ↓
[ Forward pass ]
    ↓
[ Compute loss ]
    ↓
[ Backward pass: compute gradients ]
    ↓
[ **Gradient Clipping** ]       ← `clip_grad_norm_(model.params, max_norm)`
    ↓
[ **AdamW Update** ]            ← `optimizer = AdamW(lr=…, weight_decay=…)`
    ↓
[ Zero Gradients ]              ← `optimizer.zero_grad()`
    ↓
[ **Cosine LR Annealing** ]     ← `scheduler = CosineAnnealingLR(optimizer, T_max, eta_min)`
    ↓
[ Next batch ]


Single-card peak ≈ 13–15 GB VRAM → start your experiments with FP16 + AMP

Whisper large-v3 has the same architecture as the previous large and large-v2 models, except for the following minor differences:

  1. The spectrogram input uses 128 Mel frequency bins instead of 80
  2. A new language token for Cantonese
  3. Each token output by attention carries global context, while the FFN applies position-wise "fine-tuning" (feature combination) to each token to improve the feature quality at each position


Transformer Assembly Line  
═══════════════════════════════════════════════════════
Raw Donuts → Community Check → Solo Decoration → Finished Donuts
(Input)       (Attention)       (FFN)            (Output)
    ↓             ↓                ↓               ↓
┌─────────┐  ┌──────────────┐  ┌──────────────┐    ┌─────────┐
│ Plain   │→ │  Community   │→ │   Solo       │ →  │Gourmet  │
│ Donuts  │  │  Analysis    │  │ Decoration   │    │Donuts   │
└─────────┘  └──────────────┘  └──────────────┘    └─────────┘
   ↓                ↓                 ↓                 ↓
   X₀          X₁ = Attention     X₂ = FFN            Output

 1. X₁ᵢ = Σⱼ αᵢⱼ × V_j               (Global Linear)
 2. X₂ᵢ = W₂·ReLU(W₁·X₁ᵢ + b₁) + b₂  (Local Nonlinear)

Attention: Convex combination → stays within the input space
FFN: Nonlinear transformation → can transcend the input space

Activation Function Characteristics Comparison
═════════════════════════════════════════════════
┌──────────┬────────────┬───────────────┬──────────────┬─────────────┐
│Function  │ Smoothness │ Computational │ Gradient     │ Performance │
│          │            │ Complexity    │ Properties   │             │
├──────────┼────────────┼───────────────┼──────────────┼─────────────┤
│ ReLU     │ Non-smooth │ Minimal       │ Sparse       │ Baseline    │
│ GELU     │ Smooth     │ Moderate      │ Dense        │ Better      │
│ SwiGLU   │ Smooth     │ High          │ Gated        │ Best        │
│ Mish     │ Very smooth│ High          │ Adaptive     │ Very Good   │
│ Swish    │ Smooth     │ Moderate      │ Self-gated   │ Good        │
│ ELU      │ Smooth     │ Moderate      │ Negative-safe│ Good        │
└──────────┴────────────┴───────────────┴──────────────┴─────────────┘


Input Features From Whisper

Project 1 Visualization


Fourier Transform

        ┌───────────────────────────────┐
        │        Original Domain        │
        │  - Pixels (Images)            │
        │  - Samples (Audio, Signals)   │
        │  - Tokens (Text)              │
        └───────────────────────────────┘
                       │
                       ▼  Fourier Transform
        ┌───────────────────────────────┐
        │        Frequency Domain       │
        │  - Low frequencies → smooth   │
        │  - High frequencies → edges   │
        │  - Harmonics → fine details   │
        └───────────────────────────────┘
                       │
        ┌──────────────┼────────────────────┐
        ▼              ▼                    ▼
   ┌───────────┐  ┌─────────────┐      ┌─────────────┐
   │   CNNs    │  │ Transformers│      │ Speech/Image│
   └───────────┘  └─────────────┘      └─────────────┘
   - Edges = HF   - Sinusoidal pos.    - STFT / spectrogram
   - Smooth = LF    encoding           - Highlight textures
   - Convolutions   (frequency basis)  - Recognize phonemes
     simplified                        - Detect fine image details
     in frequency
     space
Teacher (Whisper-large-v2)                     Student
─────────────────────────────                   ──────────────────────────────────────────────────────

Audio Input                                     Audio Input
1 × T samples                                   1 × T samples
     │                                               │
     ▼                                               ▼
Whisper Encoder                                 Whisper Encoder
1280-d hidden, T~1500 frames                    768-d hidden, T~499 frames
(32 layers, FROZEN)                             (12 layers, FROZEN)
     │                                               │
     │                                               │
     ├─────── Hidden States ──────────────────────── ├─── Projection Layer ───┐
     │        (B,1500,1280)                          │    (768→1280)          │
     │                                               │                        │
     ▼                                               ▼                        ▼
Whisper Decoder                                 Whisper Decoder              Aligned Hidden
(32 layers, FROZEN)                             (4 layers + LoRA)            (B,1500,1280)
     │                                               │                        │
     │                                               │                        │
     ▼                                               ▼                        │
Teacher Logits ────── Soft Targets ─────────▶ Student Logits                  │
(B,seq,vocab)         (KL Loss)               (B,seq,vocab)                   │
     │                 T=temperature              │                           │
     │                                            │                           │
     │                                            ▼                           │
     │                                      Hard Labels ◀── Ground Truth      │
     │                                      (CTC Loss)                        │
     │                                            │                           │
     │                                            │                           │
     │                                            ▼                           │
     │                                       Student Loss ◀──── MSE Loss ─────┘
     │                                            │         (Hidden Align)
     │                                            │
     ▼                                            ▼
No parameter updates                        (LoRA) + Projection parameters
(Inference only)                             ONLY these are trained

Why T≈499

  • Whisper feature extractor
  • The original audio (30 s) generates about 3000 frames of 80-dimensional log-Mel features at a granularity of 10 ms per frame
    • Whisper first divides the original mono audio (30 seconds, 16 kHz) into several short segments
    • Generate an 80-dimensional log-Mel feature every 10 ms
    • 30 s / 0.01 s = 3000 frames
    • These 3000 frames are still very dense. If Transformer processes them directly, the computational workload and memory requirements will be too high
  • Before being fed into the Transformer encoder, these 3000 frames are first downsampled through a convolutional layer (stride=2), and then continuously merged or downsampled in the multi-layer Transformer block
  • The final output length is about 3000 / 2 / 3 = 500 frames (actually 499 frames)
30 s audio
    ⇓ (extract 80-dim log-Mel every 10 ms)
3000 frames
    ⇓ (convolutional layer with stride=2)
1500 frames
    ⇓ (further down-sampling/merging inside the Transformer encoder, ≈×3)
    ⇓ (Pooling or Conv1d: kernel_size=3, stride=3)
≈500 frames  (actually 499 frames)
  • Audio signal characteristics (redundancy) -> why the sequence can be compressed to T~499 frames
1. Audio frame rate is typically high
sample_rate = 16000      # 16 kHz sampling rate
frame_rate = 100         # 100 frames per second
frame_duration = 10      # 10 ms per frame

2. 30 seconds of audio
total_frames = 30 * frame_rate  # 3000 frames

3. Adjacent frames are highly correlated
correlation_coefficient ≈ 0.9  # typical inter-frame correlation
  • Always remember to do Automatic checkpoint saving
  • !pip install -U bitsandbytes>=0.41.0
  • Put Your Teacher model on CPU
  • MIN_DURATION = 1.0
  • MAX_DURATION = 30.0 # Same as Whisper's maximum acceptance length


Original LoRA Paper

ΔW = A · B -> only low-rank increments are made to W_q and W_v in the attention


Choices of LoRA Injection

decoder.layers.*.encoder_attn.q_proj
decoder.layers.*.encoder_attn.v_proj
decoder.layers.*.self_attn.q_proj
decoder.layers.*.self_attn.v_proj
decoder.layers.*.encoder_attn.q_proj, encoder_attn.k_proj, encoder_attn.v_proj
decoder.layers.*.self_attn.q_proj, self_attn.k_proj, self_attn.v_proj
decoder.layers.*.fc2
decoder.layers.*.encoder_attn.q_proj, encoder_attn.k_proj, encoder_attn.v_proj, encoder_attn.out_proj
decoder.layers.*.self_attn.q_proj, self_attn.k_proj, self_attn.v_proj, self_attn.out_proj
decoder.layers.*.fc1, fc2
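
A hedged sketch of one injection choice from the list above using the PEFT library. whisper-tiny stands in for the student, the rank/alpha/dropout values are illustrative, and matching by the q_proj / v_proj suffix also touches encoder attention unless fuller module paths are given in target_modules.

from transformers import WhisperForConditionalGeneration
from peft import LoraConfig, get_peft_model

base = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")  # stand-in for the student
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # matches *.self_attn / *.encoder_attn q_proj and v_proj
    bias="none",
)
student = get_peft_model(base, lora_config)
student.print_trainable_parameters()       # only the low-rank A·B increments are trainable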


Temperature

  • Initial pilot temperature: T =
  • Search range: [ ]
  • Optuna hyperparameter: include temp as a tunable parameter
  • Guidance: prevent over-smoothing (i.e. avoid T > 5)


Hard vs. Soft Labels in Knowledge Distillation

  • Hard Labels: one-hot vectors from ground truth
    y = [0, …, 1, …, 0]
    • Strong supervision → binary certainty
    • Forces correct classification

  • Soft Labels: teacher's softmax outputs
    p_teacher = [0.6, 0.3, 0.1]
    • Confidence & uncertainty
    • Encodes inter-class similarity


Why num_workers Affects GPU Performance

The num_workers parameter in PyTorch DataLoader controls the number of CPU processes responsible for data loading and preprocessing. This directly impacts GPU utilization through data pipeline optimization


Performance Comparison

Single-threaded (num_workers=0)

  • CPU: Load→Preprocess→Transfer, GPU idle, Load→Preprocess→Transfer
  • GPU: Idle, Compute, Idle

Multi-threaded (num_workers=4)

  • CPU: Continuous data preparation (4 parallel threads)
  • GPU: Continuous computation (minimal idle time)

Key Insight

  • Increasing num_workers enhances "CUDA kernel parallelism" not by adding GPU parallelism, but by eliminating GPU starvation. Multiple CPU workers ensure the GPU receives a steady stream of preprocessed data, maximizing hardware utilization and reducing training time
  • The optimal num_workers typically ranges from 2-4 per GPU, depending on CPU core count and I/O bottlenecks


CTC Loss - Hard Supervision - Cross-Entropy (CE) Loss Is Used Here, since Whisper Is Seq2Seq with a Decoder

Since Whisper is a Seq2Seq model with a decoder, cross-entropy loss is employed here.

The decoder generates hidden state sequences at step $u$: \(\{\mathbf{d}_u\}_{u=1}^U\)

mapping to the target text sequence: \(\{y_u\}_{u=1}^U\)

using token-by-token one-to-one supervision:

  • Token-to-Token Alignment Each step has a clear "correct" next token, requiring no implicit alignment
  • One-Step Supervision Cross-entropy is directly applied to the prediction distribution at each position $u$
  • Direct Gradient Backpropagated from the output layer, enabling stable convergence

Cross-Entropy Loss Formula \(\mathcal{L}_{\mathrm{CE}} = -\sum_{u=1}^U \log P_\theta\bigl(y_u \mid y_{<u}, \mathbf{h}_{1:T}\bigr)\)

where:

  • $\mathbf{h}_{1:T}$ represents the audio representation output by the encoder
  • $y_{<u}=(y_1,\dots,y_{u-1})$ are the previously generated tokens
  • $U$ is the target sequence length

Following the encoder's output audio frame sequence: \(\{\mathbf{h}_t\}_{t=1}^T\)

mapping to transcript tokens: \(\{y_u\}_{u=1}^U\)

without explicit frame-level labels:

  • Frame-to-Token Alignment Automatic alignment from audio frames to text tokens
  • Marginalizing Paths Marginalizing over all possible alignment paths
  • Gradient Signal Gradient signals propagate to all relevant audio frames through attention mechanisms
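
A minimal sketch of the token-level cross-entropy term defined above; shapes are assumed as [B, U, vocab] logits and [B, U] label ids, with -100 marking padded positions.

import torch
import torch.nn.functional as F

def ce_loss(student_logits, labels, pad_id=-100):
    return F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),  # flatten to [B*U, vocab]
        labels.view(-1),                                    # flatten to [B*U]
        ignore_index=pad_id,                                # padded positions contribute nothing
    )

logits = torch.randn(2, 5, 100)                             # toy batch: B=2, U=5, vocab=100
labels = torch.randint(0, 100, (2, 5))
print(ce_loss(logits, labels).item())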


KL Distillation Loss - Soft Supervision

KL Distillation Loss compares the teacher's and student's posterior distributions over labels at each time-step in latent space

  • Soft Distribution Matching
  • Preference Transfer
  • Capturing Uncertainty

Since the softmax outputs retain probabilities for all tokens, the KL term transfers the teacher's uncertainty patterns; e.g., when the teacher is unsure between two phonemes, the student learns to mirror that ambiguity

Total Loss

\[L_{\text{total}} = L_{\mathrm{CE}} + 0.xx\,T^{2}\,L_{\mathrm{KD}} + \alpha\,L_{\mathrm{hidden\_align}}\]

where

\[\begin{aligned} & L_{\mathrm{CE}} && \text{is the hard CE loss}\\ & L_{\mathrm{KD}} = \mathrm{KL}\bigl(p_{\rm teacher}^{T}\;\|\;p_{\rm student}^{T}\bigr) && \text{is the softened KL-divergence loss with temperature } T \text{ and weight } 0.8 \text{ (the same as the student backbone)}\\ & L_{\mathrm{hidden\_align}} && \text{is the projected hidden-state MSE loss with weight } \alpha \end{aligned}\]



Hyperparameter Optimization

For the 15-hour dataset experiment, we used 50 trials as a "warm-up" run without any problems. If you want to perform large-scale tuning in a production environment, it is recommended to increase n_trials to 50-100

import optuna
from optuna.pruners import MedianPruner
from optuna.samplers import TPESampler

def objective(trial):
    # Distillation loss weights
    alpha = trial.suggest_loguniform("alpha", 1e-3, 1e1)
    beta  = trial.suggest_loguniform("beta",  1e-3, 1e1)
    # Optimization hyperparameters
    lr        = trial.suggest_loguniform("lr",        1e-5, 1e-3)
    batch_size= trial.suggest_categorical("batch_size", [4, 8, 16, 32])
    dropout   = trial.suggest_float("dropout", 0.0, 0.5)
    
    # Train & evaluate with these settings (implement train_and_evaluate accordingly)
    wer = train_and_evaluate(
        alpha=alpha,
        beta=beta,
        learning_rate=lr,
        batch_size=batch_size,
        dropout=dropout,
        pruner=trial  # for early stopping
    )
    return wer

# Pruner to stop unpromising trials early
pruner  = MedianPruner(n_startup_trials=5, n_warmup_steps=100)
sampler = TPESampler()

study = optuna.create_study(
    direction="minimize",
    sampler=sampler,
    pruner=pruner
)
study.optimize(objective, n_trials=100)

print("Best hyperparameters:", study.best_params)


PCA vs. t-SNE vs. UMAP vs. DTW


Project 1 Visualization


- Local weights
w_ij = exp(−(d(x_i, x_j) − ρ_i) / σ_i)
w_ji = exp(−(d(x_j, x_i) − ρ_j) / σ_j)

- Fuse into a single "strength" score
μ_ij = w_ij + w_ji − w_ij * w_ji
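
A small NumPy sketch of the fusion rule above; the per-point ρ and σ are assumed to be given, and UMAP additionally clips d − ρ at zero, which the note's formula omits.

import numpy as np

def local_weight(d_ij, rho_i, sigma_i):
    return np.exp(-max(d_ij - rho_i, 0.0) / sigma_i)    # UMAP clips d - rho at zero

def fused_strength(d_ij, rho_i, sigma_i, rho_j, sigma_j):
    w_ij = local_weight(d_ij, rho_i, sigma_i)
    w_ji = local_weight(d_ij, rho_j, sigma_j)           # d is symmetric; only the local scales differ
    return w_ij + w_ji - w_ij * w_ji                    # probabilistic "fuzzy union" of the two views

print(fused_strength(1.2, rho_i=0.5, sigma_i=0.8, rho_j=0.3, sigma_j=1.0))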


Background Knowledge 2

[Training Neural Network]
      │
      ▼
[Problem: Overfitting]
      │  model performs well on train set
      └─ poor generalization on unseen data
      ▼
[Regularization Strategies]
      │
      ├─ L1 Regularization → add |w| penalty
      │     encourages sparsity, feature selection
      │
      ├─ L2 Regularization (Weight Decay)
      │     adds w² penalty, smooths weights
      │     reduces variance, stabilizes gradients
      │
      ├─ Early Stopping
      │     monitor validation loss → stop early
      │
      ├─ Data Augmentation
      │     enlarge dataset (flip, crop, color jitter)
      │     improves robustness & invariance
      │
      └─ Dropout
            randomly deactivate neurons (mask m)
            prevents co-adaptation
            during inference: scale activations by p
      ▼
[Normalization Layers]
      │
      ├─ Batch Normalization (BN)
      │     normalize activations per mini-batch
      │     μ_B, σ_B computed over batch samples
      │     then apply γ (scale) + β (shift)
      │     allows larger learning rate & faster training
      │
      ├─ Layer Normalization (LN)
      │     normalize across features, not batch
      │     used in Transformers (batch-size independent)
      │
      └─ Effect:
            stabilizes gradient flow
            reduces internal covariate shift
            improves convergence speed
      ▼
[Residual Connections]
      │
      └─ skip connection y = F(x) + x
            eases gradient propagation
            enables very deep CNNs (ResNet)
      ▼
[Combined Strategy]
      │
      ├─ Regularization (L1/L2)
      ├─ Dropout
      ├─ Batch Normalization
      └─ Data Augmentation
      ▼
[Result]
      │
      └─ High generalization, stable training,
         smoother optimization landscape,
         reduced overfitting risk

[Closed-Set Classification]
      │
      └─ assumes all test classes are known
         model outputs one of O fixed labels
      ▼
[Open-Set Problem]
      │
      ├─ real-world contains unknown categories
      ├─ standard SoftMax → overconfident wrong predictions
      └─ need to reject unseen (unknown) samples
      ▼
[Goal: Open-Set Recognition]
      │
      ├─ recognize known classes correctly
      └─ detect / reject unknown classes (OOD)
      ▼
[Two Main Paradigms]
      │
      ├─ Two-Stage OSR
      │     Stage 1: detect unknowns (OOD)
      │     Stage 2: classify known samples
      │
      └─ Integrated OSR
            single model learns known + reject class
            adds "unknown" logits or rejection threshold
      ▼
[Core Approaches]
      │
      ├─ OSDN (Open-Set Deep Network)
      │     compute Mean Activation Vector (MAV)
      │     distance D_o = ||φ - μ_o||
      │     fit EVT (Extreme Value Theory) model to tails
      │
      ├─ GHOST (Gaussian Hypothesis OSR)
      │     per-class Gaussian modeling in feature space
      │     normalize logits by (μ_o, σ_o)
      │     provides calibrated confidence
      │
      ├─ Garbage / Background Class
      │     add class y₀ for "none of the above"
      │     weighted loss: λ_τ = N / ((O+1)N_τ)
      │
      ├─ Entropic Open-Set Loss
      │     for unknowns, enforce uniform SoftMax
      │     target: t_o = 1/O for all o
      │     equalizes logits → high entropy
      │
      └─ Confidence Thresholding
            use ζ threshold on SoftMax
            accept if max(ŷ_o) > ζ, else reject
      ▼
[Training]
      │
      ├─ Known samples: one-hot targets
      ├─ Unknown samples: uniform targets
      └─ Loss combines CE + Entropic term
      ▼
[Evaluation Metrics]
      │
      ├─ CCR (Correct Classification Rate)
      │     true positives among known samples
      │
      ├─ FPR (False Positive Rate)
      │     unknowns misclassified as knowns
      │
      └─ OSCR Curve (CCR vs FPR)
            area under curve (AUOSCR) = performance
      ▼
[Modern Implementations]
      │
      ├─ ImageNet-based OSR protocols (P1–P3)
      ├─ Feature-space Gaussian models (GHOST)
      ├─ Entropic loss + background class hybrid
      └─ Evaluation by AIML UZH / WACV 2023
      ▼
[Outcome]
      │
      └─ OSR enables reliable recognition under uncertainty:
         "I know what I know, and I know what I don't."


Stage Process Mathematical Meaning Intuitive Explanation
Forward Process Add Gaussian noise to clean trajectories \((x_0 \rightarrow x_T)\). \(q(x_t \mid x_{t-1}) = \mathcal{N}(\sqrt{1 - \beta_t} \, x_{t-1}, \, \beta_t I)\) Gradually "scrambles" a human driving path; this step is fixed and not learned.
Reverse Process Learn to denoise noisy trajectories \((x_T \rightarrow x_0)\) conditioned on perception \(c\). \(p_\theta(x_{t-1} \mid x_t, c) = \mathcal{N}(\mu_\theta(x_t, t, c), \Sigma_\theta)\) The model learns to "restore order from noise," reconstructing human-like trajectories that fit the scene.
Prior-Guided Learning Add an Anchored Gaussian prior for realistic initialization. \(x_T \sim \mathcal{N}(\mu_{anchor}, \sigma^2 I)\) The model doesn't predict trajectories directly; it learns to move toward the probability distribution of human driving behaviors.


Temporal Alignment Leakage

┌─────────────────────────────────────────┐
│  Temporal Downsampling Effect           │
│                                         │
│  Teacher Sequence (1500 frames)         │
│  ┌─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┐  │
│  │▓│▓│▓│▓│▓│▓│▓│▓│▓│▓│▓│▓│▓│▓│▓│▓│▓│▓│  │
│  └─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┘  │
│          ↓ 3:1 compression              │
│  Student Sequence (499 frames)          │
│  ┌─────┬─────┬─────┬─────┬─────┬─────┐  │
│  │ ▓▓▓ │ ▓▓▓ │ ▓▓▓ │ ▓▓▓ │ ▓▓▓ │ ▓▓▓ │  │
│  └─────┴─────┴─────┴─────┴─────┴─────┘  │
│     ↑                                   │
│  Information "leaks" to adjacent windows│
└─────────────────────────────────────────┘
Method Memory Usage Training Speed
Normal Training High (store all activations) Fast (no recomputation needed)
Checkpointing Low (store partial activations) Slow (extra recomputation needed)


Gradient Checkpointing

Forward Pass:
Input → [Layer1: store] → [Layer2: recompute later] → [Layer3: recompute later] → Output

Backward Pass:
Recompute Layer2 & Layer3 forward
Use recomputed activations → compute gradient
Use Layer1 activation → compute gradient
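
A minimal PyTorch sketch of the trade-off described above, using torch.utils.checkpoint.checkpoint_sequential on a toy stack of layers; the layer sizes and segment count are illustrative.

import torch
from torch import nn
from torch.utils.checkpoint import checkpoint_sequential

# 8 small blocks; with 4 segments, only segment-boundary activations are stored
model = nn.Sequential(*[nn.Sequential(nn.Linear(512, 512), nn.ReLU()) for _ in range(8)])
x = torch.randn(16, 512, requires_grad=True)

out = checkpoint_sequential(model, 4, x)   # forward: store fewer activations
out.sum().backward()                       # backward: segments are re-run forward to rebuild activations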



References


