Speech Processing x Transfer Learning - 25

Welcome,

A coding demo example for distilled ASR using xxx, with a visual interactive demo that teaches you about the latent space I / we used in the project


  • A dynamic Latent Trajectory Visualization

  • with LoRA / QLoRA Adapters


📍

  • CTC
  • Seq2Seq+Attention
  • RNN-Transducer
  • QLoRA-Based Adapters
  • KL-Distillation
  • FLEURS dataset
  • on-device ASR 📍






References





0. Some Background Knowledge

0.1 Core Evolution of Voice Models

| Year | Milestone | Model / Paper |
|------|-----------|---------------|
| 2014 | End-to-end ASR | DeepSpeech (Hannun et al.) |
| 2017 | Neural TTS | Tacotron (Wang et al.) |
| 2019 | Real-time voice synthesis | FastSpeech (Ren et al.) |
| 2020 | Self-supervised speech representation | wav2vec 2.0 (Baevski et al.) |
| 2022 | Multilingual speech models | Whisper (OpenAI, 2022) |
| 2023 | Zero-shot voice cloning | VALL-E (Microsoft, 2023) |
| 2023–2024 | Diffusion-based TTS | FastDiff (Huang et al.) |
| 2024 | Multi-modal voice models | AudioLM 2 (Borsos et al.) |



0.2 Key Technical History

| Period | Model Category | Core Principle |
|--------|----------------|----------------|
| 2014–2017 | RNN (LSTM / GRU) | Sequence modeling; LSTM / GRU solved vanishing-gradient issues |
| 2018–2020 | Transformer / Conformer | Self-attention + CNN; parallelizable computation for efficiency |
| 2019–2022 | GAN-based models | TTS; real-time audio denoising for hearing aids |
| 2021–Present | Diffusion models | Zero-shot / few-shot |
| Present | SSL / Lightweight | Self-supervised learning, compression, distillation |








📍 0.3 Label Encoding Methods in Speech Models and Distillation


  • Label encoding refers to methods that convert categorical labels into numerical representations for machine learning. These strategies have evolved to serve different training paradigms, including classification, multi-label tasks, and model compression (e.g., distillation).


Historical Context

  • One-hot encoding: Early standard in classification models.
  • Label/Ordinal/Binary encoding: Introduced for efficient encoding in decision trees and statistical models.
  • Embedding encoding: Emerged with deep learning to represent semantic relationships.
  • Soft label: Popularized in knowledge distillation (Hinton et al., 2015).


Common Label Encodings Overview


| Method | Example | Relationship Preserved | Use Case |
|--------|---------|------------------------|----------|
| One-hot | [0, 1, 0] | ❌ | CTC loss in ASR |
| Integer label | 0, 1, 2 | ⚠️ (implied order) | Tree models |
| Ordinal | 1, 2, 3 | ✅ (explicit order) | Ranked categories |
| Binary | A = 0001 | Partially | Large-ID categories |
| Embedding | [0.1, -0.2, 0.3] | ✅ (learned) | Token representation |
| Multi-hot | [1, 0, 1] | ❌ | Multi-label tasks |
| Soft label | [0.1, 0.7, 0.2] | ✅ (class similarity) | Distillation training |
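
To make these encodings concrete, here is a minimal sketch (my own illustration, not code from this project) that builds a one-hot vector, a multi-hot vector, and a temperature-softened soft label with PyTorch; the class count and temperature are arbitrary.

```python
import torch
import torch.nn.functional as F

num_classes = 3

# One-hot: a single hard class index -> [0, 1, 0]
one_hot = F.one_hot(torch.tensor(1), num_classes=num_classes).float()

# Multi-hot: several classes active at once -> [1, 0, 1]
multi_hot = torch.zeros(num_classes)
multi_hot[[0, 2]] = 1.0

# Soft label: teacher logits softened with a temperature T, as used in distillation
teacher_logits = torch.tensor([1.0, 3.0, 1.5])      # hypothetical teacher outputs
T = 2.0
soft_label = F.softmax(teacher_logits / T, dim=-1)  # sums to 1, keeps class similarity

print(one_hot, multi_hot, soft_label, sep="\n")
```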



What Domain Does Label Encoding Belong To?


  • Label encoding methods are fundamental to many stages of machine learning pipelines, from raw data preprocessing to model compression.


| Encoding | Knowledge Domain |
|----------|------------------|
| Categorical label transformation | Machine Learning |
| One-hot, Binary, Ordinal encoding | Data Preprocessing / Feature Engineering |
| Embedding encoding | Representation Learning / NLP / Speech |
| Soft label distillation | Model Compression / Knowledge Transfer |
| Token label supervision | Deep Learning (CTC, ASR, Transformer) |


  • These methods are crucial for enabling models to interpret, learn, and generalize from categorical data, especially in speech and language processing.



Historical Timeline and Motivation

| Encoding Method | Introduced | Why It Was Introduced |
|-----------------|------------|-----------------------|
| One-hot encoding | 1960s–1970s | To represent categories without implying order; widely used in early neural nets and perceptrons |
| Label / integer encoding | 1980s | Compact representation for tree models; useful in statistical and rule-based methods |
| Ordinal encoding | 1980s | Needed when categories have intrinsic order (e.g., low < medium < high) |
| Binary encoding | 1990s | To handle high-cardinality categories without exploding dimensionality (e.g., postal codes, product IDs) |
| Embedding encoding | 2013+ | Emerged with Word2Vec and deep learning to learn semantic similarity between tokens |
| Soft label (for distillation) | 2015 (Hinton et al.) | To let compact student models mimic richer knowledge from larger teachers |
| Multi-hot encoding | 2000s | Designed for multi-label classification tasks (e.g., an image with multiple objects) |


  • Most modern deep learning tasks—especially those involving transformers, adapters, or sequence models—use a combination of one-hot, embedding, and soft labels depending on the training phase.





1. Some Sample Models from Industry

1.1 - Self-supervised



1.2 - Zero-shot



1.3 - Diffusion-based



1.4 - Neural Audio Codec



**1.5 - Multi-modal** - will be discussed in the future



2. Model Training

2.1 Pre-training with text



  • Dataset in use
    • LibriSpeech ASR Corpus
    • Hugging Face: https://huggingface.co/datasets/librispeech_asr
    • OpenSLR: http://www.openslr.org/12




Sample Code

xx
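
A minimal sketch, assuming the Hugging Face `datasets` and `transformers` libraries, for loading a LibriSpeech split and preparing it for a wav2vec2-style model; the checkpoint name is just an example, not necessarily the one used in this project:

```python
from datasets import load_dataset
from transformers import Wav2Vec2Processor

# Load a LibriSpeech split from the Hugging Face Hub
librispeech = load_dataset("librispeech_asr", "clean", split="validation")

# The processor pairs a feature extractor (audio -> input_values) with a tokenizer (text -> labels)
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")

def prepare(batch):
    audio = batch["audio"]
    batch["input_values"] = processor(
        audio["array"], sampling_rate=audio["sampling_rate"]
    ).input_values[0]
    batch["labels"] = processor.tokenizer(batch["text"]).input_ids
    return batch

librispeech = librispeech.map(prepare, remove_columns=librispeech.column_names)
```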







📍 2.2 Some Sample Models from the Industry for Fine-tuning



  • Speech -




2.3 Post-Training

  • Pre-Train Style
    • 📍 Distillation
      • Loss function \(\mathcal{L}_{\text{total}} = \alpha \cdot \mathcal{L}_{\text{CE}}(y_s, y_{\text{true}}) + \beta \cdot \mathcal{L}_{\text{KL}}(y_s^{(T)}, y_t^{(T)})\)

      Where:

      • $y_{\text{true}}$ : Ground-truth hard label
      • $y_s$ : Student output logits
      • $y_s^{(T)}$ : Student soft prediction with temperature $T$
      • $y_t^{(T)}$ : Teacher soft prediction with temperature $T$
      • $\alpha, \beta$ : Weight coefficients, used to balance supervised learning and distillation objectives
    • SSL
    • demo 1

    • Why Distillation
      • To reduce Model Size + Inference Latency
      • Issues may occur - Information Loss + Capacity Bottleneck (can be solved); a loss-function sketch follows below
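
A minimal sketch of the loss above, assuming PyTorch and a generic teacher / student logit pair; the temperature and weights are illustrative, not the project's actual settings:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets,
                      T: float = 2.0, alpha: float = 0.5, beta: float = 0.5):
    """L_total = alpha * CE(student, hard labels) + beta * KL(student_T, teacher_T)."""
    # Supervised term on the ground-truth hard labels
    ce = F.cross_entropy(student_logits, targets)

    # Distillation term on temperature-softened distributions;
    # T**2 rescales gradients as in Hinton et al. (2015)
    kl = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T ** 2)

    return alpha * ce + beta * kl

# Usage with dummy logits for a 4-class problem
student = torch.randn(8, 4, requires_grad=True)
teacher = torch.randn(8, 4)
labels = torch.randint(0, 4, (8,))
loss = distillation_loss(student, teacher, labels)
loss.backward()
```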




  • Supervised-Fine-Tuning Style
    • 📍 Adapter - lightweight
    • Prompt-tuning
    • demo 2

    • Why Adapter
      • To Fine-tune Efficiently by training Only A Small Set of Parameters; a LoRA sketch follows below
      • Issues may occur - Task Mismatch with the Pre-trained Model + Output Alignment (enhancement tasks require frame-level supervision, but wav2vec2 is trained for sequence-level token prediction)
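
A minimal LoRA adapter sketch, assuming the Hugging Face `peft` and `transformers` libraries; the target module names match wav2vec2's attention projections, but the rank, dropout, and checkpoint are arbitrary choices rather than this project's configuration:

```python
from transformers import Wav2Vec2ForCTC
from peft import LoraConfig, get_peft_model

# Pre-trained backbone; only the injected low-rank adapters will be trained
base = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

lora_cfg = LoraConfig(
    r=8,                                  # low-rank dimension
    lora_alpha=16,                        # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections in wav2vec2
)

model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()        # typically a small fraction of the full model
```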




  • Reinforcement-Learning Style
    • RLHF




  • Others - Generative Enhancement Style
    • DNN-GAN for Speech Denoising






2.4 Some Parameters and Optimizers





3. Possible Improvements to the Foundation Models / During Fine-Tuning

3.1 Catastrophic Forgetting

Possible Solutions 🪨

3.2 Task-Targeted Post-Training Can Degrade the Model's Performance on Other Tasks - e.g., Safety Alignment

Possible Solutions 🪨





4. Recent Technical Advances - pay attention to the 📍 ones



5. Products for Disabled People / Hearing Aid Enhancement

5.1 Key References


5.2 Aspects

  • Model Adaptability

  • Computational Efficiency

  • Customization / Personalization

5.3 Products




6. Some Startups

  • AudioShake
    • Key Tech
      • Stem Separation - CNN / RNN
      • Supervised Fine-Tuning - Utilizes labeled datasets to train models for separating different audio stems (e.g., vocals, drums)
      • Transfer Learning - Leverages pre-trained models on large audio datasets, adapting them to specific stem separation tasks
    • Markets
      • Music production, film and television post-production, podcast editing, game audio processing, user-generated content - UGC
  • ElevenLabs

    • Key Tech
      • TTS, Voice Cloning, Voice Conversion, STT - DNN
      • Supervised Fine-Tuning - Trains models on paired text and speech data to generate natural-sounding speech
      • Voice Cloning - Adapts models to replicate specific voices using limited voice samples
      • Multilingual Fine-Tuning - Extends models to support multiple languages by fine-tuning on diverse linguistic datasets.
    • Markets
      • Audiobooks, podcast production, game dubbing, virtual assistants, educational content, film and television dubbing
    • Python SDK
  • LiveKit

    • Key Tech
      • Real-Time Communication Platform, Voice AI Agent Framework, Edge Infrastructure
      • Transformer / DNN / VAD
      • Supervised Fine-Tuning
      • Transfer Learning
    • Markets
      • Live video conferencing, voice chat, virtual events, online education, customer support
    • livekit
  • RealAvatar.ai

    • Key Tech
      • Multimodal AI Interaction, AI Avatars - DNN / Transformer
      • Supervised Fine-Tuning
      • Transfer Learning
    • Markets
      • Education and training, customer service, virtual assistant, online consultation, content creation




7. Some Terms and their Nature

Additive Attention

Additive Attention computes the attention scores using a small feed-forward network over the query and each key:

\[e_i = \mathbf{v}_a^{T} \tanh(\mathbf{W}_1 \mathbf{q} + \mathbf{W}_2 \mathbf{k}_i)\]
\[\alpha_i = \frac{\exp(e_i)}{\sum_j \exp(e_j)}\]
\[\mathbf{c} = \sum_i \alpha_i \mathbf{v}_i\]

where \(\mathbf{v}_a\) is a learned scoring vector and \(\mathbf{v}_i\) are the value vectors.

Dot-Product Attention

Dot-Product Attention calculates the attention scores by taking the dot product of the query and key vectors:

\[e_i = \mathbf{q}^{T} \mathbf{k}_i\]
\[\alpha_i = \frac{\exp(e_i)}{\sum_j \exp(e_j)}\]
\[\mathbf{c} = \sum_i \alpha_i \mathbf{v}_i\]

Scaled Dot-Product Attention

To mitigate the issue of large dot-product values in high-dimensional spaces, Scaled Dot-Product Attention divides the dot products by \(\sqrt{d_k}\):

\[\text{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{softmax}\left( \frac{\mathbf{Q} \mathbf{K}^T}{\sqrt{d_k}} \right) \mathbf{V}\]
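
A minimal re-implementation of scaled dot-product attention in PyTorch, written only to illustrate the formula above:

```python
import math
import torch

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)       # (..., len_q, len_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, float("-inf"))
    weights = torch.softmax(scores, dim=-1)                  # distribution over keys
    return weights @ V, weights

# Usage: batch of 2 sequences, 5 tokens, d_k = 16
Q = K = V = torch.randn(2, 5, 16)
context, attn = scaled_dot_product_attention(Q, K, V)
print(context.shape, attn.shape)  # torch.Size([2, 5, 16]) torch.Size([2, 5, 5])
```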



  • Attention Layer - Parameterized by a simple feed-forward network

  • Decoder - RNN with input from previous state + dynamic context vector

  • Tensor2Tensor Notebook

  • Self-Attention - Where Each Token Attends to All other tokens in the same Sequence

  • Multi-head Self-attention - Runs multiple self-attention mechanisms in Parallel to capture different relationships

  • Activation Functions - Non-linear functions after Neural Layers

    • Softmax:
      \(\alpha_i = \frac{\exp(e_i)}{\sum_j \exp(e_j)}\)
      Used to normalize attention scores into a probability distribution over keys

    • Tanh:
      \(\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}\)
      Maps input to the range \([-1, 1]\); commonly used in RNNs and attention scoring

    • ReLU - Rectified Linear Unit:
      \(\text{ReLU}(x) = \max(0, x)\)
      Introduces sparsity and alleviates the vanishing gradient problem

    • GELU - Gaussian Error Linear Unit:
      \(\text{GELU}(x) = x \cdot \Phi(x)\)
      where \(\Phi(x) = \frac{1}{2} \left[ 1 + \text{erf} \left( \frac{x}{\sqrt{2}} \right) \right]\) is the standard Gaussian cumulative distribution function (CDF). GELU is smoother than ReLU and is widely used in Transformers


  • Why need Positional Encodings - To give the model a sense of token order, since Transformers have no recurrence or convolution (a sinusoidal sketch follows after this list)

  • Why Adding Residual Connections - To Ease Gradient Flow and Improve Training Stability in Deep Networks

  • Why Need Normalizations - see the expanded 📍 section below
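
As mentioned above, here is a minimal sinusoidal positional-encoding sketch in PyTorch (the standard Transformer formulation, written out for illustration):

```python
import math
import torch

def sinusoidal_positional_encoding(max_len: int, d_model: int) -> torch.Tensor:
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(pos / 10000^(2i/d))."""
    position = torch.arange(max_len).unsqueeze(1).float()                # (max_len, 1)
    div_term = torch.exp(torch.arange(0, d_model, 2).float()
                         * (-math.log(10000.0) / d_model))               # (d_model/2,)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe

# Added to token (or frame) embeddings so the model can distinguish positions
pe = sinusoidal_positional_encoding(max_len=100, d_model=64)
print(pe.shape)  # torch.Size([100, 64])
```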



📍 Why Need Normalizations

  • Stabilize and accelerate training by controlling the distribution of activations
  • Improve generalization across tasks
  • Handle scale variance across features, samples, and batches
  • Enable training with higher learning rates without divergence



Techniques

  • Layer Normalization
    • Normalizes across all features within each token (sample-wise)
    • Used in: Transformers, Speech Models (e.g., wav2vec2, ASR)
  • Batch Normalization
    • Normalizes each feature across the batch
    • Used in: CNNs, MLPs, Image Classification (ResNet, VGG)
  • Instance Normalization
    • Normalizes each sample and channel separately
    • Used in: Style Transfer, Image Generation
  • Group Normalization
    • Normalizes within groups of channels
    • Used in: Vision tasks with small batches (e.g., segmentation, GANs)
  • RMSNorm
    • Root-mean-square-only scaling (no mean subtraction)
    • Used in: Lightweight Transformers, TinyLMs
  • Weight Normalization
    • Normalizes weight vectors instead of activations
    • Used in: Reinforcement Learning, Sparse Models
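
To make the contrast concrete, here is a minimal sketch comparing `torch.nn.LayerNorm` with a hand-written RMSNorm; the RMSNorm class is my own simplified illustration of root-mean-square-only scaling, not a library module:

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Scales by the root mean square of the features; no mean subtraction, no bias."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * x / rms

x = torch.randn(4, 50, 256)        # (batch, frames, features), e.g. speech encoder states
layer_norm = nn.LayerNorm(256)     # subtracts the per-token mean, divides by std
rms_norm = RMSNorm(256)            # divides by RMS only (cheaper, used in lightweight Transformers)
print(layer_norm(x).shape, rms_norm(x).shape)
```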




📍 Why Need Regularization

  • Prevent overfitting to training data
  • Improve robustness and generalization
  • Avoid co-adaptation of neurons
  • Stabilize weight growth and gradient flow



Techniques

  • L1 Regularization (Lasso)
    • Encourages sparsity, useful for feature selection
  • L2 Regularization (Ridge)
    • Penalizes large weights, discourages complexity
  • Elastic Net
    • Combines L1 + L2 for balanced sparsity + smoothness
  • Dropout
    • Randomly removes neurons during training to prevent co-adaptation
  • DropConnect
    • Randomly removes connections (weights), adds structural noise
  • Stochastic Depth
    • Randomly skips entire layers, improves ensemble-like diversity
  • Early Stopping
    • Halts training when validation loss stops improving
  • Weight Decay
    • Applies L2 penalty during optimizer update (e.g., AdamW)
  • Label Smoothing
    • Softens targets to avoid overconfidence in classification
  • Data Augmentation
    • Expands training data via noise, rotation, cropping, etc.
  • Mixup / CutMix
    • Mix inputs and/or regions from multiple samples for better decision boundaries
  • Noise Injection
    • Adds Gaussian noise to inputs or gradients for robustness
  • Max-Norm Constraint
    • Limits the norm of weights for regularized learning
  • Gradient Clipping
    • Prevents exploding gradients, especially in RNNs
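
Several of these techniques usually appear together in one training setup. Here is a minimal sketch combining dropout, weight decay (AdamW), label smoothing, and gradient clipping; all hyperparameters are illustrative only:

```python
import torch
import torch.nn as nn

model = nn.Sequential(              # toy classifier standing in for a real speech model
    nn.Linear(128, 256),
    nn.ReLU(),
    nn.Dropout(p=0.1),              # dropout: randomly zero activations during training
    nn.Linear(256, 10),
)

criterion = nn.CrossEntropyLoss(label_smoothing=0.1)              # label smoothing
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3,
                              weight_decay=0.01)                  # decoupled weight decay (L2)

x, y = torch.randn(32, 128), torch.randint(0, 10, (32,))
loss = criterion(model(x), y)
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # gradient clipping
optimizer.step()
```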





References





