Speech Processing x Transfer Learning - 25
Welcome,
A coding demo example for distilled ASR using xxx, with a visual interactive demo that walks you through the latent space I / we used in the project
- A dynamic Latent Trajectory Visualization with LoRA / QLoRA Adapters
📍
- CTC
- Seq2Seq+Attention
- RNN-Transducer
- QLoRA-Based Adapters
- KL-Distillation
- FLEURS - DataSet
- on-device ASR 📍
- 1989 - Coalescing Random Walks and Voter Model Consensus Times on the Torus in Z^d
- 2019 - ASR - Patient Knowledge Distillation for LSTM-Based Acoustic Models
References
0. Some Background Knowledge
0.1 Core Evolution of Voice Models
Year | Milestone | Model / Paper |
---|---|---|
2014 | End-to-end ASR | DeepSpeech (Hannun et al.) |
2017 | Neural TTS | Tacotron (Wang et al.) |
2019 | Real-time voice synthesis | FastSpeech (Ren et al.) |
2020 | Self-supervised | wav2vec 2.0 (Baevski et al.) |
2022 | Multilingual speech models | Whisper (OpenAI, 2022) |
2023 | Zero-shot voice cloning | VALL-E (Microsoft, 2023) |
2023–2024 | Diffusion-based TTS | FastDiff (Huang et al.) |
2024 | Multi-modal voice models | AudioLM 2 (Borsos et al.) |
0.2 Key Technical History
Period | Model Category | Core Principle |
---|---|---|
2014–2017 | RNN - LSTM / GRU | Sequence modeling; LSTM / GRU mitigated the vanishing gradient problem |
2018–2020 | Transformer / Conformer | Self-attention + CNN; parallelizable computation for efficiency |
2019–2022 | GAN-based models | TTS, real-time audio denoising for hearing aids |
2021–Present | Diffusion models | Zero-shot / few-shot generation |
Present | SSL / Lightweight | Self-supervised learning, compression, distillation |
- RNN - 1990 Finding structure in time
- LSTM - 1997 Long Short-Term Memory
- GRU - 2014 Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation
- Transformer - 2017 Attention Is All You Need
- BERT - 2018 BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding - Masked Language Modeling (MLM)
- Conformer - 2020 Conformer: Convolution-augmented Transformer for Speech Recognition
- GAN - 2014 Generative Adversarial Networks
- Diffusion Models
- SSL
📍 0.3 Label Encoding Methods in Speech Models and Distillation
- Label encoding refers to methods that convert categorical labels into numerical representations for machine learning. These strategies have evolved to serve different training paradigms, including classification, multi-label tasks, and model compression (e.g., distillation).
Historical Context
- One-hot encoding: Early standard in classification models.
- Label/Ordinal/Binary encoding: Introduced for efficient encoding in decision trees and statistical models.
- Embedding encoding: Emerged with deep learning to represent semantic relationships.
- Soft label: Popularized in knowledge distillation (Hinton et al., 2015).
Common Label Encodings Overview
Method | Example | Relationship Preserved | Use Case |
---|---|---|---|
One-hot | [0, 1, 0] | ❌ | CTC loss in ASR |
Integer Label | 0, 1, 2 | ⚠️ (Implied order) | Tree models |
Ordinal | 1, 2, 3 | ✅ | Ranked categories |
Binary | A=0001 | Partially | High-cardinality categories |
Embedding | [0.1, -0.2, 0.3] | ✅ (Learned) | Token representation |
Multi-hot | [1, 0, 1] | ❌ | Multi-label tasks |
Soft label | [0.1, 0.7, 0.2] | ✅ | Distillation training |
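As a quick illustration, here is a minimal PyTorch sketch (names and values are only illustrative) that builds a one-hot vector, a multi-hot vector, and a temperature-softened soft label of the kind used for distillation:

```python
import torch
import torch.nn.functional as F

num_classes = 3

# One-hot: a single class index becomes a 0/1 vector (no order implied)
one_hot = F.one_hot(torch.tensor(1), num_classes)       # tensor([0, 1, 0])

# Multi-hot: several classes active at once (multi-label tasks)
multi_hot = torch.zeros(num_classes)
multi_hot[[0, 2]] = 1.0                                  # tensor([1., 0., 1.])

# Soft label: a teacher's logits softened with temperature T (distillation)
teacher_logits = torch.tensor([1.0, 3.0, 0.5])
T = 2.0
soft_label = F.softmax(teacher_logits / T, dim=-1)       # ≈ [0.22, 0.60, 0.17]
```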
What Domain Does Label Encoding Belong To?
- Label encoding methods are fundamental to many stages of machine learning pipelines, from raw data preprocessing to model compression.
Encoding Knowledge | Domain |
---|---|
Categorical label transformation | Machine Learning |
One-hot, Binary, Ordinal encoding | Data Preprocessing / Feature Engineering |
Embedding encoding | Representation Learning / NLP / Speech |
Soft label distillation | Model Compression / Knowledge Transfer |
Token label supervision | Deep Learning (CTC, ASR, Transformer) |
- These methods are crucial for enabling models to interpret, learn, and generalize from categorical data, especially in speech and language processing.
Historical Timeline and Motivation
Encoding Method | Introduced | Why It Was Introduced |
---|---|---|
One-hot Encoding | 1960s–1970s | To represent categories without implying order; widely used in early neural nets and perceptrons |
Label / Integer Encoding | 1980s | Compact representation for tree models; useful in statistical and rule-based methods |
Ordinal Encoding | 1980s | Needed when categories have intrinsic order (e.g., low < medium < high) |
Binary Encoding | 1990s | To handle high-cardinality categories without exploding dimensionality (e.g., postal codes, product IDs) |
Embedding Encoding | 2013+ | Emerged with Word2Vec and deep learning to learn semantic similarity between tokens |
Soft Label (for Distillation) | 2015 (Hinton et al.) | To enable compact student models to mimic richer knowledge from larger teachers |
Multi-hot Encoding | 2000s | Designed for multi-label classification tasks (e.g., image with multiple objects) |
- Most modern deep learning tasks—especially those involving transformers, adapters, or sequence models—use a combination of one-hot, embedding, and soft labels depending on the training phase.
1. Some Sample Models from Industry
1.1 - Self-supervised
1.2 - Zero-shot
1.3 - Diffusion-based
1.4 - Neural Audio Codec
**1.5 - Multi-modal** - will discuss in the future
2. Model Training
2.1 Pre-training with text
- Spirit LM: Interleaved Spoken and Written Language Model
- OpenAI - Navigating the Challenges and Opportunities of Synthetic Voices
- Toward Joint Language Modeling for Speech Units and Text
- 📍 Model in use
- wav2vec 2.0
- Structure - CNN encoder + Transformer
- Original Task - CTC-based ASR - Automatic Speech Recognition
- base model
- Speech Recognition Pre-Training - Sample codes
- Some discussion
- DataSet in use
- LibriSpeech ASR Corpus
- Hugging Face: https://huggingface.co/datasets/librispeech_asr
- OpenSLR: http://www.openslr.org/12
Sample Code
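A minimal inference sketch (not the project's original sample code), assuming the Hugging Face `transformers` and `datasets` libraries and the public `facebook/wav2vec2-base-960h` CTC checkpoint fine-tuned on LibriSpeech:

```python
import torch
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")
model.eval()

# One validation utterance from LibriSpeech (16 kHz mono audio)
ds = load_dataset("librispeech_asr", "clean", split="validation", streaming=True)
sample = next(iter(ds))

inputs = processor(sample["audio"]["array"], sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    logits = model(inputs.input_values).logits           # (batch, time, vocab)

# Greedy CTC decoding: argmax per frame, then collapse repeats and blanks
pred_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(pred_ids)[0])
```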
📍 2.2 Some Sample Models from the Industry for Fine-tuning
- Language - Tülu 3 - by Ai2 - 2024
- Tech Report
- OLMo 2 - Language models
- Open Source
- Speech -
2.3 Post-Training
- Pre-Train Style
- 📍 Distillation
- Loss function \(\mathcal{L}_{\text{total}} = \alpha \cdot \mathcal{L}_{\text{CE}}(y_s, y_{\text{true}}) + \beta \cdot \mathcal{L}_{\text{KL}}(y_s^{(T)}, y_t^{(T)})\) (a PyTorch sketch follows the definitions below)
Where:
- $y_{\text{true}}$ : Ground-truth hard label
- $y_s$ : Student output logits
- $y_s^{(T)}$ : Student soft prediction with temperature $T$
- $y_t^{(T)}$ : Teacher soft prediction with temperature $T$
- $\alpha, \beta$ : Weight coefficients, used to balance supervised learning and distillation objectives
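A minimal PyTorch sketch of this loss; `alpha`, `beta`, and `T` below are illustrative placeholders, not the values used in the project:

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets,
                      alpha=0.5, beta=0.5, T=2.0):
    # Hard-label cross-entropy against the ground-truth labels
    ce = F.cross_entropy(student_logits, targets)
    # KL divergence between temperature-softened student and teacher distributions;
    # the T*T factor keeps gradient magnitudes comparable across temperatures
    kl = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    return alpha * ce + beta * kl
```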
- SSL
- demo 1
- Why Distillation
- To reduce Model Size + Inference Latency
- Issues may occur - Information Loss + Capacity Bottleneck (can be solved)
- Supervised-Fine-Tuning Style
- 📍 Adapter - lightweight
- Prompt-tuning
- demo 2
- Why Adapter
- To fine-tune efficiently by training only a small set of parameters (a minimal LoRA sketch follows this list)
- Issues may occur - task mismatch with the pre-trained model + output alignment (enhancement tasks require frame-level supervision, but wav2vec 2.0 is trained for sequence-level token prediction)
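A minimal LoRA adapter sketch in PyTorch (illustrative only; real LoRA / QLoRA setups would typically use a library such as `peft`): it freezes a pre-trained linear layer and learns only a low-rank update.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pre-trained linear layer plus a trainable low-rank update."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                       # freeze pre-trained weights
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling

layer = LoRALinear(nn.Linear(768, 768), r=8)
out = layer(torch.randn(4, 768))                          # only lora_A / lora_B receive gradients
```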
- Reinforcement-Learning Style
- RLHF
- Others - Generative Enhancement Style
- DNN-GAN for Speech Denoising
2.4 Some Parameters and Optimizers
3. Possible Improvements to the Foundation Models / During Fine-Tuning
3.1 Catastrophic Forgetting
- 2024 Scaling Laws for Forgetting When Fine-Tuning Large Language Models
- 2023 An Empirical Study of Catastrophic Forgetting in Large Language Models During Continual Fine-tuning
Possible Solutions 🪨
- 2024 LoRA Learns Less and Forgets Less
- 2018 The Natural Language Decathlon: Multitask Learning as Question Answering
3.2 Task-Targeted Post-Training Can Degrade the Model's Performance on Other Tasks (e.g., Safety Alignment)
- Supervised-Fine-Tuning Style Post-training - 2024 Self-Distillation Bridges Distribution Gap in Language Model Fine-Tuning
- 2023 Safety-Tuned LLaMAs: Lessons From Improving the Safety of Large Language Models that Follow Instructions
Possible Solutions 🪨
4. Recent Technical Advances - pay attention to the 📍 ones
- 2025 – Efficient Distillation of Classifier-Free Guidance using Adapters
- 2025 – Neuralink gets FDA nod for a chip that will let speech-impaired people speak; human trials soon. This includes people affected by ALS, stroke, spinal cord injury, cerebral palsy, multiple sclerosis, and other neurological conditions.
- 2024 – RAST: A Reference-Audio Synchronization Tool for Dubbed Content
- 2023 – Speak, Read and Prompt: High-Fidelity Text-to-Speech with Minimal Supervision. Proposes a novel approach to text-to-speech synthesis using minimal supervision while maintaining high fidelity, making TTS systems more accessible for low-resource settings.
- 2023 – Voicebox: Versatile Generative Speech AI (Meta). A generative model capable of text-to-speech, style transfer, noise removal, and speech editing using just 2 seconds of input audio.
- 2023 – VALL-E: Zero-Shot Text-to-Speech via Neural Codec Language Modeling. Achieves personalized speech synthesis from a 3-second voice sample, preserving emotion and acoustic context in zero-shot TTS tasks.
- 2023 – Apple Personal Voice & Live Speech. Allows users to generate a personal synthetic voice using only 15 minutes of audio, aiding those at risk of speech loss due to ALS or other conditions.
- 2023 – Meta Massively Multilingual Speech (MMS). Open-source speech-to-text and text-to-speech models for 1,100+ languages, massively expanding multilingual accessibility in speech AI.
- 📍 2022 – Whisper: Multilingual ASR via Large-Scale Weak Supervision. A general-purpose speech recognition system trained on 680,000 hours of audio, robust across accents, background noise, and multiple languages.
- 2022 – OPRA-RS: A Hearing-Aid Fitting Method Based on Automatic Speech Recognition and Random Search
- 📍 2021 – RNN Based Incremental Online Spoken Language Understanding
- 2021 – Apple On-Device Speech Recognition for Siri. Introduced local processing of Siri speech recognition, enhancing privacy and enabling offline voice commands.
- 📍 2020 – wav2vec 2.0: Self-Supervised Learning of Speech Representations. Demonstrated state-of-the-art ASR using very limited labeled data via self-supervised learning on large-scale unlabeled audio.
- 📍 2020 – Conformer: Convolution-augmented Transformer for ASR. Combined CNNs and Transformers for effective modeling of both local and global features in speech, improving ASR accuracy.
- 2019 – Project Euphonia (Google). Uses AI to improve ASR for users with atypical speech, such as those with ALS or other disorders, enhancing speech accessibility.
- 2016 – WaveNet: A Generative Model for Raw Audio. Introduced deep generative modeling of raw audio, setting a new bar for natural-sounding speech synthesis.
- 2015 – Deep Speech 2: End-to-End Speech Recognition in English and Mandarin. Demonstrated that deep learning can perform ASR across languages and noisy conditions without hand-engineered features.
5. Products for Disabled People / Hearing Aid Enhancement
5.1 Key References
5.2 Aspects
- Model Adaptability
- Computational Efficiency
- Customization / Personalization
5.3 Products
- 2024 Hearing Tracker - Hearing Aids with Artificial Intelligence (AI): Review of Features, Capabilities and Models that Use AI and Machine Learning
- 2023 DNN - Restoring speech intelligibility for hearing aid users with deep learning
6. Some Startups
- AudioShake
- Key Tech
- Stem Separation - CNN / RNN
- Supervised Fine-Tuning - Utilizes labeled datasets to train models for separating different audio stems (e.g., vocals, drums)
- Transfer Learning - Leverages pre-trained models on large audio datasets, adapting them to specific stem separation tasks
- Markets
- Music production, film and television post-production, podcast editing, game audio processing, user-generated content - UGC
- Key Tech
- TTS, Voice Cloning, Voice Conversion, STT - DNN
- Supervised Fine-Tuning - Trains models on paired text and speech data to generate natural-sounding speech
- Voice Cloning - Adapts models to replicate specific voices using limited voice samples
- Multilingual Fine-Tuning - Extends models to support multiple languages by fine-tuning on diverse linguistic datasets.
- Markets
- Audiobooks, podcast production, game dubbing, virtual assistants, educational content, film and television dubbing
- Python SDK
- Key Tech
- Real-Time Communication Platform, Voice AI Agent Framework, Edge Infrastructure
- Transformer / DNN / VAD
- Supervised Fine-Tuning
- Transfer Learning
- Markets
- Live video conferencing, voice chat, virtual events, online education, customer support
- livekit
- Key Tech
- Multimodal AI Interaction, AI Avatars - DNN / Transformer
- Supervised Fine-Tuning
- Transfer Learning
- Markets
- Education and training, customer service, virtual assistant, online consultation, content creation
7. Some Terms and their Nature
- Attention - Vector of importance weights
- Encoder - Bidirectional RNN
- Activation Recomputation / Gradient Checkpointing - Memory-saving technique - save selected ("important") activations during the forward pass and recompute the rest during the backward pass, trading extra computation for memory - typically saves 30–70% of GPU memory, depending on model depth and recompute granularity (a small sketch follows)
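A tiny PyTorch sketch of the idea (module and shapes are illustrative):

```python
import torch
from torch.utils.checkpoint import checkpoint

block = torch.nn.Sequential(
    torch.nn.Linear(512, 512), torch.nn.GELU(), torch.nn.Linear(512, 512)
)
x = torch.randn(4, 512, requires_grad=True)

# Activations inside `block` are not stored; they are recomputed during backward
y = checkpoint(block, x, use_reentrant=False)
y.sum().backward()
```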
- Additive Attention - 2014 Neural Machine Translation by Jointly Learning to Align and Translate
- Dot-Product Attention - 2017 Attention Is All You Need
Additive Attention
Additive Attention computes the attention scores using a feed-forward neural network:
\[e_i = \mathbf{v}^T \tanh(\mathbf{W}_1 \mathbf{q} + \mathbf{W}_2 \mathbf{k}_i)\]
\[\alpha_i = \frac{\exp(e_i)}{\sum_j \exp(e_j)}\]
\[\mathbf{c} = \sum_i \alpha_i \mathbf{v}_i\]
Dot-Product Attention
Dot-Product Attention calculates the attention scores by taking the dot product of the query and key vectors:
\[e_i = \mathbf{q}^T \mathbf{k}_i\]
\[\alpha_i = \frac{\exp(e_i)}{\sum_j \exp(e_j)}\]
\[\mathbf{c} = \sum_i \alpha_i \mathbf{v}_i\]
Scaled Dot-Product Attention
To mitigate the issue of large dot-product values in high-dimensional spaces, Scaled Dot-Product Attention scales the dot products:
\[\text{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{softmax}\left( \frac{\mathbf{Q} \mathbf{K}^T}{\sqrt{d_k}} \right) \mathbf{V}\]
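A minimal PyTorch sketch of scaled dot-product attention (shapes and names are illustrative):

```python
import math
import torch

def scaled_dot_product_attention(q, k, v):
    # q, k: (batch, seq_len, d_k); v: (batch, seq_len, d_v)
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)    # (batch, seq_len, seq_len)
    weights = torch.softmax(scores, dim=-1)              # attention distribution per query
    return weights @ v                                    # context vectors

q = k = v = torch.randn(2, 5, 16)
out = scaled_dot_product_attention(q, k, v)              # (2, 5, 16)
```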
- Attention Layer - Parameterized by a simple feed-forward network
- Decoder - RNN with input from previous state + dynamic context vector
- Self-Attention - Each token attends to all other tokens in the same sequence
- Multi-head Self-Attention - Runs multiple self-attention mechanisms in parallel to capture different relationships
- Activation Functions - Non-linear functions applied after neural layers
- Softmax:
\(\alpha_i = \frac{\exp(e_i)}{\sum_j \exp(e_j)}\)
Used to normalize attention scores into a probability distribution over keys
- Tanh:
\(\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}\)
Maps input to the range \([-1, 1]\), commonly used in RNNs and attention scoring
- ReLU - Rectified Linear Unit:
\(\text{ReLU}(x) = \max(0, x)\)
Introduces sparsity and alleviates the vanishing gradient problem
- GELU - Gaussian Error Linear Unit:
\(\text{GELU}(x) = x \cdot \Phi(x)\)
where \(\Phi(x) = \frac{1}{2} \left[ 1 + \text{erf} \left( \frac{x}{\sqrt{2}} \right) \right]\) is the standard Gaussian cumulative distribution function (CDF). GELU is smoother than ReLU and is widely used in Transformers.
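A quick numeric check of these activations in PyTorch (values are illustrative):

```python
import torch
import torch.nn.functional as F

x = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0])
print(torch.softmax(x, dim=0))   # non-negative, sums to 1
print(torch.tanh(x))             # values in [-1, 1]
print(F.relu(x))                 # negatives clipped to 0
print(F.gelu(x))                 # smooth; small negative outputs for negative inputs
```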
- Why need Positional Encodings - To give the model a sense of token order, since Transformers have no recurrence or convolution (a small sinusoidal sketch follows)
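A minimal sketch of the classic sinusoidal positional encoding (dimensions are illustrative):

```python
import torch

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> torch.Tensor:
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)     # (seq_len, 1)
    i = torch.arange(0, d_model, 2, dtype=torch.float32)              # even dimensions
    angle = pos / torch.pow(torch.tensor(10000.0), i / d_model)       # (seq_len, d_model/2)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(angle)                                     # sin on even dims
    pe[:, 1::2] = torch.cos(angle)                                     # cos on odd dims
    return pe

pe = sinusoidal_positional_encoding(seq_len=50, d_model=64)            # (50, 64)
```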
- Why Add Residual Connections - To ease gradient flow and improve training stability in deep networks
📍 Why Need Normalizations
- Stabilize and accelerate training by controlling the distribution of activations
- Improve generalization across tasks
- Handle scale variance across features, samples, and batches
- Enable training with higher learning rates without divergence
Techniques (a short LayerNorm vs. BatchNorm sketch follows this list)
- Layer Normalization
- Normalizes across all features within each token (sample-wise)
- Used in: Transformers, Speech Models (e.g., wav2vec2, ASR)
- Batch Normalization
- Normalizes each feature across the batch
- Used in: CNNs, MLPs, Image Classification (ResNet, VGG)
- Instance Normalization
- Normalizes each sample and channel separately
- Used in: Style Transfer, Image Generation
- Group Normalization
- Normalizes within groups of channels
- Used in: Vision tasks with small batches (e.g., segmentation, GANs)
- RMSNorm
- Root-mean-square-only scaling (no mean subtraction)
- Used in: Lightweight Transformers, TinyLMs
- Weight Normalization
- Normalizes weight vectors instead of activations
- Used in: Reinforcement Learning, Sparse Models
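A small PyTorch sketch contrasting the two most common choices above; the (batch, time, features) shape mimics a speech feature sequence:

```python
import torch
import torch.nn as nn

x = torch.randn(8, 100, 256)                              # (batch, time, features)

layer_norm = nn.LayerNorm(256)                            # per-token, over the 256 features
batch_norm = nn.BatchNorm1d(256)                          # per-feature, over batch * time

y_ln = layer_norm(x)                                      # (8, 100, 256)
y_bn = batch_norm(x.transpose(1, 2)).transpose(1, 2)      # BatchNorm1d expects (N, C, L)
```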
📍 Why Need Regularization
- Prevent overfitting to training data
- Improve robustness and generalization
- Avoid co-adaptation of neurons
- Stabilize weight growth and gradient flow
Techniques (a short training-step sketch using several of these follows the list)
- L1 Regularization (Lasso)
- Encourages sparsity, useful for feature selection
- L2 Regularization (Ridge)
- Penalizes large weights, discourages complexity
- Elastic Net
- Combines L1 + L2 for balanced sparsity + smoothness
- Dropout
- Randomly removes neurons during training to prevent co-adaptation
- DropConnect
- Randomly removes connections (weights), adds structural noise
- Stochastic Depth
- Randomly skips entire layers, improves ensemble-like diversity
- Early Stopping
- Halts training when validation loss stops improving
- Weight Decay
- Applies L2 penalty during optimizer update (e.g., AdamW)
- Label Smoothing
- Softens targets to avoid overconfidence in classification
- Data Augmentation
- Expands training data via noise, rotation, cropping, etc.
- Mixup / CutMix
- Mix inputs and/or regions from multiple samples for better decision boundaries
- Noise Injection
- Adds Gaussian noise to inputs or gradients for robustness
- Max-Norm Constraint
- Limits the norm of weights for regularized learning
- Gradient Clipping
- Prevents exploding gradients, especially in RNNs
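A compact PyTorch sketch wiring several of these techniques into one training step; all hyperparameter values are illustrative:

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(128, 64), nn.ReLU(), nn.Dropout(p=0.1),     # dropout
    nn.Linear(64, 10),
)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)  # weight decay
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)                           # label smoothing

x, y = torch.randn(32, 128), torch.randint(0, 10, (32,))
loss = criterion(model(x), y)
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)               # gradient clipping
optimizer.step()
```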
References
- ICASSP - IEEE Intl. Conf. on Acoustics, Speech and Signal Processing
- TASLP - IEEE/ACM Trans. on Audio, Speech, and Language Processing