Speech Processing x Transfer Learning - 25
Welcome,
A coding demo example for distilled ASR using xxx, with a visual interactive demo that walks you through the latent space I / we used in the project
- A dynamic Latent Trajectory Visualization with LoRA / QLoRA Adapters
📍
- CTC
- Seq2Seq+Attention
- RNN-Transducer
- QLoRA-Based Adapters
- KL-Distillation
- FLEURS - DataSet
- on-device ASR 📍
- 1989 - Coalescing Random Walks and Voter Model Consensus Times on the Torus in Z^d
- 2019 - ASR - Patient Knowledge Distillation for LSTM-Based Acoustic Models
References
0. Some Background Knowledge
0.1 Core Evolution of Voice Models
Year | Milestone | Model / Paper |
---|---|---|
2014 | End-to-end ASR | DeepSpeech (Hannun et al.) |
2017 | Neural TTS | Tacotron (Wang et al.) |
2019 | Real-time voice synthesis | FastSpeech (Ren et al.) |
2020 | Self-supervised | wav2vec 2.0 (Baevski et al.) |
2022 | Multilingual speech models | Whisper (OpenAI, 2022) |
2023 | Zero-shot voice cloning | VALL-E (Microsoft, 2023) |
2023–2024 | Diffusion-based TTS | FastDiff (Huang et al.) |
2024 | Multi-modal voice models | AudioLM 2 (Borsos et al.) |
0.2 Key Technical History
Period | Model Category | Core Principle |
---|---|---|
2014–2017 | RNN - LSTM / GRU | Sequence modeling; LSTM / GRU mitigated the vanishing gradient problem |
2018–2020 | Transformer / Conformer | Self-attention + CNN; parallelizable computation for efficiency |
2019–2022 | GAN-based models | TTS, real-time audio denoising for hearing aids |
2021–Present | Diffusion models | Zero-shot / few-shot generation |
Present | SSL / Lightweight | Self-supervised learning, compression, distillation |
- RNN - 1990 Finding structure in time
- LSTM - 1997 Long Short-Term Memory
- GRU - 2014 Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation
- Transformer - 2017 Attention Is All You Need
- BERT - 2018 BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding - Masked Language Modeling (MLM)
- Conformer - 2020 Conformer: Convolution-augmented Transformer for Speech Recognition
- GAN - 2014 Generative Adversarial Networks
- Diffusion Models
- SSL
📍 0.3 Label Encoding Methods in Speech Models and Distillation
- Label encoding refers to methods that convert categorical labels into numerical representations for machine learning. These strategies have evolved to serve different training paradigms, including classification, multi-label tasks, and model compression (e.g., distillation).
Historical Context
- One-hot encoding: Early standard in classification models.
- Label/Ordinal/Binary encoding: Introduced for efficient encoding in decision trees and statistical models.
- Embedding encoding: Emerged with deep learning to represent semantic relationships.
- Soft label: Popularized in knowledge distillation (Hinton et al., 2015).
Common Label Encodings Overview
Method | Example | Relationship Preserved | Use Case |
---|---|---|---|
One-hot | [0, 1, 0] | ❌ | CTC loss in ASR |
Integer Label | 0, 1, 2 | ⚠️ (Implied order) | Tree models |
Ordinal | 1, 2, 3 | ✅ | Ranked categories |
Binary | A=0001 | Partially | High-cardinality categories |
Embedding | [0.1, -0.2, 0.3] | ✅ (Learned) | Token representation |
Multi-hot | [1, 0, 1] | ❌ | Multi-label tasks |
Soft label | [0.1, 0.7, 0.2] | ✅ | Distillation training |
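As a quick illustration, here is a minimal PyTorch sketch (names and values are only illustrative) that builds a one-hot vector, a multi-hot vector, and a temperature-softened soft label of the kind used for distillation:

```python
import torch
import torch.nn.functional as F

num_classes = 3

# One-hot: a single class index becomes a 0/1 vector (no order implied)
one_hot = F.one_hot(torch.tensor(1), num_classes)       # tensor([0, 1, 0])

# Multi-hot: several classes active at once (multi-label tasks)
multi_hot = torch.zeros(num_classes)
multi_hot[[0, 2]] = 1.0                                  # tensor([1., 0., 1.])

# Soft label: a teacher's logits softened with temperature T (distillation)
teacher_logits = torch.tensor([1.0, 3.0, 0.5])
T = 2.0
soft_label = F.softmax(teacher_logits / T, dim=-1)       # ≈ [0.22, 0.60, 0.17]
```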
What Domain Does Label Encoding Belong To?
- Label encoding methods are fundamental to many stages of machine learning pipelines, from raw data preprocessing to model compression.
Encoding Knowledge | Domain |
---|---|
Categorical label transformation | Machine Learning |
One-hot, Binary, Ordinal encoding | Data Preprocessing / Feature Engineering |
Embedding encoding | Representation Learning / NLP / Speech |
Soft label distillation | Model Compression / Knowledge Transfer |
Token label supervision | Deep Learning (CTC, ASR, Transformer) |
- These methods are crucial for enabling models to interpret, learn, and generalize from categorical data, especially in speech and language processing.
Historical Timeline and Motivation
Encoding Method | Introduced | Why It Was Introduced |
---|---|---|
One-hot Encoding | 1960s–1970s | To represent categories without implying order; widely used in early neural nets and perceptrons |
Label / Integer Encoding | 1980s | Compact representation for tree models; useful in statistical and rule-based methods |
Ordinal Encoding | 1980s | Needed when categories have intrinsic order (e.g., low < medium < high) |
Binary Encoding | 1990s | To handle high-cardinality categories without exploding dimensionality (e.g., postal codes, product IDs) |
Embedding Encoding | 2013+ | Emerged with Word2Vec and deep learning to learn semantic similarity between tokens |
Soft Label (for Distillation) | 2015 (Hinton et al.) | To enable compact student models to mimic richer knowledge from larger teachers |
Multi-hot Encoding | 2000s | Designed for multi-label classification tasks (e.g., image with multiple objects) |
- Most modern deep learning tasks—especially those involving transformers, adapters, or sequence models—use a combination of one-hot, embedding, and soft labels depending on the training phase.
1. Some Sample Models from Industry
1.1 - Self-supervised
1.2 - Zero-shot
1.3 - Diffusion-based
1.4 - Neural Audio Codec
**1.5 - Multi-modal** - will discuss in the future
2. Model Training
2.1 Pre-training with text
- Spirit LM: Interleaved Spoken and Written Language Model
- OpenAI - Navigating the Challenges and Opportunities of Synthetic Voices
- Toward Joint Language Modeling for Speech Units and Text
- 📍 Model in use
- wav2vec 2.0
- Structure - CNN encoder + Transformer
- Original Task - CTC-based ASR - Automatic Speech Recognition
- base model
- Speech Recognition Pre-Training - Sample codes
- Some discussion
- DataSet in use
- LibriSpeech ASR Corpus
- Hugging Face: https://huggingface.co/datasets/librispeech_asr
- OpenSLR: http://www.openslr.org/12
Sample Code
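A minimal inference sketch (not the project's original sample code), assuming the Hugging Face `transformers` and `datasets` libraries and the public `facebook/wav2vec2-base-960h` CTC checkpoint fine-tuned on LibriSpeech:

```python
import torch
from datasets import load_dataset
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")
model.eval()

# One validation utterance from LibriSpeech (16 kHz mono audio)
ds = load_dataset("librispeech_asr", "clean", split="validation", streaming=True)
sample = next(iter(ds))

inputs = processor(sample["audio"]["array"], sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    logits = model(inputs.input_values).logits           # (batch, time, vocab)

# Greedy CTC decoding: argmax per frame, then collapse repeats and blanks
pred_ids = torch.argmax(logits, dim=-1)
print(processor.batch_decode(pred_ids)[0])
```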
📍 2.2 Some Sample Models from the Industry for Fine-tuning
- Language - Tülu 3 - by Ai2 - 2024
- Tech Report
- OLMo 2 - Language models
- Open Source
- Speech -
2.3 Post-Training
- Pre-Train Style
- 📍 Distillation
- Loss function \(\mathcal{L}_{\text{total}} = \alpha \cdot \mathcal{L}_{\text{CE}}(y_s, y_{\text{true}}) + \beta \cdot \mathcal{L}_{\text{KL}}(y_s^{(T)}, y_t^{(T)})\) (a PyTorch sketch follows the definitions below)
Where:
- $y_{\text{true}}$ : Ground-truth hard label
- $y_s$ : Student output logits
- $y_s^{(T)}$ : Student soft prediction with temperature $T$
- $y_t^{(T)}$ : Teacher soft prediction with temperature $T$
- $\alpha, \beta$ : Weight coefficients, used to balance supervised learning and distillation objectives
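A minimal PyTorch sketch of this loss; `alpha`, `beta`, and `T` below are illustrative placeholders, not the values used in the project:

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets,
                      alpha=0.5, beta=0.5, T=2.0):
    # Hard-label cross-entropy against the ground-truth labels
    ce = F.cross_entropy(student_logits, targets)
    # KL divergence between temperature-softened student and teacher distributions;
    # the T*T factor keeps gradient magnitudes comparable across temperatures
    kl = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    return alpha * ce + beta * kl
```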
- SSL
- demo 1
- Why Distillation
- To reduce Model Size + Inference Latency
- Issues may occur - Information Loss + Capacity Bottleneck (can be solved)
- Supervised-Fine-Tuning Style
- 📍 Adapter - lightweight
- Prompt-tuning
- demo 2
- Why Adapter
- To fine-tune efficiently by training only a small set of parameters (a minimal LoRA sketch follows this list)
- Issues may occur - task mismatch with the pre-trained model + output alignment (enhancement tasks require frame-level supervision, but wav2vec 2.0 is trained for sequence-level token prediction)
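A minimal LoRA adapter sketch in PyTorch (illustrative only; real LoRA / QLoRA setups would typically use a library such as `peft`): it freezes a pre-trained linear layer and learns only a low-rank update.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pre-trained linear layer plus a trainable low-rank update."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                       # freeze pre-trained weights
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling

layer = LoRALinear(nn.Linear(768, 768), r=8)
out = layer(torch.randn(4, 768))                          # only lora_A / lora_B receive gradients
```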
- Reinforcement-Learning Style
- RLHF
- Others - Generative Enhancement Style
- DNN-GAN for Speech Denoising
2.4 Some Parameters and Optimizers
3. Possible Improvements to the Foundation Models / During Fine-Tuning
3.1 Catastrophic Forgetting
- 2024 Scaling Laws for Forgetting When Fine-Tuning Large Language Models
- 2023 An Empirical Study of Catastrophic Forgetting in Large Language Models During Continual Fine-tuning
Possible Solutions 🪨
- 2024 LoRA Learns Less and Forgets Less
- 2018 The Natural Language Decathlon: Multitask Learning as Question Answering
3.2 Task-Targeted Post-Training Can Degrade the Model's Performance on Other Tasks (e.g., Safety Alignment)
- Supervised-Fine-Tuning Style Post-training - 2024 Self-Distillation Bridges Distribution Gap in Language Model Fine-Tuning
- 2023 Safety-Tuned LLaMAs: Lessons From Improving the Safety of Large Language Models that Follow Instructions
Possible Solutions 🪨
4. Recent Technical Advances - pay attention to the 📍 ones
- 2025 – Efficient Distillation of Classifier-Free Guidance using Adapters
- 2025 – Neuralink gets FDA nod for a chip that will let speech-impaired people speak; human trials soon. This includes people affected by ALS, stroke, spinal cord injury, cerebral palsy, multiple sclerosis, and other neurological conditions.
- 2024 – RAST: A Reference-Audio Synchronization Tool for Dubbed Content
- 2023 – Speak, Read and Prompt: High-Fidelity Text-to-Speech with Minimal Supervision. Proposes a novel approach to text-to-speech synthesis using minimal supervision while maintaining high fidelity, making TTS systems more accessible for low-resource settings.
- 2023 – Voicebox: Versatile Generative Speech AI (Meta). A generative model capable of text-to-speech, style transfer, noise removal, and speech editing using just 2 seconds of input audio.
- 2023 – VALL-E: Zero-Shot Text-to-Speech via Neural Codec Language Modeling. Achieves personalized speech synthesis from a 3-second voice sample, preserving emotion and acoustic context in zero-shot TTS tasks.
- 2023 – Apple Personal Voice & Live Speech. Allows users to generate a personal synthetic voice using only 15 minutes of audio, aiding those at risk of speech loss due to ALS or other conditions.
- 2023 – Meta Massively Multilingual Speech (MMS). Open-source speech-to-text and text-to-speech models for 1,100+ languages, massively expanding multilingual accessibility in speech AI.
- 📍 2022 – Whisper: Multilingual ASR via Large-Scale Weak Supervision. A general-purpose speech recognition system trained on 680,000 hours of audio, robust across accents, background noise, and multiple languages.
- 2022 – OPRA-RS: A Hearing-Aid Fitting Method Based on Automatic Speech Recognition and Random Search
- 📍 2021 – RNN Based Incremental Online Spoken Language Understanding
- 2021 – Apple On-Device Speech Recognition for Siri. Introduced local processing of Siri speech recognition, enhancing privacy and enabling offline voice commands.
- 📍 2020 – wav2vec 2.0: Self-Supervised Learning of Speech Representations. Demonstrated state-of-the-art ASR using very limited labeled data via self-supervised learning on large-scale unlabeled audio.
- 📍 2020 – Conformer: Convolution-augmented Transformer for ASR. Combined CNNs and Transformers for effective modeling of both local and global features in speech, improving ASR accuracy.
- 2019 – Project Euphonia (Google). Uses AI to improve ASR for users with atypical speech, such as those with ALS or other disorders, enhancing speech accessibility.
- 2016 – WaveNet: A Generative Model for Raw Audio. Introduced deep generative modeling of raw audio, setting a new bar for natural-sounding speech synthesis.
- 2015 – Deep Speech 2: End-to-End Speech Recognition in English and Mandarin. Demonstrated that deep learning can perform ASR across languages and noisy conditions without hand-engineered features.
5. Products for Disabled People / Hearing Aid Enhancement
5.1 Key References
5.2 Aspects
- Model Adaptability
- Computational Efficiency
- Customization / Personalization
5.3 Products
- 2024 Hearing Tracker - Hearing Aids with Artificial Intelligence (AI): Review of Features, Capabilities and Models that Use AI and Machine Learning
- 2023 DNN - Restoring speech intelligibility for hearing aid users with deep learning
6. Some Startups
- AudioShake
- Key Tech
- Stem Separation - CNN / RNN
- Supervised Fine-Tuning - Utilizes labeled datasets to train models for separating different audio stems (e.g., vocals, drums)
- Transfer Learning - Leverages pre-trained models on large audio datasets, adapting them to specific stem separation tasks
- Markets
- Music production, film and television post-production, podcast editing, game audio processing, user-generated content - UGC
- Key Tech
- TTS, Voice Cloning, Voice Conversion, STT - DNN
- Supervised Fine-Tuning - Trains models on paired text and speech data to generate natural-sounding speech
- Voice Cloning - Adapts models to replicate specific voices using limited voice samples
- Multilingual Fine-Tuning - Extends models to support multiple languages by fine-tuning on diverse linguistic datasets.
- Markets
- Audiobooks, podcast production, game dubbing, virtual assistants, educational content, film and television dubbing
- Python SDK
- Key Tech
- Real-Time Communication Platform, Voice AI Agent Framework, Edge Infrastructure
- Transformer / DNN / VAD
- Supervised Fine-Tuning
- Transfer Learning
- Markets
- Live video conferencing, voice chat, virtual events, online education, customer support
- livekit
- Key Tech
- Multimodal AI Interaction, AI Avatars - DNN / Transformer
- Supervised Fine-Tuning
- Transfer Learning
- Markets
- Education and training, customer service, virtual assistant, online consultation, content creation
7. Some Terms and their Nature
- Attention - Vector of importance weights
- Encoder - Bidirectional RNN
- Activation Recomputation / Gradient Checkpointing - Memory-saving technique - save selected ("important") activations during the forward pass and recompute the rest during the backward pass, trading extra computation for memory - typically saves 30–70% of GPU memory, depending on model depth and recompute granularity (a small sketch follows)
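A tiny PyTorch sketch of the idea (module and shapes are illustrative):

```python
import torch
from torch.utils.checkpoint import checkpoint

block = torch.nn.Sequential(
    torch.nn.Linear(512, 512), torch.nn.GELU(), torch.nn.Linear(512, 512)
)
x = torch.randn(4, 512, requires_grad=True)

# Activations inside `block` are not stored; they are recomputed during backward
y = checkpoint(block, x, use_reentrant=False)
y.sum().backward()
```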
- Additive Attention - 2014 Neural Machine Translation by Jointly Learning to Align and Translate
- Dot-Product Attention - 2017 Attention Is All You Need
Additive Attention
Additive Attention computes the attention scores using a feed-forward neural network:
\[e_i = \mathbf{v}^T \tanh(\mathbf{W}_1 \mathbf{q} + \mathbf{W}_2 \mathbf{k}_i)\]
\[\alpha_i = \frac{\exp(e_i)}{\sum_j \exp(e_j)}\]
\[\mathbf{c} = \sum_i \alpha_i \mathbf{v}_i\]
Dot-Product Attention
Dot-Product Attention calculates the attention scores by taking the dot product of the query and key vectors:
\[e_i = \mathbf{q}^T \mathbf{k}_i\]
\[\alpha_i = \frac{\exp(e_i)}{\sum_j \exp(e_j)}\]
\[\mathbf{c} = \sum_i \alpha_i \mathbf{v}_i\]
Scaled Dot-Product Attention
To mitigate the issue of large dot-product values in high-dimensional spaces, Scaled Dot-Product Attention scales the dot products:
\[\text{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{softmax}\left( \frac{\mathbf{Q} \mathbf{K}^T}{\sqrt{d_k}} \right) \mathbf{V}\]
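A minimal PyTorch sketch of scaled dot-product attention (shapes and names are illustrative):

```python
import math
import torch

def scaled_dot_product_attention(q, k, v):
    # q, k: (batch, seq_len, d_k); v: (batch, seq_len, d_v)
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)    # (batch, seq_len, seq_len)
    weights = torch.softmax(scores, dim=-1)              # attention distribution per query
    return weights @ v                                    # context vectors

q = k = v = torch.randn(2, 5, 16)
out = scaled_dot_product_attention(q, k, v)              # (2, 5, 16)
```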
- Attention Layer - Parameterized by a simple feed-forward network
- Decoder - RNN with input from previous state + dynamic context vector
- Self-Attention - Each token attends to all other tokens in the same sequence
- Multi-head Self-Attention - Runs multiple self-attention mechanisms in parallel to capture different relationships
- Activation Functions - Non-linear functions applied after neural layers
- Softmax:
\(\alpha_i = \frac{\exp(e_i)}{\sum_j \exp(e_j)}\)
Used to normalize attention scores into a probability distribution over keys
- Tanh:
\(\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}\)
Maps input to the range \([-1, 1]\), commonly used in RNNs and attention scoring
- ReLU - Rectified Linear Unit:
\(\text{ReLU}(x) = \max(0, x)\)
Introduces sparsity and alleviates the vanishing gradient problem
- GELU - Gaussian Error Linear Unit:
\(\text{GELU}(x) = x \cdot \Phi(x)\)
where \(\Phi(x) = \frac{1}{2} \left[ 1 + \text{erf} \left( \frac{x}{\sqrt{2}} \right) \right]\) is the standard Gaussian cumulative distribution function (CDF). GELU is smoother than ReLU and is widely used in Transformers.
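A quick numeric check of these activations in PyTorch (values are illustrative):

```python
import torch
import torch.nn.functional as F

x = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0])
print(torch.softmax(x, dim=0))   # non-negative, sums to 1
print(torch.tanh(x))             # values in [-1, 1]
print(F.relu(x))                 # negatives clipped to 0
print(F.gelu(x))                 # smooth; small negative outputs for negative inputs
```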
- Why need Positional Encodings - To give the model a sense of token order, since Transformers have no recurrence or convolution (a small sinusoidal sketch follows)
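A minimal sketch of the classic sinusoidal positional encoding (dimensions are illustrative):

```python
import torch

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> torch.Tensor:
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)     # (seq_len, 1)
    i = torch.arange(0, d_model, 2, dtype=torch.float32)              # even dimensions
    angle = pos / torch.pow(torch.tensor(10000.0), i / d_model)       # (seq_len, d_model/2)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(angle)                                     # sin on even dims
    pe[:, 1::2] = torch.cos(angle)                                     # cos on odd dims
    return pe

pe = sinusoidal_positional_encoding(seq_len=50, d_model=64)            # (50, 64)
```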
- Why Add Residual Connections - To ease gradient flow and improve training stability in deep networks
📍 Why Need Normalizations
- Stabilize and accelerate training by controlling the distribution of activations
- Improve generalization across tasks
- Handle scale variance across features, samples, and batches
- Enable training with higher learning rates without divergence
Techniques (a short LayerNorm vs. BatchNorm sketch follows this list)
- Layer Normalization
- Normalizes across all features within each token (sample-wise)
- Used in: Transformers, Speech Models (e.g., wav2vec2, ASR)
- Batch Normalization
- Normalizes each feature across the batch
- Used in: CNNs, MLPs, Image Classification (ResNet, VGG)
- Instance Normalization
- Normalizes each sample and channel separately
- Used in: Style Transfer, Image Generation
- Group Normalization
- Normalizes within groups of channels
- Used in: Vision tasks with small batches (e.g., segmentation, GANs)
- RMSNorm
- Root-mean-square-only scaling (no mean subtraction)
- Used in: Lightweight Transformers, TinyLMs
- Weight Normalization
- Normalizes weight vectors instead of activations
- Used in: Reinforcement Learning, Sparse Models
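A small PyTorch sketch contrasting the two most common choices above; the (batch, time, features) shape mimics a speech feature sequence:

```python
import torch
import torch.nn as nn

x = torch.randn(8, 100, 256)                              # (batch, time, features)

layer_norm = nn.LayerNorm(256)                            # per-token, over the 256 features
batch_norm = nn.BatchNorm1d(256)                          # per-feature, over batch * time

y_ln = layer_norm(x)                                      # (8, 100, 256)
y_bn = batch_norm(x.transpose(1, 2)).transpose(1, 2)      # BatchNorm1d expects (N, C, L)
```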
📍 Why Need Regularization
- Prevent overfitting to training data
- Improve robustness and generalization
- Avoid co-adaptation of neurons
- Stabilize weight growth and gradient flow
Techniques (a short training-step sketch using several of these follows the list)
- L1 Regularization (Lasso)
- Encourages sparsity, useful for feature selection
- L2 Regularization (Ridge)
- Penalizes large weights, discourages complexity
- Elastic Net
- Combines L1 + L2 for balanced sparsity + smoothness
- Dropout
- Randomly removes neurons during training to prevent co-adaptation
- DropConnect
- Randomly removes connections (weights), adds structural noise
- Stochastic Depth
- Randomly skips entire layers, improves ensemble-like diversity
- Early Stopping
- Halts training when validation loss stops improving
- Weight Decay
- Applies L2 penalty during optimizer update (e.g., AdamW)
- Label Smoothing
- Softens targets to avoid overconfidence in classification
- Data Augmentation
- Expands training data via noise, rotation, cropping, etc.
- Mixup / CutMix
- Mix inputs and/or regions from multiple samples for better decision boundaries
- Noise Injection
- Adds Gaussian noise to inputs or gradients for robustness
- Max-Norm Constraint
- Limits the norm of weights for regularized learning
- Gradient Clipping
- Prevents exploding gradients, especially in RNNs
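A compact PyTorch sketch wiring several of these techniques into one training step; all hyperparameter values are illustrative:

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(128, 64), nn.ReLU(), nn.Dropout(p=0.1),     # dropout
    nn.Linear(64, 10),
)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)  # weight decay
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)                           # label smoothing

x, y = torch.randn(32, 128), torch.randint(0, 10, (32,))
loss = criterion(model(x), y)
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)               # gradient clipping
optimizer.step()
```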
References
- ICASSP - IEEE Intl. Conf. on Acoustics, Speech and Signal Processing
- TASLP - IEEE/ACM Trans. on Audio, Speech, and Language Processing