Structures for Speech Processing - 25
- Neural Structures + Signal Terms
CNN
RNN
LSTM
GRU
Pruned LSTM
Transformer
Conformer
Flow Matching
Embedding Space
Signal Processing
Traditional RNN Seq2Seq vs. Transformer
───────────────────────────────────────
Traditional RNN Seq2Seq:
  Input sequence x₁, x₂, ..., xₙ
    ↓
  Sequential processing: h₁ → h₂ → ... → hₙ  (slow, O(n) sequential steps)
    ↓
  RNN encoder: hidden states h₁, h₂, ..., hₙ
    ↓
  Attention mechanism
    ↓
  RNN decoder: y₁ → y₂ → ... → yₘ  (sequential generation)

Transformer:
  Input sequence x₁, x₂, ..., xₙ
    ↓
  Parallel processing: all positions simultaneously  (fast, O(1) sequential steps)
    ↓
  Transformer encoder: self-attention + FFN
    ↓
  Encoder-decoder attention
    ↓
  Transformer decoder: masked self-attention + FFN  (parallel training possible)
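To make the contrast concrete, here is a minimal PyTorch sketch (shapes and module choices are my own illustrative assumptions, not code from this post): the GRU consumes the sequence one step at a time, while self-attention handles all positions in a single matrix operation.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 16, 512)                      # (batch, seq_len = n, d_model)

rnn = nn.GRU(512, 512, batch_first=True)
h_rnn, _ = rnn(x)                                # O(n) sequential recurrence inside

attn = nn.MultiheadAttention(512, num_heads=8, batch_first=True)
h_attn, _ = attn(x, x, x)                        # all n positions attend in parallel
```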
Soft attention in an RNN seq2seq:
  Encoder: x₁ → x₂ → ... → xₙ
            h₁   h₂   ...  hₙ
               ↓ (soft attention)
  context = Σₜ αₜ · hₜ
               ↓
  Decoder: y₁, y₂, ...
Soft vs. Hard Attention
───────────────────────
A shared encoder x₁ → x₂ → … → xₙ produces hidden states h₁, h₂, …, hₙ, which feed either attention branch below.

Soft Attention (differentiable)
  • Complexity: O(n) per decoding step
  • Stability & performance: gradient-smooth and training-stable; averaging over all
    positions makes the signal more global and robust
  • Weights: αₜ = softmax(eₜ)
  • Context: c_soft = Σ αₜ·hₜ   [O(n) computation]
  • Training: standard back-propagation   [O(n) gradient flow]
  • Decoder generates y₁, y₂, …
  • Total training complexity: O(n·T) for output length T

Hard Attention (non-differentiable)
  • Complexity: O(1) per decoding step
  • Stability & performance: quickly locks onto a few high-probability positions;
    sparse and precise, but may suffer mode collapse
  • Sampling: t̂ ∼ Categorical(αₜ)
  • Context: c_hard = h_{t̂}   [O(1) computation]
  • Training: REINFORCE / Gumbel-Softmax   [O(1) gradient estimation]
  • Decoder generates y₁, y₂, …
  • Total training complexity: O(T) for output length T

T: output sequence length (number of decoding steps).
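A minimal PyTorch sketch of the two branches, under assumed shapes and a simple dot-product score (not code from this post):

```python
import torch
import torch.nn.functional as F

def soft_attention(query, enc_states):
    # query: (d,), enc_states: (n, d)
    scores = enc_states @ query             # e_t = <h_t, q>, shape (n,)
    alpha = F.softmax(scores, dim=0)        # α_t = softmax(e_t)
    context = alpha @ enc_states            # c_soft = Σ_t α_t · h_t  -> O(n)
    return context, alpha

def hard_attention(query, enc_states):
    scores = enc_states @ query
    alpha = F.softmax(scores, dim=0)
    t_hat = torch.multinomial(alpha, 1)     # t̂ ~ Categorical(α)
    context = enc_states[t_hat.item()]      # c_hard = h_{t̂}  -> O(1) lookup
    return context, t_hat                   # needs REINFORCE / Gumbel-Softmax to train

n, d = 5, 8
h, q = torch.randn(n, d), torch.randn(d)
c_soft, _ = soft_attention(q, h)
c_hard, _ = hard_attention(q, h)
```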
Analogy: Soft Attention as Democratic Voting, Hard Attention as Expert Selection
────────────────────────────────────────────────────────────────────────────────
Soft Attention (democratic voting)
  Each decoding step runs a public poll over ALL voters:
    h₁: 0.2   h₂: 0.5   h₃: 0.2   h₄: 0.1
  Weighted average of all opinions:
    c = 0.2×h₁ + 0.5×h₂ + 0.2×h₃ + 0.1×h₄
  Cost: O(n) - expensive
  Stability: high
  Precision: global view, but may blur details
  Key: democratic consensus; everyone contributes proportionally.

Hard Attention (expert selection)
  Each decoding step picks one expert in four steps:
    1. eᵢ = score(query, hᵢ)
    2. pᵢ = softmax(eᵢ)
    3. i* ~ Categorical(p)
    4. c = h_{i*}
  Cost: O(1) - cheap
  Stability: requires special training
  Precision: focused, but may miss information
  Key: probabilistic expertise; the highest scorer usually wins, but sampling allows exploration.
Softmax
───────
Extreme-case analysis with input [1, 10, 100]:

Simple normalization path:
  Step 1: sum = 1 + 10 + 100 = 111
  Step 2: [1/111, 10/111, 100/111]
  Step 3: [0.009, 0.09, 0.90]
  Result: the smaller entries still keep some share.

Softmax (exponential) path:
  Step 1: exponential explosion
    exp(1)   ≈ 2.72
    exp(10)  ≈ 22,026
    exp(100) ≈ 2.7×10⁴³
  Step 2: total dominance; the sum ≈ 2.7×10⁴³, since exp(100) dominates
  Step 3: [tiny, tiny, ≈1.0]
  Result: roughly [10⁻⁴³, 10⁻³⁹, ≈1.0]; the winner takes essentially everything.

softmax(xᵢ) = exp(xᵢ) / Σⱼ exp(xⱼ)
softmax(xᵢ, T) = exp(xᵢ/T) / Σⱼ exp(xⱼ/T)   (temperature T)

Why e? Because d/dx(eˣ) = eˣ, which keeps the gradient particularly simple.
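A small NumPy sketch of this comparison (illustrative, not code from the post):

```python
import numpy as np

def softmax(x, T=1.0):
    x = np.asarray(x, dtype=np.float64) / T
    x = x - x.max()                  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum()

x = [1.0, 10.0, 100.0]
print(np.array(x) / np.sum(x))       # [0.009 0.090 0.901] -> still some competition
print(softmax(x))                    # [~1e-43 ~8e-40 ~1.0] -> winner takes all
print(softmax(x, T=50.0))            # higher temperature softens the distribution
```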
FFN
───
The position-wise FFN sits inside each encoder block:
  Input Embeddings + Position Encoding
        ↓
  Multi-Head Self-Attention
        ↓
  Add & Norm (Residual + LayerNorm)
        ↓
  Position-wise Feed-Forward Network (FFN)
        ↓
  Add & Norm (Residual + LayerNorm)
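A minimal PyTorch sketch of the position-wise FFN, assuming the paper's default sizes d_model = 512 and d_ff = 2048:

```python
import torch
import torch.nn as nn

class PositionwiseFFN(nn.Module):
    def __init__(self, d_model=512, d_ff=2048):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff)
        self.w2 = nn.Linear(d_ff, d_model)

    def forward(self, x):                              # x: (batch, seq_len, d_model)
        return self.w2(torch.relu(self.w1(x)))         # FFN(x) = max(0, xW1+b1)W2 + b2
```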
Raw Input → Processed Input
───────────────────────────
Raw input:
  Text: "Hello world"
    ↓
  Tokenization: ["Hello", "world"]
    ↓
  Token IDs: [1045, 2088]
    ↓
  Embedding lookup: x₁ = E[1045], x₂ = E[2088]   (each x ∈ ℝ^(d_model))

Processed input:
  Token embeddings via the projection matrix E ∈ ℝ^(|V|×d_model)
    ↓
  Position encoding:
    PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
    PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
    ↓
  Final input = embedding + position:
    X = [x₁ + PE₁, x₂ + PE₂, ...]
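A minimal sketch of the lookup-and-add step; the vocabulary size and token IDs are illustrative assumptions, and the sinusoidal position encoding itself is sketched later in this post.

```python
import torch
import torch.nn as nn

vocab_size, d_model = 30522, 512
embed = nn.Embedding(vocab_size, d_model)        # E ∈ ℝ^(|V|×d_model)
token_ids = torch.tensor([[1045, 2088]])         # "Hello world" -> IDs (illustrative)
x = embed(token_ids)                             # (1, 2, d_model)
pos = torch.zeros(1, 2, d_model)                 # placeholder for PE(pos, ·); see the PE sketch below
x = x + pos                                      # final input = embedding + position
```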
Multi-Head Self-Attention Computation
─────────────────────────────────────
Input: X ∈ ℝ^(n×d_model)
  ↓
Linear projections:
  Q = XW_Q ∈ ℝ^(n×d_k)
  K = XW_K ∈ ℝ^(n×d_k)
  V = XW_V ∈ ℝ^(n×d_v)
  ↓
Parallel processing with h = 8 heads:
  Head₁: Q₁, K₁, V₁  (d_k = d_v = 64)
  Head₂: Q₂, K₂, V₂  (d_k = d_v = 64)
  ...
  Head₈: Q₈, K₈, V₈  (d_k = d_v = 64)
  ↓
Concatenate heads: Z = Concat(head₁, ..., head₈)
  ↓
Final projection: Output = ZW_O ∈ ℝ^(n×d_model)

Single-head (scaled dot-product) attention:
  Attention(Q, K, V) = softmax(QKᵀ/√d_k) V
  ↓
  Score matrix: S = QKᵀ/√d_k,  S[i,j] = similarity(queryᵢ, keyⱼ)
  ↓
  Attention weights: A = softmax(S),  A[i,j] = how much position i attends to position j
  ↓
  Weighted values: Z = AV,  Z[i] = Σⱼ A[i,j] · V[j]
  ↓
  Output ∈ ℝ^(n×d_v)
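A minimal sketch of a single head (assumed sizes n = 10, d_model = 512, d_k = d_v = 64): projections followed by scaled dot-product attention.

```python
import torch
import torch.nn.functional as F

n, d_model, d_k = 10, 512, 64
X = torch.randn(n, d_model)
W_Q, W_K, W_V = (torch.randn(d_model, d_k) for _ in range(3))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V          # (n, d_k) each
S = Q @ K.T / d_k ** 0.5                     # scores, scaled by sqrt(d_k)
A = F.softmax(S, dim=-1)                     # attention weights, each row sums to 1
Z = A @ V                                    # weighted sum of values, (n, d_v)
```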
Single Encoder Layer: Computational Flow
────────────────────────────────────────
Input: X ∈ ℝ^(n×d_model)   (sequence representation)
  ↓
Multi-head self-attention: Attention(X, X, X)
  (attention scores computed for all position pairs)
  ↓
Residual connection + Add & Norm:
  X₁ = LayerNorm(X + Attention(X))
  ↓
Position-wise FFN: two linear transformations with a ReLU in between
  FFN(x) = max(0, xW₁ + b₁)W₂ + b₂
  W₁ ∈ ℝ^(d_model×d_ff), W₂ ∈ ℝ^(d_ff×d_model), hidden dim d_ff = 2048
  ↓
Residual connection + Add & Norm:
  X₂ = LayerNorm(X₁ + FFN(X₁))
  ↓
Output: X₂ ∈ ℝ^(n×d_model), ready for the next layer
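A minimal from-scratch sketch of one post-norm encoder layer, matching X₁ = LayerNorm(X + Attn(X)) and X₂ = LayerNorm(X₁ + FFN(X₁)); sizes are the paper's defaults, and the attention module is PyTorch's built-in.

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                 nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                                  # x: (batch, n, d_model)
        attn_out, _ = self.attn(x, x, x)                   # self-attention: Q = K = V = x
        x = self.norm1(x + attn_out)                       # X1 = LayerNorm(X + Attn(X))
        x = self.norm2(x + self.ffn(x))                    # X2 = LayerNorm(X1 + FFN(X1))
        return x
```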
Encoder Stack (N = 6 Layers) and Information Flow
─────────────────────────────────────────────────
Input embeddings + positional encoding      (raw sequence information)
  ↓
Encoder layer 1: multi-head attention + FFN (local attention patterns learned first)
  ↓
Encoder layer 2: multi-head attention + FFN (more complex patterns built on the previous layer)
  ↓
Encoder layer 3: multi-head attention + FFN (hierarchical feature extraction continues)
  ↓
... (up to layer 6)                         (deep semantic representations)
  ↓
Final encoder output: Z = [z₁, z₂, ..., zₙ] ∈ ℝ^(n×d_model)   (rich contextual representations)
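For reference, the same N = 6 stack can be sketched with PyTorch's built-in modules (one possible implementation with the paper's default sizes, not the post's own code):

```python
import torch
import torch.nn as nn

layer = nn.TransformerEncoderLayer(d_model=512, nhead=8, dim_feedforward=2048,
                                   batch_first=True)
encoder = nn.TransformerEncoder(layer, num_layers=6)

x = torch.randn(2, 16, 512)        # (batch, seq_len, d_model): embeddings + PE
z = encoder(x)                     # (2, 16, 512): contextual representations
```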
Decoder Layer and Masked Self-Attention
───────────────────────────────────────
Target sequence (training): y = [<START>, w₁, w₂, ..., wₙ]
  ↓
Shifted right (teacher forcing):
  Input:  [<START>, w₁, w₂, ...]
  Target: [w₁, w₂, w₃, ...]
  ↓
Masked self-attention: only attend to previous positions
  Attention mask matrix:
    [ 1 0 0 0 ]   ← position 1 only sees itself
    [ 1 1 0 0 ]   ← position 2 sees 1, 2
    [ 1 1 1 0 ]   ← position 3 sees 1, 2, 3
    [ 1 1 1 1 ]   ← position 4 sees 1, 2, 3, 4
  This prevents "looking ahead" during training and ensures the autoregressive
  property is maintained.
  ↓
Encoder-decoder attention:
  Query: from the decoder
  Key, Value: from the encoder output Z
  ↓
Position-wise FFN (same as the encoder FFN)
  ↓
Output predictions
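A minimal sketch of building the causal (look-ahead) mask shown above and applying it to raw scores before the softmax:

```python
import torch

n = 4
mask = torch.tril(torch.ones(n, n, dtype=torch.bool))    # row i sees columns <= i
scores = torch.randn(n, n)                               # raw attention scores
masked = scores.masked_fill(~mask, float('-inf'))        # block future positions
weights = torch.softmax(masked, dim=-1)                  # rows still sum to 1
```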
Scaled Dot-Product Attention Computation
────────────────────────────────────────
Formula:
  Attention(Q, K, V) = softmax(QKᵀ/√d_k) V
  where Q = queries ∈ ℝ^(n×d_k), K = keys ∈ ℝ^(m×d_k), V = values ∈ ℝ^(m×d_v),
  and d_k is the key dimension.

Scaling factor 1/√d_k: prevents the softmax from saturating when d_k is large.
Example with d_k = 64: √d_k = √64 = 8.

Step-by-step calculation:
  Step 1: compute dot products
    S = QKᵀ ∈ ℝ^(n×m),  S[i,j] = Σ_k Q[i,k] · K[j,k]
  Step 2: scale by √d_k
    S_scaled = S / √64 = S / 8
  Step 3: apply softmax row-wise
    A[i,j] = exp(S_scaled[i,j]) / Σ_k exp(S_scaled[i,k])
  Step 4: weighted sum of values
    Output[i] = Σⱼ A[i,j] · V[j]
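A tiny made-up numeric walk-through of the four steps (n = m = 2, d_k = 2; the matrices are arbitrary):

```python
import numpy as np

Q = np.array([[1.0, 0.0], [0.0, 1.0]])
K = np.array([[1.0, 0.0], [1.0, 1.0]])
V = np.array([[10.0, 0.0], [0.0, 10.0]])

S = Q @ K.T                                   # step 1: dot products
S_scaled = S / np.sqrt(2)                     # step 2: scale by sqrt(d_k)
A = np.exp(S_scaled)
A = A / A.sum(axis=1, keepdims=True)          # step 3: row-wise softmax
out = A @ V                                   # step 4: weighted sum of values
print(A)                                      # each row sums to 1
print(out)
```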
Multi-Head Attention Detailed Computation
─────────────────────────────────────────
Single attention (d_model = 512):
  all 512 dimensions are processed together; limited representation power.
  ↓
Multi-head (h = 8, d_k = d_v = 64):
  each head works in a 512/8 = 64-dimensional subspace,
  i.e. a different representation subspace.
  ↓
Benefits:
  • attend to different types of information
  • attend to different positions simultaneously
  • capture, e.g., syntactic vs. semantic relationships

Parallel head processing:
  Head 1 (d_k = d_v = 64): Q₁ = XW₁^Q, K₁ = XW₁^K, V₁ = XW₁^V,  A₁ = Attention(Q₁, K₁, V₁)
  Head 2 (d_k = d_v = 64): Q₂ = XW₂^Q, K₂ = XW₂^K, V₂ = XW₂^V,  A₂ = Attention(Q₂, K₂, V₂)
  ... (heads run in parallel)
  Head 8 (d_k = d_v = 64): Q₈ = XW₈^Q, K₈ = XW₈^K, V₈ = XW₈^V,  A₈ = Attention(Q₈, K₈, V₈)
  ↓
Concatenation and final projection:
  MultiHead = Concat(A₁, A₂, ..., A₈) W_O
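A minimal sketch of multi-head attention that splits d_model = 512 into h = 8 heads of d_k = 64, then concatenates and projects with W_O (a from-scratch illustration, not the post's code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.h, self.d_k = n_heads, d_model // n_heads
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_o = nn.Linear(d_model, d_model)

    def forward(self, x):                                    # x: (batch, n, d_model)
        b, n, _ = x.shape
        # project, split the last dim into (h, d_k), and move heads forward
        q, k, v = (w(x).view(b, n, self.h, self.d_k).transpose(1, 2)
                   for w in (self.w_q, self.w_k, self.w_v))  # each (b, h, n, d_k)
        scores = q @ k.transpose(-2, -1) / self.d_k ** 0.5   # (b, h, n, n)
        attn = F.softmax(scores, dim=-1)
        z = attn @ v                                          # (b, h, n, d_k)
        z = z.transpose(1, 2).reshape(b, n, self.h * self.d_k)  # concat heads
        return self.w_o(z)                                    # final projection W_O
```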
Position Encoding Necessity and Sinusoidal Encoding
───────────────────────────────────────────────────
Why it is needed: attention is permutation equivariant, so "cat sat mat" and
"mat sat cat" yield the same attention weights. Position information must be
injected explicitly.

Two options:
  1. Learned position embeddings
  2. Sinusoidal encoding
The Transformer chooses the sinusoidal encoding, mainly because it generalizes
to sequences longer than those seen in training.

Sinusoidal position encoding:
  PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
  PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))

Properties:
  • different wavelengths for each dimension
  • encodes relative position relationships
  • can handle variable sequence lengths

Example (d_model = 4):
  pos = 0: [sin(0/10000^0), cos(0/10000^0), sin(0/10000^0.5), cos(0/10000^0.5)]
  pos = 1: [sin(1/10000^0), cos(1/10000^0), sin(1/10000^0.5), cos(1/10000^0.5)]

Added to the word embeddings: Final_input = Word_emb + Pos_emb
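A minimal NumPy sketch of the sinusoidal encoding table (the maximum length is an assumption):

```python
import numpy as np

def sinusoidal_pe(max_len=100, d_model=512):
    pos = np.arange(max_len)[:, None]                      # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]                   # (1, d_model/2)
    angles = pos / np.power(10000.0, 2 * i / d_model)      # pos / 10000^(2i/d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                           # even dimensions: sin
    pe[:, 1::2] = np.cos(angles)                           # odd dimensions: cos
    return pe

pe = sinusoidal_pe()
# final_input = word_embeddings + pe[:seq_len]
```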
Training Process and Loss Computation
─────────────────────────────────────
Teacher forcing:
  Input:  [<START>, "Hello", "world"]
  Target: ["Hello", "world", <END>]
  ↓
Parallel training: all positions are computed simultaneously (unlike the
sequential RNN).
  ↓
Decoder output: logits ∈ ℝ^(vocab) for each position
  ↓
Softmax → probabilities: P = softmax(logits)
  ↓
Compute the loss against the ground truth.

Cross-entropy loss:
  Given target ["Hello", "world", <END>] and predictions [P₁, P₂, P₃]:
  Loss = -Σᵢ log(Pᵢ[correct_tokenᵢ])
  where Pᵢ is the probability distribution over the vocabulary at position i.

Label smoothing (ε = 0.1):
  instead of hard targets [0, 0, 1, 0, ...], use soft targets such as
  [0.025, 0.025, 0.9, 0.025, ...]; this improves generalization and prevents
  overconfidence.
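A minimal sketch of the teacher-forced loss with label smoothing ε = 0.1; the vocabulary size and token IDs are made up for illustration.

```python
import torch
import torch.nn.functional as F

vocab_size = 1000
logits = torch.randn(3, vocab_size)            # decoder logits for 3 positions
targets = torch.tensor([17, 42, 7])            # IDs for "Hello", "world", <END> (illustrative)
loss = F.cross_entropy(logits, targets, label_smoothing=0.1)
```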
Inference Process (Autoregressive)
──────────────────────────────────
Inference differs from training: there is no teacher forcing, and generation is
sequential, each step depending on the previous ones.

Step-by-step generation:
  Step 1: input [<START>]          → decoder output P₁, take w₁ = argmax(P₁)
  Step 2: input [<START>, w₁]      → decoder output P₂, take w₂ = argmax(P₂)
  Step 3: input [<START>, w₁, w₂]  → decoder output P₃, take w₃ = argmax(P₃)
  ...
  Continue until the <END> token is generated or max_length is reached.
  ↓
Final sequence: [w₁, w₂, ..., wₙ]

Efficiency note: in practice, previous computations can be reused with KV caching.
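A minimal sketch of greedy autoregressive decoding; `model.encode`/`model.decode` and the special-token IDs are hypothetical names for illustration, not an API from this post.

```python
import torch

def greedy_decode(model, src, start_id, end_id, max_length=50):
    memory = model.encode(src)                            # run the encoder once
    ys = torch.tensor([[start_id]])                       # current output: [<START>]
    for _ in range(max_length):
        logits = model.decode(ys, memory)                 # (1, cur_len, vocab)
        next_id = int(logits[0, -1].argmax())             # w_t = argmax(P_t)
        ys = torch.cat([ys, torch.tensor([[next_id]])], dim=1)
        if next_id == end_id:                             # stop at <END>
            break
    return ys[0, 1:]                                      # generated tokens w_1 .. w_T
```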