Structures for Speech Processing - 25

  • Neural Structures + Signal Terms



  • CNN
  • RNN
  • LSTM
  • GRU
  • Pruned LSTM
  • Transformer
  • Conformer
  • Flow Matching
  • Embedding Space
  • Signal Processing



Traditional RNN Seq2Seq                           Transformer
─────────────────────                            ──────────────

Input Sequence                                   Input Sequence
x₁, xβ‚‚, ..., xβ‚™                                  x₁, xβ‚‚, ..., xβ‚™
     β”‚                                               β”‚
     β–Ό                                               β–Ό
Sequential Processing                             Parallel Processing
h₁ β†’ hβ‚‚ β†’ ... β†’ hβ‚™                               All positions simultaneously
(SLOW: O(n) sequential steps)                    (FAST: O(1) sequential steps)
     β”‚                                               β”‚
     β–Ό                                               β–Ό
RNN Encoder                                       Transformer Encoder
Hidden states: h₁, hβ‚‚, ..., hβ‚™                   Self-Attention + FFN
     β”‚                                               β”‚
     β–Ό                                               β–Ό
Attention Mechanism                               Encoder-Decoder Attention
     β”‚                                               β”‚
     β–Ό                                               β–Ό
RNN Decoder                                       Transformer Decoder
y₁ β†’ yβ‚‚ β†’ ... β†’ yβ‚˜                               Masked Self-Attention + FFN
(Sequential generation)                           (Parallel training possible)
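The difference in sequential depth is easy to see in code. A minimal NumPy sketch (shapes and weight names are illustrative, not from any particular library): the RNN encoder needs n dependent steps, while self-attention updates every position with one chain of matrix products.

import numpy as np

rng = np.random.default_rng(0)
n, d = 6, 8                          # sequence length, feature dim
X = rng.standard_normal((n, d))      # input sequence x1 ... xn

# RNN-style encoding: n sequential steps, h_t depends on h_{t-1}
W_x, W_h = rng.standard_normal((d, d)), rng.standard_normal((d, d))
h = np.zeros(d)
for t in range(n):                   # O(n) dependent steps
    h = np.tanh(X[t] @ W_x + h @ W_h)

# Self-attention encoding: all positions at once, no step-by-step dependency
W_q, W_k, W_v = (rng.standard_normal((d, d)) for _ in range(3))
Q, K, V = X @ W_q, X @ W_k, X @ W_v
S = Q @ K.T / np.sqrt(d)
A = np.exp(S - S.max(axis=-1, keepdims=True))
A /= A.sum(axis=-1, keepdims=True)   # row-wise softmax
Z = A @ V                            # every position updated simultaneously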


x₁ β†’ xβ‚‚ β†’ ... β†’ xβ‚™
 ↓     ↓        ↓
h₁    hβ‚‚     ... hβ‚™
       ↓ (soft attention)
   context = Ξ£ Ξ±β‚œ Β· hβ‚œ
       ↓
   Decoder y₁, yβ‚‚, ...


\[\begin{aligned} \text{Soft Attention:}\quad &\mathbf{c}_{\text{soft}} \;=\; \sum_{t=1}^{n} \alpha_t\,\mathbf{h}_t, \qquad \alpha_t \;=\; \frac{\exp(e_t)}{\displaystyle\sum_{k=1}^{n}\exp(e_k)}, \qquad e_t = f\!\bigl(\mathbf{q},\mathbf{k}_t\bigr) \\[1em] \text{Hard Attention:}\quad &\tilde t \;\sim\; \operatorname{Categorical}\!\bigl(\alpha_1,\dots,\alpha_n\bigr), \qquad \mathbf{c}_{\text{hard}} \;=\; \mathbf{h}_{\tilde t}, \\[4pt] &\mathcal{L}_{\text{hard}} \;=\; \mathbb{E}_{\tilde t\sim\alpha}\!\bigl[ \ell\!\bigl(\mathbf{c}_{\text{hard}},\text{target}\bigr) \bigr] \;\; \text{(optimized via REINFORCE or Gumbel-Softmax).} \end{aligned}\]
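The same two context vectors in a short NumPy sketch, with the score taken as a plain dot product f(q, hβ‚œ) = qΒ·hβ‚œ for illustration:

import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 8
H = rng.standard_normal((n, d))      # encoder states h_1 ... h_n
q = rng.standard_normal(d)           # decoder query

e = H @ q                            # scores e_t = f(q, h_t)
alpha = np.exp(e - e.max())
alpha /= alpha.sum()                 # alpha_t = softmax(e_t)

c_soft = alpha @ H                   # soft: weighted average of ALL states
t_hat = rng.choice(n, p=alpha)       # hard: draw one position ~ Categorical(alpha)
c_hard = H[t_hat]                    # hard: keep exactly one state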


Encoder
x₁ β†’ xβ‚‚ β†’ … β†’ xβ‚™
 ↓     ↓        ↓
h₁    hβ‚‚   …   hβ‚™
 β”‚     β”‚        β”‚
 β””β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”˜
       β”‚
   β”Œβ”€β”€β”€β”΄β”€β”€β”€β”
   β”‚       β”‚
   β–Ό       β–Ό

Soft Attention (differentiable)      Hard Attention (non-differentiable)
β”‚  β€’ Complexity: O(n) per step       β”‚  β€’ Complexity: O(1) per step
β”‚  β€’ Stability & Performance:        β”‚  β€’ Stability & Performance:
β”‚    gradient-smooth, training-      β”‚    quickly locks onto a few
β”‚    stable; averages teacher        β”‚    high-probability positions
β”‚    signal β†’ more global & robust.  β”‚    β†’ sparse and precise but
β”‚                                    β”‚    may suffer mode-collapse.
β”‚                                    β”‚
β–Ό                                    β–Ό
Ξ±β‚œ = softmax(eβ‚œ)                     tΜƒ ∼ Categorical(Ξ±β‚œ)
β”‚                                    β”‚
β–Ό                                    β–Ό
c_soft  = Ξ£ Ξ±β‚œΒ·hβ‚œ                    c_hard = h_{tΜƒ}
β”‚  [O(n) computation]                β”‚  [O(1) computation]
β–Ό                                    β–Ό
Standard back-propagation            REINFORCE / Gumbel-Softmax
β”‚  [O(n) gradient flow]              β”‚  [O(1) gradient estimation]
β–Ό                                    β–Ό
Decoder generates y₁, yβ‚‚, …          Decoder generates y₁, yβ‚‚, …

Total Training Complexity:           Total Training Complexity:
O(nΒ·T) for sequence length T         O(T) for sequence length T

T: output sequence length (number of decoding steps)
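Because the hard branch samples a discrete index, gradients cannot flow through it directly; one workaround named above is the Gumbel-Softmax relaxation. A forward-pass-only sketch (in a real model this runs inside an autograd framework so gradients pass through the relaxed weights):

import numpy as np

def gumbel_softmax(logits, tau=0.5, rng=np.random.default_rng(0)):
    """Relaxed one-hot sample standing in for t~ ~ Categorical(alpha).
    As tau -> 0 the output approaches a hard one-hot selection."""
    g = -np.log(-np.log(rng.uniform(size=logits.shape)))   # Gumbel(0,1) noise
    y = (logits + g) / tau
    y = np.exp(y - y.max())
    return y / y.sum()

scores = np.array([0.2, 2.0, 0.1, -1.0])    # e_t for n = 4 encoder states
w = gumbel_softmax(scores, tau=0.1)         # nearly one-hot: c ~ the selected h_t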


πŸ—³οΈ Soft Attention (Democratic Voting)           πŸ‘‘ Hard Attention (Expert Selection)
════════════════════════════════               ═══════════════════════════

Each decoding step:                            Each decoding step:
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                            β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  πŸ›οΈ Public Poll β”‚                            β”‚  🎯 Expert Pick β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                            β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
         β”‚                                              β”‚
         β–Ό                                              β–Ό
Consult ALL voters:                            4-Step Expert Selection:
β”Œβ”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”                      β”Œβ”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”
β”‚ h₁  β”‚ hβ‚‚  β”‚ h₃  β”‚ hβ‚„  β”‚                      β”‚ h₁  β”‚ hβ‚‚  β”‚ h₃  β”‚ hβ‚„  β”‚
β”‚ πŸ—£οΈ  β”‚ πŸ—£οΈ  β”‚ πŸ—£οΈ  β”‚ πŸ—£οΈ  β”‚                      β”‚ 😴  β”‚ 🎯  β”‚ 😴  β”‚ 😴  β”‚
β”‚0.2  β”‚0.5  β”‚0.2  β”‚0.1  β”‚                      β”‚Scoreβ”‚PICK β”‚     β”‚     β”‚
β””β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”˜                      β””β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”˜
         β”‚                                              β”‚
         β–Ό                                              β–Ό
Weighted average of ALL opinions:              Expert Selection Process:
c = 0.2Γ—h₁ + 0.5Γ—hβ‚‚ + 0.2Γ—h₃ + 0.1Γ—hβ‚„         1. eα΅’ = score(query, hα΅’)
                                               2. pα΅’ = softmax(eα΅’) 
                                               3. i* ~ Categorical(p)
                                               4. c = h_{i*}

πŸ’° Cost: O(n) - Expensive                      πŸ’° Cost: O(1) - Cheap  
πŸ“Š Stability: High                             πŸ“Š Stability: Requires special training
🎯 Precision: Global view but may blur         🎯 Precision: Focused but may miss info

Key: Democratic consensus - everyone           Key: Probabilistic expertise - highest scorer
     contributes proportionally                     usually wins, but allows exploration


Softmax

Extreme Case Analysis
─────────────────────

Input: [1, 10, 100]
         β”‚
         β–Ό
Simple Normalization Path:              Softmax Exponential Path:
Step 1: Sum = 1+10+100 = 111           Step 1: Exponential Explosion
Step 2: [1/111, 10/111, 100/111]             exp(1) = 2.72
Step 3: [0.009, 0.09, 0.90]                  exp(10) = 22,026  ← HUGE!
                                              exp(100) = 2.7Γ—10⁴³  ← ASTRONOMICAL!
Result: Still some competition                    β”‚
                                                  β–Ό
                                         Step 2: Total dominance
                                         Sum β‰ˆ 2.7Γ—10⁴³ (exp(100) dominates)
                                                  β”‚
                                                  β–Ό
                                         Step 3: [tiny, tiny, ~1.0]
                                         Result: [β‰ˆ1Γ—10⁻⁴³, β‰ˆ8Γ—10⁻⁴⁰, β‰ˆ1.0]
                                         
                                         Winner takes EVERYTHING!
softmax(xα΅’) = exp(xα΅’) / Ξ£β±Ό exp(xβ±Ό)
softmax(x, T) = exp(x/T) / Ξ£ exp(x/T)

β†’ Why e?  The exponential is its own derivative, which keeps softmax gradients simple:
d/dx(eΛ£) = eΛ£
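A numerically stable implementation with the temperature parameter from the second formula; the [1, 10, 100] example above can be reproduced directly:

import numpy as np

def softmax(x, T=1.0):
    """softmax(x, T) = exp(x/T) / Ξ£ exp(x/T); max-subtraction avoids overflow."""
    z = np.asarray(x, dtype=float) / T
    z = np.exp(z - z.max())
    return z / z.sum()

x = np.array([1.0, 10.0, 100.0])
print(x / x.sum())          # simple normalization: [0.009 0.090 0.901]
print(softmax(x))           # T = 1: winner takes (almost) everything
print(softmax(x, T=50.0))   # higher temperature -> softer, more even distribution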


FFN

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚   Input Embeddings      β”‚
β”‚   + Position Encoding   β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
          β”‚
          β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Multi-Head             β”‚
β”‚  Self-Attention         β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
          β”‚
          β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Add & Norm             β”‚
β”‚  (Residual + LayerNorm) β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
          β”‚
          β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Position-wise          β”‚
β”‚  Feed-Forward Network   β”‚
β”‚  (FFN)                  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
          β”‚
          β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Add & Norm             β”‚
β”‚  (Residual + LayerNorm) β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
          β”‚
          β–Ό 
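
The FFN block highlighted above is a two-layer MLP applied to each position independently (the same W₁, Wβ‚‚ at every position). A minimal sketch with the usual sizes d_model = 512, d_ff = 2048; the weights here are random placeholders:

import numpy as np

n, d_model, d_ff = 10, 512, 2048
rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((d_model, d_ff)) * 0.02, np.zeros(d_ff)
W2, b2 = rng.standard_normal((d_ff, d_model)) * 0.02, np.zeros(d_model)

def ffn(X):
    """FFN(x) = max(0, xW1 + b1)W2 + b2, applied position-wise."""
    return np.maximum(0.0, X @ W1 + b1) @ W2 + b2

X = rng.standard_normal((n, d_model))   # one row per position
print(ffn(X).shape)                     # (10, 512): shape preserved per position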


Raw Input                                        Processed Input
─────────                                       ──────────────

Text: "Hello world"                             Token Embeddings
     β”‚                                               β”‚
     β–Ό                                               β–Ό
Tokenization                                    Linear Projection
["Hello", "world"]                              E ∈ β„α΅›Λ£α΅ˆα΅α΅’α΅ˆα΅‰Λ‘
     β”‚                                               β”‚
     β–Ό                                               β–Ό
Token IDs                                       Position Encoding
[1045, 2088]                                    PE(pos,2i) = sin(pos/10000^(2i/d_model))
     β”‚                                          PE(pos,2i+1) = cos(pos/10000^(2i/d_model))
     β–Ό                                               β”‚
Embedding Lookup                                     β–Ό
x₁ = E[1045], xβ‚‚ = E[2088]                     Final Input = Embedding + Position
(each x ∈ β„α΅ˆα΅α΅’α΅ˆα΅‰Λ‘)                              X = [x₁+PE₁, xβ‚‚+PEβ‚‚, ...]
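End to end, input preparation is a table lookup plus an additive position signal. A sketch with illustrative sizes; the token IDs 1045 and 2088 come from the example above, and the embedding table here is random:

import numpy as np

vocab_size, d_model = 30000, 512
rng = np.random.default_rng(0)
E = rng.standard_normal((vocab_size, d_model)) * 0.02   # E in R^(V x d_model)

token_ids = [1045, 2088]          # "Hello", "world" after tokenization
x = E[token_ids]                  # lookup: x1 = E[1045], x2 = E[2088]

# sinusoidal position encoding for positions 0 and 1 (full function further below)
pos = np.arange(len(token_ids))[:, None]
two_i = np.arange(0, d_model, 2)[None, :]
angles = pos / np.power(10000.0, two_i / d_model)
PE = np.zeros((len(token_ids), d_model))
PE[:, 0::2], PE[:, 1::2] = np.sin(angles), np.cos(angles)

X = x + PE                        # final input: embedding + position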



Multi-Head Self-Attention Computation
─────────────────────────────────────

Input: X ∈ β„βΏΛ£α΅ˆα΅α΅’α΅ˆα΅‰Λ‘                           Single Head Attention
     β”‚                                               β”‚
     β–Ό                                               β–Ό
Linear Projections                              Scaled Dot-Product Attention
Q = XW_Q ∈ β„βΏΛ£α΅ˆβ‚–                               Attention(Q,K,V) = softmax(QKα΅€/√dβ‚–)V
K = XW_K ∈ β„βΏΛ£α΅ˆβ‚–                                        β”‚
V = XW_V ∈ β„βΏΛ£α΅ˆα΅₯                                        β–Ό
     β”‚                                          Score Matrix: S = QKα΅€/√dβ‚–
     β–Ό                                          S[i,j] = similarity(query_i, key_j)
Parallel Processing (h=8 heads)                         β”‚
Head₁: Q₁, K₁, V₁ (dβ‚–=dα΅₯=64)                          β–Ό
Headβ‚‚: Qβ‚‚, Kβ‚‚, Vβ‚‚ (dβ‚–=dα΅₯=64)                   Attention Weights: A = softmax(S)
...                                            A[i,j] = how much position i attends to j
Headβ‚ˆ: Qβ‚ˆ, Kβ‚ˆ, Vβ‚ˆ (dβ‚–=dα΅₯=64)                          β”‚
     β”‚                                                 β–Ό
     β–Ό                                          Weighted Values: Z = AV
Concatenate Heads                               Z[i] = Ξ£β±Ό A[i,j] * V[j]
Z = Concat(head₁, ..., headβ‚ˆ)                          β”‚
     β”‚                                                 β–Ό
     β–Ό                                          Output ∈ β„βΏΛ£α΅ˆα΅₯
Final Projection
Output = ZW_O ∈ β„βΏΛ£α΅ˆα΅α΅’α΅ˆα΅‰Λ‘
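All of the above fits in a short function. A NumPy sketch with h = 8 heads and d_model = 512; the random weight initialization and the absence of masking and dropout are simplifications:

import numpy as np

def softmax(x, axis=-1):
    z = np.exp(x - x.max(axis=axis, keepdims=True))
    return z / z.sum(axis=axis, keepdims=True)

def multi_head_self_attention(X, h=8, seed=0):
    """Project X to Q, K, V, run h scaled dot-product heads in parallel,
    concatenate the heads, and apply the output projection W_O."""
    n, d_model = X.shape
    d_k = d_model // h                                   # 512 / 8 = 64 per head
    rng = np.random.default_rng(seed)
    W_Q, W_K, W_V, W_O = (rng.standard_normal((d_model, d_model)) * 0.02
                          for _ in range(4))
    # (n, d_model) -> (h, n, d_k): split the feature dim across heads
    split = lambda M: M.reshape(n, h, d_k).transpose(1, 0, 2)
    Q, K, V = split(X @ W_Q), split(X @ W_K), split(X @ W_V)
    S = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)          # scores per head
    A = softmax(S)                                       # attention weights
    Z = (A @ V).transpose(1, 0, 2).reshape(n, d_model)   # concat heads
    return Z @ W_O                                       # final projection

X = np.random.default_rng(1).standard_normal((10, 512))
print(multi_head_self_attention(X).shape)                # (10, 512)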


Single Encoder Layer                           Computational Flow
────────────────────                          ─────────────────

Input: X ∈ β„βΏΛ£α΅ˆα΅α΅’α΅ˆα΅‰Λ‘                          X (sequence representation)
     β”‚                                               β”‚
     β–Ό                                               β–Ό
Multi-Head Self-Attention                      Calculate attention scores
Attention(X, X, X)                            for all position pairs
     β”‚                                               β”‚
     β–Ό                                               β–Ό
Residual Connection                            Add & Norm
X₁ = LayerNorm(X + Attention(X))              X₁ = LayerNorm(X + SubLayer₁(X))
     β”‚                                               β”‚
     β–Ό                                               β–Ό
Position-wise FFN                             Two linear transformations
FFN(x) = max(0, xW₁ + b₁)Wβ‚‚ + bβ‚‚             ReLU activation between them
W₁ ∈ β„α΅ˆα΅α΅’α΅ˆα΅‰Λ‘Λ£α΅ˆff, Wβ‚‚ ∈ β„α΅ˆffΛ£α΅ˆα΅α΅’α΅ˆα΅‰Λ‘           Hidden dim: dff = 2048
     β”‚                                               β”‚
     β–Ό                                               β–Ό
Residual Connection                            Add & Norm
Xβ‚‚ = LayerNorm(X₁ + FFN(X₁))                  Xβ‚‚ = LayerNorm(X₁ + SubLayerβ‚‚(X₁))
     β”‚                                               β”‚
     β–Ό                                               β–Ό
Output: Xβ‚‚ ∈ β„βΏΛ£α΅ˆα΅α΅’α΅ˆα΅‰Λ‘                        Ready for next layer
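Wiring the two sublayers together with residual connections and LayerNorm (post-norm, as written above). This sketch takes the sublayers as callables, e.g. the multi_head_self_attention and ffn sketches from earlier; LayerNorm gain and bias are omitted for brevity:

import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each position over the feature dimension."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def encoder_layer(X, self_attn, ffn):
    """X1 = LayerNorm(X + Attention(X)); X2 = LayerNorm(X1 + FFN(X1))."""
    X1 = layer_norm(X + self_attn(X))    # sublayer 1 + residual + norm
    X2 = layer_norm(X1 + ffn(X1))        # sublayer 2 + residual + norm
    return X2

# Usage: X2 = encoder_layer(X, multi_head_self_attention, ffn)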


Encoder Stack (N=6 layers)                    Information Flow
──────────────────────────                   ───────────────

Input Embeddings + Positional Encoding       Raw sequence information
     β”‚                                               β”‚
     β–Ό                                               β–Ό
Encoder Layer 1                               Local attention patterns
Multi-Head Attention + FFN                    learned in first layer
     β”‚                                               β”‚
     β–Ό                                               β–Ό
Encoder Layer 2                               More complex patterns
Multi-Head Attention + FFN                    building on previous layer
     β”‚                                               β”‚
     β–Ό                                               β–Ό
Encoder Layer 3                               Hierarchical feature
Multi-Head Attention + FFN                    extraction continues
     β”‚                                               β”‚
     β–Ό                                               β–Ό
... (up to Layer 6)                          Deep semantic
                                             representations
     β”‚                                               β”‚
     β–Ό                                               β–Ό
Final Encoder Output                          Rich contextual
Z = [z₁, zβ‚‚, ..., zβ‚™] ∈ β„βΏΛ£α΅ˆα΅α΅’α΅ˆα΅‰Λ‘            representations


Decoder Layer                                 Masked Self-Attention
─────────────                                ─────────────────────

Target Sequence (Training)                   Attention Mask Matrix
y = [<START>, w₁, wβ‚‚, ..., wβ‚˜]               β”‚ 1 0 0 0 β”‚  ← position 1 only sees itself
     β”‚                                       β”‚ 1 1 0 0 β”‚  ← position 2 sees 1,2
     β–Ό                                       β”‚ 1 1 1 0 β”‚  ← position 3 sees 1,2,3
Shifted Right (Teacher Forcing)              β”‚ 1 1 1 1 β”‚  ← position 4 sees 1,2,3,4
Input: [<START>, w₁, wβ‚‚, ...]                       β”‚
Target: [w₁, wβ‚‚, w₃, ...]                          β–Ό
     β”‚                                       Prevents "looking ahead"
     β–Ό                                       during training
Masked Self-Attention                               β”‚
Only attend to previous positions                   β–Ό
     β”‚                                       Ensures autoregressive property
     β–Ό                                       is maintained during training
Encoder-Decoder Attention
Query: from decoder
Key, Value: from encoder output Z
     β”‚
     β–Ό
Position-wise FFN
Same as encoder FFN
     β”‚
     β–Ό
Output Predictions
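The mask matrix on the right is just a lower-triangular boolean matrix; applying it before the softmax gives zero weight to all "future" positions. A small standalone sketch:

import numpy as np

def causal_mask(m):
    """Lower-triangular mask: position i may only attend to positions <= i."""
    return np.tril(np.ones((m, m), dtype=bool))

def masked_softmax(S, mask):
    """Set disallowed scores to -inf so they receive exactly zero weight."""
    S = np.where(mask, S, -np.inf)
    z = np.exp(S - S.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)

S = np.random.default_rng(0).standard_normal((4, 4))   # raw decoder self-attention scores
A = masked_softmax(S, causal_mask(4))
print(causal_mask(4).astype(int))   # matches the 4x4 mask matrix above
print(A[0])                         # row 1: all weight on position 1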


Scaled Dot-Product Attention Computation
───────────────────────────────────────

Mathematical Formula                         Step-by-Step Calculation
──────────────────                         ────────────────────────

Attention(Q,K,V) = softmax(QKα΅€/√dβ‚–)V       Given:
                                           Q ∈ β„βΏΛ£α΅ˆβ‚– (queries)
Where:                                     K ∈ β„α΅Λ£α΅ˆβ‚– (keys)  
Q = queries                                V ∈ β„α΅Λ£α΅ˆα΅₯ (values)
K = keys                                          β”‚
V = values                                        β–Ό
dβ‚– = key dimension                         Step 1: Compute dot products
                                          S = QKα΅€ ∈ ℝⁿˣᡐ
Scaling Factor: 1/√dβ‚–                     S[i,j] = Ξ£β‚– Q[i,k] * K[j,k]
Why? Prevents saturation of softmax               β”‚
when dβ‚– is large                                  β–Ό
                                          Step 2: Scale by √dβ‚–
Example with dβ‚–=64:                       S_scaled = S / √64 = S / 8
√dβ‚– = √64 = 8                                     β”‚
                                                  β–Ό
                                          Step 3: Apply softmax
                                          A[i,j] = exp(S_scaled[i,j]) / Ξ£β‚– exp(S_scaled[i,k])
                                                  β”‚
                                                  β–Ό
                                          Step 4: Weighted sum of values
                                          Output[i] = Ξ£β±Ό A[i,j] * V[j]
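The effect of the 1/√dβ‚– factor can be checked numerically: with dβ‚– = 64 the raw dot products have standard deviation around √dβ‚– = 8, which typically pushes the softmax into its saturated region. A small sketch (random vectors, so exact numbers vary):

import numpy as np

rng = np.random.default_rng(0)
d_k = 64
q = rng.standard_normal(d_k)
K = rng.standard_normal((10, d_k))          # 10 keys

scores = K @ q                              # raw dot products, std ~ sqrt(d_k) = 8
scaled = scores / np.sqrt(d_k)              # Step 2: divide by 8

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

print(softmax(scores).max())   # typically close to 1: softmax nearly saturated
print(softmax(scaled).max())   # noticeably flatter attention distribution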


Multi-Head Attention Detailed Computation
────────────────────────────────────────

Single Model vs Multi-Head                  Parallel Head Processing
─────────────────────                       ─────────────────────

Single Attention (d_model=512)               Head 1 (dβ‚–=dα΅₯=64):
All 512 dimensions together                  Q₁ = XW₁Q, K₁ = XW₁K, V₁ = XW₁V
Limited representation power                 A₁ = Attention(Q₁, K₁, V₁)
     β”‚                                              β”‚
     β–Ό                                              β–Ό
Multi-Head (h=8, dβ‚–=dα΅₯=64)                 Head 2 (dβ‚–=dα΅₯=64):
Each head: 512/8 = 64 dimensions           Qβ‚‚ = XWβ‚‚Q, Kβ‚‚ = XWβ‚‚K, Vβ‚‚ = XWβ‚‚V
Different representation subspaces          Aβ‚‚ = Attention(Qβ‚‚, Kβ‚‚, Vβ‚‚)
     β”‚                                              β”‚
     β–Ό                                              β–Ό
Benefits:                                   ... (parallel computation)
- Attend to different types of info                β”‚
- Different positions simultaneously              β–Ό
- Syntactic vs semantic relationships      Head 8 (dβ‚–=dα΅₯=64):
                                          Qβ‚ˆ = XWβ‚ˆQ, Kβ‚ˆ = XWβ‚ˆK, Vβ‚ˆ = XWβ‚ˆV
Concatenation:                            Aβ‚ˆ = Attention(Qβ‚ˆ, Kβ‚ˆ, Vβ‚ˆ)
MultiHead = Concat(A₁, Aβ‚‚, ..., Aβ‚ˆ)              β”‚
Final projection: MultiHeadWβ‚€                    β–Ό
                                          Final Output:
                                          Concat(A₁, Aβ‚‚, ..., Aβ‚ˆ)Wβ‚€


Position Encoding Necessity                 Sinusoidal Position Encoding
─────────────────────────                  ──────────────────────────

Problem: Attention is permutation          Formula:
equivariant                                PE(pos,2i) = sin(pos/10000^(2i/d_model))
                                           PE(pos,2i+1) = cos(pos/10000^(2i/d_model))
"cat sat mat" vs "mat sat cat"                    β”‚
Same attention weights!                           β–Ό
     β”‚                                     Properties:
     β–Ό                                     - Different wavelengths for each dim
Need position information                  - Relative position relationships
     β”‚                                     - Can handle variable lengths
     β–Ό                                            β”‚
Two options:                                      β–Ό
1. Learned embeddings                      Example (d_model=4):
2. Sinusoidal encoding                     pos=0: [sin(0/10000^0), cos(0/10000^0),
     β”‚                                             sin(0/10000^0.5), cos(0/10000^0.5)]
     β–Ό                                     pos=1: [sin(1/10000^0), cos(1/10000^0),
Transformer choice: Sinusoidal                    sin(1/10000^0.5), cos(1/10000^0.5)]
Reason: Generalization to longer sequences        β”‚
                                                  β–Ό
                                          Added to word embeddings:
                                          Final_input = Word_emb + Pos_emb
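A direct check of the d_model = 4 example: the exponent 2i/4 gives wavelength 10000^0 = 1 for the first pair of dimensions and 10000^0.5 = 100 for the second pair (values rounded for readability):

import numpy as np

def positional_encoding(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)); PE[pos, 2i+1] = cos(...)."""
    pos = np.arange(max_len)[:, None]
    two_i = np.arange(0, d_model, 2)[None, :]
    angles = pos / np.power(10000.0, two_i / d_model)
    PE = np.zeros((max_len, d_model))
    PE[:, 0::2], PE[:, 1::2] = np.sin(angles), np.cos(angles)
    return PE

print(np.round(positional_encoding(2, 4), 4))
# pos=0: [0, 1, 0, 1]
# pos=1: [sin(1)β‰ˆ0.8415, cos(1)β‰ˆ0.5403, sin(0.01)β‰ˆ0.01, cos(0.01)β‰ˆ1.0]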


Training Process                            Loss Computation
───────────────                            ─────────────────

Teacher Forcing Mode                       Cross-Entropy Loss
Input:  [<START>, "Hello", "world"]       Given target: ["Hello", "world", <END>]
Target: ["Hello", "world", <END>]         Predicted: [P₁, Pβ‚‚, P₃]
     β”‚                                           β”‚
     β–Ό                                           β–Ό
Parallel Training                          Loss = -Ξ£α΅’ log(Pα΅’[correct_tokenα΅’])
All positions computed simultaneously      Where Pα΅’ is probability distribution
(unlike RNN sequential)                    over vocabulary at position i
     β”‚                                           β”‚
     β–Ό                                           β–Ό
Decoder Output:                           Label Smoothing (Ξ΅=0.1):
Logits ∈ β„α΅›α΅’αΆœα΅ƒα΅‡ for each position       Instead of hard targets [0,0,1,0,...]
     β”‚                                   Use soft: [0.025, 0.025, 0.9, 0.025,...]
     β–Ό                                          β”‚
Softmax β†’ Probabilities                        β–Ό
P = softmax(logits)                       Improves generalization
     β”‚                                   Prevents overconfidence
     β–Ό
Compute loss with ground truth
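The label-smoothed loss above with Ξ΅ = 0.1 and, as in the example, a toy vocabulary of 5 tokens so each wrong token gets Ξ΅/(Vβˆ’1) = 0.025 of the probability mass:

import numpy as np

def label_smoothed_loss(log_probs, targets, eps=0.1):
    """Cross-entropy against smoothed targets: 1 - eps on the correct token,
    eps/(V-1) spread over the rest. log_probs: (T, V); targets: (T,) token ids."""
    T, V = log_probs.shape
    smooth = np.full((T, V), eps / (V - 1))
    smooth[np.arange(T), targets] = 1.0 - eps
    return -(smooth * log_probs).sum(axis=-1).mean()

rng = np.random.default_rng(0)
logits = rng.standard_normal((3, 5))                  # 3 positions, vocab of 5
log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
print(label_smoothed_loss(log_probs, np.array([2, 0, 4])))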


Inference Process (Autoregressive)         Step-by-Step Generation
─────────────────────────────              ──────────────────────

Different from Training!                   Step 1: Start token
No teacher forcing                         Input: [<START>]
     β”‚                                     Decoder output: P₁
     β–Ό                                     Greedy pick: w₁ = argmax(P₁)
Sequential Generation                             β”‚
Each step depends on previous                     β–Ό
     β”‚                                     Step 2: Extend sequence
     β–Ό                                     Input: [<START>, w₁]
Step 1:                                   Decoder output: Pβ‚‚
Input: [<START>]                          Greedy pick: wβ‚‚ = argmax(Pβ‚‚)
Output: w₁                                       β”‚
     β”‚                                           β–Ό
     β–Ό                                     Step 3: Continue...
Step 2:                                   Input: [<START>, w₁, wβ‚‚]
Input: [<START>, w₁]                      Output: w₃
Output: wβ‚‚                                       β”‚
     β”‚                                           β–Ό
     β–Ό                                     Until: <END> token generated
Continue until <END>                      or max_length reached
     β”‚
     β–Ό
Final sequence: [w₁, wβ‚‚, ..., wβ‚™]

Efficiency Note:
Can reuse previous computations
with KV-caching in practice
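The generation loop itself is short. A sketch with a hypothetical decoder_step callable standing in for a full Transformer decoder (it returns a probability distribution over the vocabulary given the tokens so far); greedy argmax is used, matching the steps above:

import numpy as np

START, END = 0, 1          # token ids for <START> and <END> (illustrative)

def greedy_decode(decoder_step, max_length=50):
    seq = [START]
    for _ in range(max_length):
        probs = decoder_step(seq)            # P_t over the vocabulary
        next_token = int(np.argmax(probs))   # w_t = argmax(P_t)
        seq.append(next_token)
        if next_token == END:                # stop at <END> or max_length
            break
    return seq[1:]                           # generated tokens w_1, w_2, ...

# toy stand-in decoder: emits token 2 three times, then <END>
def toy_step(seq):
    probs = np.ones(5)
    probs[END if len(seq) > 3 else 2] = 10.0
    return probs / probs.sum()

print(greedy_decode(toy_step))               # [2, 2, 2, 1]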





