State Space Models - 25


2023 - Mamba (Gu & Dao): selective state space model

2023 - RWKV (Peng et al.): RNN redesigned for transformer-style parallel training



RNN vs. Transformer vs. Mamba

Traditional RNN                              Transformer                                Mamba (Selective SSM)
─────────────                               ─────────────                              ──────────────────────
Sequential Processing                       Parallel Processing                        Selective Parallel Processing
h₁ β†’ hβ‚‚ β†’ ... β†’ hβ‚™                         All positions simultaneously              Content-aware selective scan
(SLOW, O(n) sequential steps)              (FAST, O(1) steps but O(nΒ²) memory)     (FAST, O(n) work via parallel scan, O(n) memory)
     β”‚                                           β”‚                                          β”‚
     β–Ό                                           β–Ό                                          β–Ό
Fixed State Compression                     No Compression (KV Cache)                Selective State Compression
Cannot filter content                       Stores everything                        Can focus/filter dynamically
     β”‚                                           β”‚                                          β”‚
     β–Ό                                           β–Ό                                          β–Ό
Linear Time Complexity                      Quadratic Time Complexity                Linear Time Complexity
But limited effectiveness                   Effective but expensive                  Both effective AND efficient
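
To make the memory column concrete, here is a toy NumPy sketch of what each architecture has to keep around while processing a sequence (n, d, N are made-up illustration values, not from any paper):

    import numpy as np

    n, d, N = 1000, 512, 16   # sequence length, model width, SSM state size

    rnn_state = np.zeros(d)           # fixed hidden state: size independent of n
    kv_cache  = np.zeros((n, 2, d))   # Transformer KV cache: grows linearly in n
                                      # (and attention itself touches O(nΒ²) pairs)
    ssm_state = np.zeros((d, N))      # Mamba state: fixed size like an RNN,
                                      # but updated with content-aware parameters

    print(rnn_state.size, kv_cache.size, ssm_state.size)  # 512 1024000 8192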


Efficiency vs. Effectiveness Tradeoff

Sequence Modeling Challenge
───────────────────────────

Efficiency (Speed/Memory)  ←──── TRADEOFF ────→  Effectiveness (Quality)
        β”‚                                              β”‚
        β–Ό                                              β–Ό
RNNs: O(n) time                                Transformers: O(nΒ²) time
Fixed compression                              Perfect information retention
Limited context                                Full context (within window)
        β”‚                                              β”‚
        └────────────── MAMBA SOLUTION β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                              β”‚
                              β–Ό
                    Selective Compression
                    O(n) time + Transformer quality


Traditional SSM (S4)                         Mamba Selective SSM
────────────────────                        ────────────────────

Time-Invariant Parameters                   Input-Dependent Parameters
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Ξ”, A, B, C = const β”‚                    β”‚ Ξ”(x), B(x), C(x)    β”‚
β”‚  for all timesteps  β”‚          β†’         β”‚ = f(input_content)  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
         β”‚                                           β”‚
         β–Ό                                           β–Ό
No Content Awareness                        Content-Aware Selection
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Cannot distinguish  β”‚                    β”‚ Can selectively     β”‚
β”‚ important vs noise  β”‚                    β”‚ focus/ignore based  β”‚
β”‚ information         β”‚                    β”‚ on content          β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
         β”‚                                           β”‚
         β–Ό                                           β–Ό
Global Convolution                          Parallel Scan + Hardware Fusion
O(n log n) computation                      O(n) computation
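
A minimal NumPy sketch of that difference, under toy shapes (W_delta, W_B, W_C are hypothetical projection weights introduced only for illustration):

    import numpy as np

    def softplus(z):
        return np.log1p(np.exp(z))

    rng = np.random.default_rng(0)
    d, N = 8, 4                        # toy model width and state size
    x_t = rng.standard_normal(d)       # current input token

    # S4-style: one fixed (Ξ”, B, C) shared by every timestep
    Delta_s4 = np.full(d, 0.1)
    B_s4, C_s4 = rng.standard_normal(N), rng.standard_normal(N)

    # Mamba-style: (Ξ”, B, C) are functions of the input itself
    W_delta = rng.standard_normal((d, d))
    W_B, W_C = rng.standard_normal((d, N)), rng.standard_normal((d, N))
    Delta_t = softplus(x_t @ W_delta)  # Ξ”(x): input-dependent step size
    B_t, C_t = x_t @ W_B, x_t @ W_C    # B(x), C(x): input-dependent projections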


Mamba Block Processing Pipeline
───────────────────────────────

Input Sequence x = [x₁, xβ‚‚, ..., xβ‚™]
         β”‚
         β–Ό
Step 1: Input Projections
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ x β†’ Linear projections          β”‚
β”‚ Generate: Ξ”, B, C parameters    β”‚
β”‚ Ξ”_t = softplus(Linear(x_t))     β”‚
β”‚ B_t = Linear_N(x_t)             β”‚  
β”‚ C_t = Linear_N(x_t)             β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
         β”‚
         β–Ό
Step 2: Selective Discretization  
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Convert continuous β†’ discrete   β”‚
β”‚ A_bar_t = exp(Ξ”_t * A)          β”‚
β”‚ B_bar_t = Ξ”_t * B_t             β”‚
β”‚ Now parameters adapt to input!  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
         β”‚
         β–Ό
Step 3: Selective State Update
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ h_t = A_bar_t * h_{t-1} +       β”‚
β”‚       B_bar_t * x_t             β”‚
β”‚ (Computed via parallel scan)    β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
         β”‚
         β–Ό
Step 4: Output Generation
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ y_t = C_t * h_t                 β”‚
β”‚ Selective output projection     β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
         β”‚
         β–Ό
Final Output: y = [y₁, yβ‚‚, ..., yβ‚™]
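
Putting the four steps together, a sequential NumPy reference of the recurrence might look like the sketch below (shapes and weight names are assumptions; the real Mamba kernel computes the same recurrence with a fused, hardware-aware parallel scan rather than a Python loop):

    import numpy as np

    def softplus(z):
        return np.log1p(np.exp(z))

    def selective_ssm(x, A, W_delta, W_B, W_C):
        # x: (n, d) inputs; A: (d, N) diagonal state matrix per channel.
        # W_delta: (d, d); W_B, W_C: (d, N) hypothetical projection weights.
        n, d = x.shape
        N = A.shape[1]
        h = np.zeros((d, N))                     # selective state
        ys = []
        for t in range(n):
            # Step 1: input-dependent parameters
            Delta = softplus(x[t] @ W_delta)     # (d,)
            B, C = x[t] @ W_B, x[t] @ W_C        # (N,) each
            # Step 2: selective discretization
            A_bar = np.exp(Delta[:, None] * A)   # (d, N)
            B_bar = Delta[:, None] * B[None, :]  # (d, N), simplified Euler rule
            # Step 3: state update (a parallel scan in the real kernel)
            h = A_bar * h + B_bar * x[t][:, None]
            # Step 4: output projection
            ys.append(h @ C)                     # (d,)
        return np.stack(ys)                      # (n, d)

    # Toy usage; negative-real A keeps the recurrence stable.
    rng = np.random.default_rng(0)
    n, d, N = 6, 4, 3
    y = selective_ssm(rng.standard_normal((n, d)),
                      -np.exp(rng.standard_normal((d, N))),
                      rng.standard_normal((d, d)),
                      rng.standard_normal((d, N)),
                      rng.standard_normal((d, N)))

Because each update is a first-order linear recurrence, consecutive steps compose associatively, which is exactly what makes the parallel-scan formulation possible.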


Performance Comparison

Model Architecture Comparison
──────────────────────────────

Metric              RNN        Transformer    Mamba
──────              ───        ───────────    ─────
Time Complexity     O(n)       O(nΒ²)          O(n)
Space Complexity    O(1)       O(nΒ²)          O(n)
Parallelizable      βœ—          βœ“              βœ“
Long Context        βœ—          βœ“              βœ“
Content Selectivity βœ—          Limited        βœ“
Hardware Efficient  βœ—          βœ—              βœ“

Performance Results (1.3B parameters, approximate):
───────────────────────────────────────────────────
Metric              RNN        Transformer    Mamba
──────              ───        ───────────    ─────
Language Modeling   8.5 PPL    8.0 PPL        8.0 PPL
Training Speed      1.0Γ—       1.0Γ—           1.1Γ—
Inference Speed     1.0Γ—       1.0Γ—           5.0Γ—
Memory Usage        Low        High           Medium


Traditional Transformer                    SSM (Mamba)
───────────────────────                    ───────────

Attention Mechanism:                      Selective State Space:
Attention(Q,K,V) = softmax(QK^T/√d)V      h_t = A_bar·h_{t-1} + B_bar·x_t
                                          y_t = CΒ·h_t

Complexity: O(nΒ²)                         Complexity: O(n)
Memory: O(nΒ²)                            Memory: O(n)
Context: Limited by memory               Context: Unbounded in principle (fixed-size state)
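
A back-of-envelope sketch of why these asymptotics matter at long context (toy per-layer cost formulae that ignore constants and projection costs):

    # Attention touches all nΒ² token pairs; the SSM does one
    # fixed-size state update per token.
    def attention_cost(n, d):
        return n * n * d      # O(nΒ²) score matrix

    def ssm_cost(n, d, N):
        return n * d * N      # O(n) scan

    for n in (1_000, 10_000, 100_000):
        print(f"n={n:>7}: attention {attention_cost(n, 512):.1e}  "
              f"ssm {ssm_cost(n, 512, 16):.1e}")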


Continuous-Time System (Theory)              Discrete-Time System (Practice)
───────────────────────────────              ────────────────────────────────

Differential Equations                       Difference Equations
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”             β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ h'(t) = Ah(t) + Bx(t)       β”‚      β†’      β”‚ h_t = A_barΒ·h_{t-1} + B_barΒ·x_t β”‚
β”‚ y(t) = Ch(t)                β”‚             β”‚ y_t = CΒ·h_t                     β”‚
β”‚                             β”‚             β”‚                                 β”‚
β”‚ Smooth, continuous flow     β”‚             β”‚ Step-by-step computation        β”‚
β”‚ Infinite precision          β”‚             β”‚ Digital computer friendly       β”‚
β”‚ Mathematical elegance       β”‚             β”‚ Practical implementation        β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜             β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Parameter Transformation:
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ A: State transition matrix    β†’    A_bar = exp(Δ·A)                     β”‚
β”‚ B: Input projection matrix    β†’    B_bar = (Δ·A)^(-1)(exp(Δ·A)-I)·Δ·B   β”‚
β”‚ C: Output projection matrix   β†’    C: Remains unchanged                 β”‚
β”‚ Ξ”: Discretization step size   β†’    Controls temporal resolution         β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
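
A small NumPy/SciPy sketch of this zero-order-hold transformation (it assumes A is invertible; Mamba itself uses a diagonal A, so the matrix exponential reduces to an element-wise exp and B_bar simplifies to Ξ”Β·B as in the pipeline above):

    import numpy as np
    from scipy.linalg import expm

    def discretize_zoh(A, B, delta):
        # A_bar = exp(Ξ”Β·A)
        # B_bar = (Ξ”Β·A)^(-1) (exp(Ξ”Β·A) - I) Β· Ξ”Β·B
        dA = delta * A
        A_bar = expm(dA)
        B_bar = np.linalg.solve(dA, A_bar - np.eye(A.shape[0])) @ (delta * B)
        return A_bar, B_bar

    # Toy stable system: eigenvalues of A are negative.
    A = np.array([[-0.5, 0.0],
                  [0.0, -1.0]])
    B = np.array([[1.0],
                  [0.5]])
    A_bar, B_bar = discretize_zoh(A, B, delta=0.1)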


Reservoir State Evolution Over Time
───────────────────────────────────
(Water-tank analogy: the hidden state h is the water level, x is the inflow, and y is a metered reading of the level.)

Time: t=0, t=1, t=2, t=3, ... (each step = Ξ” seconds)

t=0: Initial State
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ h_0 = 100 liters (starting) β”‚
β”‚ x_0 = 20 liters input       β”‚
β”‚ y_0 = C Γ— h_0 = output      β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
          ↓
t=1: First Update  
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ h_1 = A_barΓ—100 + B_barΓ—20  β”‚
β”‚ x_1 = 15 liters input       β”‚
β”‚ y_1 = C Γ— h_1 = output      β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
          ↓
t=2: Second Update
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ h_2 = A_barΓ—h_1 + B_barΓ—15  β”‚
β”‚ x_2 = 25 liters input       β”‚
β”‚ y_2 = C Γ— h_2 = output      β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
          ↓
Continue step by step...
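
The same run in a few lines of Python (A_bar, B_bar, C are made-up scalars chosen only to make the arithmetic concrete):

    A_bar, B_bar, C = 0.9, 1.0, 0.5   # keep 90% of the level, add inflow, read half

    h = 100.0                                        # h_0 = 100 liters
    for t, x in enumerate([20.0, 15.0, 25.0], start=1):
        h = A_bar * h + B_bar * x                    # h_t = A_barΒ·h_{t-1} + B_barΒ·x
        y = C * h                                    # y_t = CΒ·h_t
        print(f"t={t}: h = {h:.1f} liters, y = {y:.1f}")
    # t=1: h = 110.0, y = 55.0   (A_barΓ—100 + B_barΓ—20, as in the diagram)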

Core Magic: Ξ” (Time Step)
How often do we update?
β†’ Small Ξ”: frequent updates, more precise, more computation
β†’ Large Ξ”: fewer updates, more efficient, may lose details


Discretization Everywhere
─────────────────────────

🧠 Neural Networks & AI:
Continuous Theory β†’ Discrete Implementation
Perfect gradients β†’ Backpropagation steps
Infinite precision β†’ Digital computation

πŸŽ₯ Video Processing:
Smooth motion β†’ Frame-by-frame analysis
Continuous scenes β†’ Discrete timesteps
Real-time flow β†’ Processable chunks

πŸ’° Financial Systems:
Continuous markets β†’ Tick-by-tick updates
Instant prices β†’ Sampled data points
Perfect info β†’ Practical decisions

🌊 Engineering Systems:
Fluid dynamics β†’ Digital simulations
Smooth flows β†’ Computational fluid dynamics
Physical reality β†’ Numerical solutions





Notes
─────

  • HCI: chatbots + speech processing
  • Attention can be computed in parallel across positions
  • πŸ“ Memory: Transformers vs. RNN / LSTM










