State Space Models - 25
RNN vs. Mamba vs. Transformer
| | Traditional RNN | Transformer | Mamba (Selective SSM) |
|---|---|---|---|
| Processing | Sequential: h₁ → h₂ → ... → hₙ (slow, O(n) steps) | Parallel: all positions processed simultaneously (fast, O(1) steps but O(n²) memory) | Selective parallel: content-aware selective scan (fast, O(n) steps, O(n) memory) |
| State handling | Fixed state compression; cannot filter content | No compression (KV cache); stores everything | Selective state compression; can focus/filter dynamically |
| Time complexity | Linear, but limited effectiveness | Quadratic; effective but expensive | Linear; both effective AND efficient |
Efficiency vs. Effectiveness Tradeoff
The sequence modeling challenge is a tradeoff between efficiency (speed/memory) and effectiveness (quality):

- RNNs: O(n) time, fixed compression, limited context
- Transformers: O(n²) time, perfect information retention, unlimited context (within the window)
- Mamba's solution: selective compression - O(n) time with Transformer-level quality
Traditional SSM (S4) vs. Mamba Selective SSM

| | Traditional SSM (S4) | Mamba (Selective SSM) |
|---|---|---|
| Parameters | Time-invariant: Δ, A, B, C are constant for all timesteps | Input-dependent: Δ(x), B(x), C(x) are functions of the input content |
| Content awareness | None - cannot distinguish important information from noise | Content-aware selection - can selectively focus on or ignore inputs based on content |
| Computation | Global convolution, O(n log n) | Parallel scan + hardware-aware kernel fusion, O(n) |
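The contrast is easiest to see in code. Below is a minimal numpy sketch (not the official implementation) of how an S4-style layer shares a single Δ, B, C across all timesteps while a Mamba-style layer computes Δ_t, B_t, C_t from each token. The softplus parameterization of Δ follows the pipeline description below; the dimensions and weight names (`W_delta`, `W_B`, `W_C`) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_state, seq_len = 4, 8, 6
x = rng.standard_normal((seq_len, d_model))      # toy input sequence

def softplus(z):
    return np.log1p(np.exp(z))

# --- S4-style: one (delta, B, C) shared by every timestep --------------------
delta_s4 = softplus(rng.standard_normal())       # single fixed step size
B_s4 = rng.standard_normal(d_state)              # fixed input projection
C_s4 = rng.standard_normal(d_state)              # fixed output projection

# --- Mamba-style: (delta_t, B_t, C_t) computed from each token x_t -----------
W_delta = rng.standard_normal(d_model)           # hypothetical projection weights
W_B = rng.standard_normal((d_model, d_state))
W_C = rng.standard_normal((d_model, d_state))

delta_t = softplus(x @ W_delta)                  # (seq_len,)  one step size per token
B_t = x @ W_B                                    # (seq_len, d_state)
C_t = x @ W_C                                    # (seq_len, d_state)

print("S4 delta, same for every token:", round(float(delta_s4), 3))
print("Mamba delta, one per token:    ", np.round(delta_t, 3))
```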
Mamba Block Processing Pipeline
Input sequence: x = [x₁, x₂, ..., xₙ]

Step 1: Input Projections
- x passes through linear projections that generate the Δ, B, C parameters:
  - Δ_t = softplus(Linear(x_t))
  - B_t = Linear_N(x_t)
  - C_t = Linear_N(x_t)

Step 2: Selective Discretization
- Convert the continuous parameters to discrete ones:
  - A_bar_t = exp(Δ_t · A)
  - B_bar_t = Δ_t · B_t
- The parameters now adapt to the input.

Step 3: Selective State Update
- h_t = A_bar_t · h_{t-1} + B_bar_t · x_t (computed via a parallel scan)

Step 4: Output Generation
- y_t = C_t · h_t (selective output projection)

Final output: y = [y₁, y₂, ..., yₙ]
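The sketch below strings the four steps together in numpy, assuming a diagonal A (one scan per channel) and using a plain sequential loop in place of the fused parallel scan that real Mamba kernels implement. All dimensions and weights are random stand-ins, not trained parameters.

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, d_state, seq_len = 4, 8, 10
x = rng.standard_normal((seq_len, d_model))        # input sequence

def softplus(z):
    return np.log1p(np.exp(z))

# Learned parameters (random stand-ins for this sketch)
A = -np.abs(rng.standard_normal((d_model, d_state)))   # kept negative so exp(delta*A) <= 1
W_delta = rng.standard_normal((d_model, d_model))      # one step size per channel
W_B = rng.standard_normal((d_model, d_state))
W_C = rng.standard_normal((d_model, d_state))

h = np.zeros((d_model, d_state))                   # fixed-size state, independent of seq_len
y = np.zeros((seq_len, d_model))
for t in range(seq_len):
    # Step 1: input projections -> delta_t, B_t, C_t from the current token
    delta_t = softplus(x[t] @ W_delta)             # (d_model,)
    B_t = x[t] @ W_B                               # (d_state,)
    C_t = x[t] @ W_C                               # (d_state,)

    # Step 2: selective discretization (simplified B_bar = delta * B, as above)
    A_bar = np.exp(delta_t[:, None] * A)           # elementwise, since A is diagonal per channel
    B_bar = delta_t[:, None] * B_t[None, :]        # (d_model, d_state)

    # Step 3: selective state update
    h = A_bar * h + B_bar * x[t][:, None]

    # Step 4: output generation
    y[t] = h @ C_t                                 # (d_model,)

print("output shape:", y.shape)
print("y_n:", np.round(y[-1], 3))
```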
Performance Comparison
Model architecture comparison:

| Metric | RNN | Transformer | Mamba |
|---|---|---|---|
| Time complexity | O(n) | O(n²) | O(n) |
| Space complexity | O(1) | O(n²) | O(n) |
| Parallelizable | No | Yes | Yes |
| Long context | No | Limited by memory | Yes |
| Content selectivity | No | Limited | Yes |
| Hardware efficient | No | Yes | Yes |

Performance results (1.3B parameters):

| Metric | RNN | Transformer | Mamba |
|---|---|---|---|
| Language modeling | 8.5 PPL | 8.0 PPL | 8.0 PPL |
| Training speed | 1.0× | 1.0× | 1.1× |
| Inference speed | 1.0× | 1.0× | 5.0× |
| Memory usage | Low | High | Medium |
Traditional Transformer vs. SSM (Mamba)

| | Traditional Transformer | SSM (Mamba) |
|---|---|---|
| Core mechanism | Attention: Attention(Q, K, V) = softmax(QKᵀ/√d)·V | Selective state space: h_t = A_bar·h_{t-1} + B_bar·x_t, y_t = C·h_t |
| Complexity | O(n²) | O(n) |
| Memory | O(n²) | O(n) |
| Context | Limited by memory | Unlimited |
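A small numpy sketch of what each mechanism has to hold in memory per layer: attention materializes an n × n score matrix, while the SSM recurrence only ever carries a fixed-size hidden state. The sizes and parameter values below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, d_state = 512, 64, 16

# --- Attention: softmax(Q K^T / sqrt(d)) V -> O(n^2) score matrix ------------
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
scores = Q @ K.T / np.sqrt(d)                       # (n, n)  -> quadratic memory
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)      # row-wise softmax
attn_out = weights @ V                              # (n, d)

# --- SSM recurrence: h_t = A_bar h_{t-1} + B_bar x_t, y_t = C h_t ------------
x = rng.standard_normal(n)
A_bar = np.full(d_state, 0.9)                       # toy fixed parameters
B_bar = rng.standard_normal(d_state)
C = rng.standard_normal(d_state)
h = np.zeros(d_state)                               # constant-size state
ssm_out = np.empty(n)
for t in range(n):
    h = A_bar * h + B_bar * x[t]
    ssm_out[t] = C @ h

print("attention score matrix:", scores.shape)      # (512, 512)
print("attention output:      ", attn_out.shape)    # (512, 64)
print("SSM state carried between steps:", h.shape)  # (16,)
```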
Continuous-Time System (Theory) vs. Discrete-Time System (Practice)

| | Continuous-time (differential equations) | Discrete-time (difference equations) |
|---|---|---|
| Dynamics | h'(t) = A·h(t) + B·x(t); y(t) = C·h(t) | h_t = A_bar·h_{t-1} + B_bar·x_t; y_t = C·h_t |
| Character | Smooth, continuous flow; infinite precision; mathematical elegance | Step-by-step computation; digital-computer friendly; practical implementation |

Parameter transformation (zero-order hold discretization):

- A (state transition matrix): A_bar = exp(Δ·A)
- B (input projection matrix): B_bar = (Δ·A)⁻¹·(exp(Δ·A) − I)·Δ·B
- C (output projection matrix): remains unchanged
- Δ (discretization step size): controls the temporal resolution
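The transformation can be checked numerically. The sketch below applies the zero-order-hold formulas to a small random A and B (using scipy's matrix exponential); for small Δ the exact B_bar comes out close to the simpler Δ·B approximation used in the Mamba pipeline above. The matrices are examples, not trained parameters.

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(3)
d_state = 4
delta = 0.1                                          # discretization step size
A = -np.eye(d_state) + 0.1 * rng.standard_normal((d_state, d_state))  # stable-ish example A
B = rng.standard_normal((d_state, 1))

dA = delta * A
A_bar = expm(dA)                                     # exp(delta * A)
# B_bar = (delta*A)^{-1} (exp(delta*A) - I) * delta*B
B_bar = np.linalg.solve(dA, A_bar - np.eye(d_state)) @ (delta * B)

print("A_bar:\n", np.round(A_bar, 3))
print("B_bar vs. the Euler approximation delta*B:\n",
      np.round(np.c_[B_bar, delta * B], 4))
```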
Reservoir State Evolution Over Time
Think of the state as the water level in a reservoir, updated once per step (each step = Δ seconds), at t = 0, 1, 2, 3, ...:

- t = 0 (initial state): h_0 = 100 liters (starting level), input x_0 = 20 liters, output y_0 = C × h_0
- t = 1 (first update): h_1 = A_bar × 100 + B_bar × 20, input x_1 = 15 liters, output y_1 = C × h_1
- t = 2 (second update): h_2 = A_bar × h_1 + B_bar × 15, input x_2 = 25 liters, output y_2 = C × h_2
- ... and so on, step by step.
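A runnable version of the reservoir walkthrough. The values of A_bar, B_bar, and C are made-up scalars, since the text above only fixes h_0 and the inflows: A_bar = 0.9 means 10% of the water drains each step, B_bar = 1.0 adds the inflow as-is, and C = 0.5 reads out half of the current level.

```python
A_bar, B_bar, C = 0.9, 1.0, 0.5          # assumed example parameters
h = 100.0                                # h_0 = 100 liters (starting level)
inflows = [20.0, 15.0, 25.0]             # x_0, x_1, x_2 in liters

print(f"t=0: h={h:.2f}, y={C * h:.2f}")
for t, x in enumerate(inflows, start=1):
    # previous level decays by A_bar, then the new inflow (scaled by B_bar) is added
    h = A_bar * h + B_bar * x
    print(f"t={t}: h={h:.2f}, y={C * h:.2f}")
```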
Core Magic: Δ (Time Step)

How often do we update?
- Small Δ: frequent updates, more precise, heavier computation
- Large Δ: fewer updates, more efficient, may lose details
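A tiny numerical illustration of this tradeoff, using an arbitrary decay a = -1.0: with A_bar = exp(Δ·a), a small Δ lets the state persist across many updates, while a large Δ makes it fade almost immediately.

```python
import numpy as np

a = -1.0                                  # arbitrary negative decay rate
for delta in (0.1, 1.0):
    A_bar = np.exp(delta * a)
    # fraction of the original state that survives after 10 update steps
    print(f"delta={delta}: A_bar={A_bar:.3f}, after 10 steps -> {A_bar**10:.4f}")
```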
The same continuous-to-discrete step shows up across domains:

Neural Networks & AI:
- Continuous theory → discrete implementation
- Perfect gradients → backpropagation steps
- Infinite precision → digital computation

Video Processing:
- Smooth motion → frame-by-frame analysis
- Continuous scenes → discrete timesteps
- Real-time flow → processable chunks

Financial Systems:
- Continuous markets → tick-by-tick updates
- Instant prices → sampled data points
- Perfect information → practical decisions

Engineering Systems:
- Fluid dynamics → digital simulations
- Smooth flows → computational fluid dynamics
- Physical reality → numerical solutions
References
- HCI: chatbots + speech processing
- Partially parallel attention computation
- Memory: Transformers vs. RNN / LSTM
- 2024 - You Only Cache Once: Decoder-Decoder Architectures for Language Models
- RetNet (Retention Network → gated retention)
  - 2023 - Retentive Network: A Successor to Transformer for Large Language Models
- DeltaNet
  - 2024 - Parallelizing Linear Transformers with the Delta Rule over Sequence Length