AI Model Structures - 25
Welcome,
Let’s take a look at the history of the model structures we’re using today.
1. Why “Deep Structure”
Compared with classical machine learning models:
- Linear Regression / SVM / Shallow Decision Trees
Deep structures refer to neural networks with Multiple Layers of Nonlinear Transformations.
📍 Content
- Transformer
- BERT
- Mamba
- GPT
- Tokenization
- ARIMA
- RNN
- Diffusion Models
- Flow Matching
- TensorRT
- Quantization - LoRA + QLoRA 📍
These deep models are capable of Learning Hierarchical Features, where each layer captures increasingly abstract representations of the data.
Local Minima vs. Saddle Points
In practice, “Deep” means:
- More than 3–4 layers in fully connected networks
- 10+ layers in convolutional networks
- Or even hundreds of layers in modern transformers like GPT and BERT
2. Key Tech
- MLP
- Multilayer Perceptron
- Feedforward fully connected networks
- Used in classification, regression, or small-scale tabular/audio tasks
- 1989 - Universal Approximation Theorem / Still used as a lightweight head in multimodal systems
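A minimal PyTorch sketch of such a feedforward stack (layer sizes and the 10-class head are illustrative, not from these notes):

```python
import torch
import torch.nn as nn

# Each Linear + ReLU pair is one nonlinear transformation; stacking
# several of them is what makes the network "deep".
mlp = nn.Sequential(
    nn.Linear(64, 128), nn.ReLU(),
    nn.Linear(128, 128), nn.ReLU(),
    nn.Linear(128, 128), nn.ReLU(),
    nn.Linear(128, 10),              # e.g. a 10-class classification head
)

x = torch.randn(32, 64)              # a batch of 32 tabular feature vectors
logits = mlp(x)
print(logits.shape)                  # torch.Size([32, 10])
```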
- RNN -> LSTM
- When inputs are sequences
- Hochreiter & Schmidhuber 1997 - LSTM
- CNN
- Convolutional Neural Networks
- When inputs are images or grid-like data
- Extracts spatial features, widely used in image/audio tasks
- Fully Connected Layer -> Receptive Field -> Parameter Sharing -> Convolutional Layer
- 1998 - LeNet / 2012 - AlexNet: ImageNet Classification with Deep Convolutional Neural Networks
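A rough illustration of the Fully Connected Layer -> Receptive Field -> Parameter Sharing -> Convolutional Layer idea, as a hedged PyTorch sketch (channel counts and kernel size are invented for the example):

```python
import torch
import torch.nn as nn

# One 3x3 convolution: each output pixel sees only a 3x3 receptive field,
# and the same kernel weights are shared across all spatial positions.
conv = nn.Conv2d(in_channels=16, out_channels=32, kernel_size=3, padding=1)

x = torch.randn(8, 16, 64, 64)       # a batch of 8 feature maps
y = conv(x)                          # shape: (8, 32, 64, 64)

conv_params = conv.weight.numel() + conv.bias.numel()    # 4,640
dense_params = (16 * 64 * 64) * (32 * 64 * 64)           # a comparable dense layer
print(conv_params, "convolutional parameters vs", dense_params, "dense parameters")
```

Parameter sharing is what keeps the convolutional count tiny compared with a dense layer over the whole image.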
- Transformer
- When inputs are sequences
- Self-attention + Parallel computation
- 2015 ICLR - Neural Machine Translation by Jointly Learning to Align and Translate - Additive Attention
- 2017 NeurIPS - Attention Is All You Need - Self-Attention / Scaled Dot-Product Attention
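A minimal sketch of scaled dot-product self-attention from Attention Is All You Need (single head, no masking or learned projections; shapes are illustrative):

```python
import math
import torch

def scaled_dot_product_attention(q, k, v):
    # q, k, v: (batch, seq_len, d_k)
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))   # (batch, seq, seq)
    weights = torch.softmax(scores, dim=-1)                    # rows sum to 1
    return weights @ v                                         # (batch, seq, d_k)

x = torch.randn(2, 5, 64)                     # a toy sequence of 5 tokens
out = scaled_dot_product_attention(x, x, x)   # self-attention: q = k = v = x
print(out.shape)                              # torch.Size([2, 5, 64])
```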
- 📍 Mamba
- Linear-Time Sequence Modeling
- State Space Model - SSM - with selective long-range memory
- 2023 - Mamba: Linear-Time Sequence Modeling with Selective State Spaces
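A very rough sketch of the plain linear state-space recurrence that Mamba builds on; it leaves out Mamba’s selective (input-dependent) parameters and hardware-aware scan, and only shows why the cost is linear in sequence length (all matrices and sizes are made up):

```python
import torch

# Discretized linear SSM: h_t = A h_{t-1} + B x_t,  y_t = C h_t.
# One fixed-size state update per token -> O(sequence length) total cost.
d_state, d_model, seq_len = 16, 4, 100
A = 0.9 * torch.eye(d_state)                 # illustrative stable transition
B = 0.1 * torch.randn(d_state, d_model)
C = 0.1 * torch.randn(d_model, d_state)

x = torch.randn(seq_len, d_model)
h = torch.zeros(d_state)
ys = []
for t in range(seq_len):                     # a linear-time scan over the sequence
    h = A @ h + B @ x[t]
    ys.append(C @ h)
y = torch.stack(ys)
print(y.shape)                               # torch.Size([100, 4])
```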
- BERT
- Bidirectional Encoder Representations from Transformers
- using Masked language modeling
- 2019 - BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
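Masked language modeling in action via the Hugging Face transformers fill-mask pipeline (assumes transformers is installed and the bert-base-uncased checkpoint can be downloaded):

```python
from transformers import pipeline

# BERT is pre-trained to predict [MASK] tokens from both left and right
# context -- exactly what the fill-mask pipeline exposes.
unmasker = pipeline("fill-mask", model="bert-base-uncased")
for candidate in unmasker("Speech recognition is a [MASK] learning problem."):
    print(candidate["token_str"], round(candidate["score"], 3))
```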
- Conformer
- Convolution + Transformer = Conformer
- Combines local - CNN - and Global Self-attention Features
- Widely used in speech recognition tasks
- 2020 - Conformer: Convolution-augmented Transformer for Speech Recognition
- GAN
- Generator vs Discriminator
- Generates Images, Audio
- Popular in TTS, audio enhancement, and image generation
- 2014 - Generative Adversarial Nets
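A compact sketch of the generator-vs-discriminator game on toy vectors (sizes are invented; a real image GAN would use convolutional blocks and alternating optimizer steps):

```python
import torch
import torch.nn as nn

# Generator maps noise to fake samples; discriminator scores real vs fake.
G = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 32))
D = nn.Sequential(nn.Linear(32, 64), nn.LeakyReLU(0.2), nn.Linear(64, 1))
bce = nn.BCEWithLogitsLoss()

real = torch.randn(8, 32)                    # stand-in for real data
fake = G(torch.randn(8, 16))

# Discriminator: push real -> 1, fake -> 0 (fake detached so G isn't updated here)
d_loss = bce(D(real), torch.ones(8, 1)) + bce(D(fake.detach()), torch.zeros(8, 1))
# Generator: try to make D output 1 on fakes
g_loss = bce(D(fake), torch.ones(8, 1))
print(d_loss.item(), g_loss.item())
```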
- Diffusion Models
- Gradual denoising process to generate samples from noise
- Currently SoTA in image and speech generation
- Training is stable, generation is slow
- In Diffusion
- The model learns to reverse noise through a pre-defined noise schedule
- It does not evaluate or penalize each intermediate step
- There is no “fitness score” like in genetic algorithms
- In genetic algorithms - GA
- Every candidate (individual) is evaluated using a fitness function
- Poor candidates are penalized or discarded
- 2020 - Denoising Diffusion Probabilistic Models
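A hedged sketch of the DDPM training step: noise the data at a random timestep according to a pre-defined schedule and train a network to predict that noise; the placeholder MLP denoiser and all sizes are illustrative (real models use U-Nets or transformers):

```python
import torch
import torch.nn as nn

T = 1000
betas = torch.linspace(1e-4, 0.02, T)            # pre-defined noise schedule
alpha_bar = torch.cumprod(1.0 - betas, dim=0)    # cumulative signal retention

denoiser = nn.Sequential(nn.Linear(32 + 1, 128), nn.ReLU(), nn.Linear(128, 32))

x0 = torch.randn(16, 32)                         # stand-in for clean data
t = torch.randint(0, T, (16,))                   # a random timestep per sample
noise = torch.randn_like(x0)

# Forward (noising) process: x_t = sqrt(a_bar_t) * x_0 + sqrt(1 - a_bar_t) * eps
a = alpha_bar[t].unsqueeze(1)
x_t = a.sqrt() * x0 + (1 - a).sqrt() * noise

# The reverse process is learned by predicting the added noise;
# there is no per-step fitness score, just this regression loss.
pred = denoiser(torch.cat([x_t, t.unsqueeze(1).float() / T], dim=1))
loss = nn.functional.mse_loss(pred, noise)
print(loss.item())
```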
- 📍 SSL
- Learns from unlabeled data by solving pretext tasks
- Strong performance in low-resource and zero-shot setups
- 2020 - wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations
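A generic pretext-task sketch (masked-frame reconstruction on unlabeled audio features); this is not wav2vec 2.0’s contrastive objective, just an illustration of learning from data without transcripts:

```python
import torch
import torch.nn as nn

encoder = nn.GRU(input_size=40, hidden_size=128, batch_first=True)
head = nn.Linear(128, 40)

feats = torch.randn(4, 100, 40)              # unlabeled audio features (e.g. filterbanks)
mask = torch.rand(4, 100, 1) < 0.15          # hide ~15% of the frames
masked_in = feats.masked_fill(mask, 0.0)

hidden, _ = encoder(masked_in)
recon = head(hidden)

# Pretext task: reconstruct only the hidden frames -- no transcripts needed.
sel = mask.expand_as(feats)
loss = nn.functional.mse_loss(recon[sel], feats[sel])
print(loss.item())
```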
- 📍 MEMORY - Transformers vs. RNN / LSTM
- Add Reflection - 2024 - You Only Cache Once: Decoder-Decoder Architectures for Language Models
- RetNet - Retention Network -> Gated Retention
- 2023 - Retentive Network: A Successor to Transformer for Large Language Models
- DeltaNet - 2024 - Parallelizing Linear Transformers with the Delta Rule over Sequence Length
3. Some Norms and Their Nature
CTC - Connectionist Temporal Classification - is a loss function used for sequence tasks where input and output lengths don’t match — like speech-to-text
- You don’t need exact alignment between audio frames and text
- CTC learns to map long input sequences (e.g. 1000 audio frames) to short outputs (e.g. “hello”)
- It introduces a special blank token - ∅ to allow flexible alignment
- The model can output repeated characters + blanks, and CTC will collapse them into the final label
Example:
- Input frames: [x1, x2, x3, x4, x5, x6, x7, x8]
- Model output: h h e l ∅ l l o
- CTC collapse (merge repeats, then drop blanks): → “hello”
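A small Python sketch of the collapse rule; the frame sequence matches the example above:

```python
BLANK = "∅"   # the special CTC blank token

def ctc_collapse(frames):
    """Merge consecutive repeats, then remove blanks (the CTC collapse rule)."""
    out, prev = [], None
    for sym in frames:
        if sym != prev and sym != BLANK:
            out.append(sym)
        prev = sym
    return "".join(out)

# 8 model outputs collapse to a 5-character label; the blank between the
# two l's is what keeps them from being merged into one.
print(ctc_collapse(["h", "h", "e", "l", BLANK, "l", "l", "o"]))  # -> "hello"
```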
📍 Why LSTM / other RNN Layer after Self-Attention Layer
- Streamability
- Positional bias
- Smoothing
- Lightweight after quantization
- Distillation bridge
- TLDR - Attention offers Global Context; the follow-up LSTM supplies Sequential Inertia, Latency Control, and Quantization-friendly compression - ideal for hearing-aid ASR
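A hedged PyTorch sketch of this pattern: a self-attention layer for global context followed by an LSTM that can be stepped frame by frame; dimensions are illustrative, and this is not any specific hearing-aid ASR model:

```python
import torch
import torch.nn as nn

class AttentionThenLSTM(nn.Module):
    def __init__(self, d_model=128, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.lstm = nn.LSTM(d_model, d_model, batch_first=True)

    def forward(self, x):
        # Global context from self-attention ...
        ctx, _ = self.attn(x, x, x)
        # ... then a recurrent pass that adds sequential inertia and can be
        # run frame by frame at inference time (streaming-friendly).
        out, _ = self.lstm(ctx)
        return out

x = torch.randn(2, 50, 128)          # (batch, frames, features)
print(AttentionThenLSTM()(x).shape)  # torch.Size([2, 50, 128])
```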
4. Some References