AI Model Structures - 25

Welcome,

Let’s take a look at the history of the model structures we’re using today.

1. Why 「Deep Structures」

Compared with classical machine learning models:

  • Linear Regression / SVM / Shallow Decision Trees

Deep structures refer to neural networks with Multiple Layers of Nonlinear Transformations.


📍 Content

  • Transformer
  • BERT
  • Mamba
  • GPT
  • Tokenization
  • ARIMA
  • RNN
  • Diffusion Models
  • Flow Matching
  • TensorRT
  • Quantization - LoRA + QLoRA 📍


These deep models are capable of Learning Hierarchical Features, where each layer captures increasingly abstract representations of the data.


Local Minima vs. Saddle Points: in high-dimensional loss landscapes, most zero-gradient critical points are saddle points rather than poor local minima, which is part of why gradient-based training of deep networks works better than early theory feared.
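A minimal worked example of the distinction:

$$ f(x, y) = x^2 - y^2, \qquad \nabla f = (2x,\ -2y), \qquad H = \begin{pmatrix} 2 & 0 \\ 0 & -2 \end{pmatrix} $$

At the origin the gradient vanishes, but the Hessian eigenvalues (2 and −2) have mixed signs, so the point is a saddle, not a local minimum; a local minimum requires all eigenvalues to be positive.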


Knowledge Map


In practice, “Deep” means:

  • More than 3–4 layers in fully connected networks

  • 10+ layers in convolutional networks

  • Dozens of layers in modern transformers like GPT and BERT (24 in BERT-large, 96 in GPT-3)



2. Key Tech

  • MLP
    • Multilayer Perceptron
    • Feedforward fully connected networks
    • Used in classification, regression, or small-scale tabular/audio tasks
    • 1989 - Universal Approximation Theorem / still used as a lightweight head in multimodal systems (minimal sketch below)
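A hedged PyTorch sketch of a small MLP; the layer widths here are illustrative placeholders, not from any particular system:

```python
import torch
import torch.nn as nn

# Minimal MLP: stacked linear layers with nonlinearities in between.
# Widths (64 -> 128 -> 128 -> 10) are illustrative placeholders.
mlp = nn.Sequential(
    nn.Linear(64, 128),
    nn.ReLU(),            # the nonlinearity is what makes extra layers useful
    nn.Linear(128, 128),
    nn.ReLU(),
    nn.Linear(128, 10),   # e.g. 10-class logits for classification
)

x = torch.randn(32, 64)   # batch of 32 samples, 64 features each
logits = mlp(x)           # shape: (32, 10)
```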

  • CNN
    • Convolutional Neural Networks
    • When inputs are images or grid-like data
    • Extracts spatial features, widely used in image/audio tasks
    • Fully Connected Layer -> Receptive Field -> Parameter Sharing -> Convolutional Layer
    • 1998 - LeNet / 2012 - AlexNet: ImageNet Classification with Deep Convolutional Neural Networks
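A hedged sketch of that chain in PyTorch: each Conv2d looks at a small receptive field and shares its kernel weights across all spatial positions (channel counts are illustrative):

```python
import torch
import torch.nn as nn

# Minimal CNN: each Conv2d scans a 3x3 receptive field and reuses
# the same kernel weights everywhere (parameter sharing).
cnn = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),  # RGB image -> 16 feature maps
    nn.ReLU(),
    nn.MaxPool2d(2),                             # downsample; enlarges the effective receptive field
    nn.Conv2d(16, 32, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),                     # global pooling: one value per feature map
    nn.Flatten(),
    nn.Linear(32, 10),                           # classification head
)

x = torch.randn(8, 3, 32, 32)  # batch of 8 RGB 32x32 images
logits = cnn(x)                # shape: (8, 10)
```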


  • Diffusion Based
    • Gradual denoising process to generate samples from noise
    • Currently SoTA in image and speech generation
    • Training is stable; generation is slow, since sampling runs many sequential denoising steps
    • In Diffusion
      • The model learns to reverse noise through a pre-defined noise schedule
      • It does not evaluate or penalize each intermediate step
      • There is no “fitness score” like in genetic algorithms
    • In genetic algorithms - GA
      • Each candidate in every generation is scored by an explicit fitness function before selection
      • Diffusion has no such per-candidate evaluation; it only learns to regress the added noise (minimal training sketch below)
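A minimal sketch of one DDPM-style training step, assuming a linear noise schedule with 1000 steps and a hypothetical `model(x_t, t)` that predicts the added noise (both are assumptions for illustration):

```python
import torch
import torch.nn.functional as F

# Pre-defined noise schedule (assumption: 1000 steps, DDPM-style linear betas).
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def training_step(model, x0):
    """One denoising-training step: noise x0 at a random timestep t,
    then train the model to predict that noise (no fitness score involved)."""
    b = x0.shape[0]
    t = torch.randint(0, T, (b,))
    noise = torch.randn_like(x0)
    a = alphas_cumprod[t].view(b, *([1] * (x0.dim() - 1)))
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * noise  # forward (noising) process
    pred = model(x_t, t)                          # hypothetical noise-prediction model
    return F.mse_loss(pred, noise)                # simple regression loss
```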


3. Some Norms and Their Nature

CTC - Connectionist Temporal Classification - is a loss function for sequence tasks where the input and output lengths don’t match, such as speech-to-text.

  • You don’t need exact alignment between audio frames and text
  • CTC learns to map long input sequences (e.g. 1000 audio frames) to short outputs (e.g. “hello”)
  • It introduces a special blank token (∅) to allow flexible alignment
  • The model can output repeated characters + blanks, and CTC will collapse them into the final label

  Input frames: [x1, x2, x3, x4, x5, x6, x7, x8]
  Model output: ∅ h ∅ e l ∅ l o
  CTC collapse: → “hello”
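The collapse rule itself is tiny; a minimal sketch in Python, with the blank written as “∅” to match the example above:

```python
BLANK = "∅"

def ctc_collapse(tokens):
    """Collapse a CTC path: merge adjacent repeats, then drop blanks."""
    out = []
    prev = None
    for tok in tokens:
        if tok != prev and tok != BLANK:
            out.append(tok)
        prev = tok
    return "".join(out)

print(ctc_collapse(list("∅h∅el∅lo")))   # -> "hello"
print(ctc_collapse(list("hhello")))     # -> "helo": adjacent repeats merge,
                                        #    so a blank must separate real double letters
```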



📍 Why an LSTM / other RNN Layer after a Self-Attention Layer

  • Streamability
  • Positional bias
  • Smoothing
  • Lightweight after quantization
  • Distillation bridge
  • TL;DR - Attention offers Global Context; the follow-up LSTM supplies Sequential Inertia, Latency Control, and Quantization-friendly compression, ideal for hearing-aid ASR (see the sketch below)
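A hedged sketch of the pattern in PyTorch; the module name and sizes are hypothetical, chosen only to show the wiring (attention for global context, then a unidirectional, stream-friendly LSTM):

```python
import torch
import torch.nn as nn

class AttentionThenLSTM(nn.Module):
    """Hypothetical hybrid block: self-attention supplies global context;
    the unidirectional LSTM adds sequential inertia and streams causally."""
    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)
        self.lstm = nn.LSTM(d_model, d_model, batch_first=True)

    def forward(self, x):
        # x: (batch, time, d_model) acoustic feature frames
        a, _ = self.attn(x, x, x)   # global context over the window
        h = self.norm(x + a)        # residual + norm around attention
        y, _ = self.lstm(h)         # smooths frame-to-frame, left-to-right only
        return y

x = torch.randn(2, 100, 256)        # 2 utterances, 100 frames each
y = AttentionThenLSTM()(x)          # shape: (2, 100, 256)
```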



