Generative Models - 26

Welcome,

Let’s take a look at the history of generative models for science.



1. References

1.1 2024 NeurIPS tutorial - Flow Matching for Generative Modeling x 2024 Flow Matching Guide and Code

1.2 2023 Generative Models: An Interdisciplinary Perspective

1.3 2021 Deep Generative Modelling: A Comparative Review of VAEs, GANs, Normalizing Flows, Energy-Based and Autoregressive Models



2. Optical Experiments Simulation by Diffusion Models

2.1 2025 Tunable Optimally-Coded Snapshot Hyperspectral Imaging for Scene Adaptation

2.2 2023 Binarized Spectral Compressive Imaging

2.3 2022 Mask-Guided Spectral-Wise Transformer for Efficient Hyperspectral Image Reconstruction



3. Technical History of Today’s GenAI Models



4. A Code Sample from NeurIPS 2023 - Binarized Spectral Compressive Imaging

4.1 The GitHub project: A Toolbox for Binarized Spectral Compressive Imaging



5. Others

Topics


1. Multimodal Alignment

  • cross-attention
  • contrastive learning
  • CLIP - ViT + Transformer


2. Extreme Sparsity Self-Supervised Learning


Global Attention: O((H·W)²) - Every pixel attends to every pixel
Mamba Scans: O(H·W) - Linear recurrence per scan direction
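
As a toy sketch of this asymptotic gap (illustrative only, not the actual Mamba selective-scan kernel), the attention path below materializes an (H·W) × (H·W) score matrix, while the directional scan visits each of the H·W tokens once:

```python
import torch

def global_attention(x):
    """Quadratic self-attention over N = H*W flattened pixels: O(N^2)."""
    # The (N, N) score matrix is what makes this quadratic in pixel count.
    scores = (x @ x.T) / x.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ x

def directional_scan(x, decay=0.9):
    """Linear recurrence h_t = decay * h_{t-1} + x_t over one scan order: O(N)."""
    h = torch.zeros_like(x[0])
    out = []
    for x_t in x:                      # a single pass over the N pixels
        h = decay * h + x_t
        out.append(h)
    return torch.stack(out)
```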



Flow Matching Loss


Purpose
Enforce temporal smoothness by aligning latent representations of adjacent frames

Definition

\[\mathcal{L}_{\mathrm{flow}} = \bigl\lVert z_{t+1} - \mathrm{warp}(z_t, f_{t\to t+1}) \bigr\rVert_{1}\]
  • $z_t, z_{t+1}$: latent features of frame $t$ and frame $t+1$
  • $\mathrm{warp}(z_t, f_{t \to t+1})$: features $z_t$ warped by the predicted flow field $f_{t \to t+1}$
  • $f_{t \to t+1}$: optical flow field predicted by a lightweight network
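
A minimal PyTorch sketch of this loss, assuming the warp is done with `F.grid_sample` over a per-pixel offset field (the helper names `warp` and `flow_consistency_loss` are illustrative, not from a specific codebase):

```python
import torch
import torch.nn.functional as F

def warp(z, flow):
    """Warp features z (B, C, H, W) by a flow field (B, 2, H, W) of pixel offsets."""
    b, _, h, w = z.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, device=z.device, dtype=z.dtype),
        torch.arange(w, device=z.device, dtype=z.dtype),
        indexing="ij",
    )
    grid_x = xs.unsqueeze(0) + flow[:, 0]            # x + dx
    grid_y = ys.unsqueeze(0) + flow[:, 1]            # y + dy
    grid_x = 2.0 * grid_x / (w - 1) - 1.0            # grid_sample expects [-1, 1]
    grid_y = 2.0 * grid_y / (h - 1) - 1.0
    grid = torch.stack((grid_x, grid_y), dim=-1)     # (B, H, W, 2)
    return F.grid_sample(z, grid, align_corners=True)

def flow_consistency_loss(z_t, z_t1, flow_t_to_t1):
    """L1 distance between z_{t+1} and z_t warped by the predicted flow."""
    return (z_t1 - warp(z_t, flow_t_to_t1)).abs().mean()
```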


  • Domain: video frames (temporal consistency)
  • Alignment target: adjacent frames’ latent features
  • Warping operation: a warp based on the predicted optical flow
  • Goal: frame-to-frame coherence in generated video




References

1. Spatiotemporal Video Encoding

  • Feichtenhofer, C., Fan, H., Malik, J., & He, K. “SlowFast Networks for Video Recognition.” CVPR 2019.

  • Capture both spatial detail (handshape, body posture) and temporal dynamics (movement trajectories)
  • Common backbones include ViViT- or TimeSformer-style vision Transformers and 3D CNNs such as SlowFast or MViT

2. Pose & Keypoint Extraction

  • Cao, Z., Hidalgo Martinez, G., Simon, T., Wei, S.-E., & Sheikh, Y. “OpenPose: Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields.” CVPR 2017.

  • Real-time sign recognition often begins by detecting hand and body keypoints with tools like MediaPipe or OpenPose
  • These skeletal sequences are then encoded by Graph Convolutional Networks (GCNs) or lightweight Transformers, reducing raw-pixel complexity
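
As a hedged sketch of that encoding step, here is one graph-convolution layer that mixes each joint’s features with its skeleton neighbors before a shared projection (the class name, adjacency handling, and feature layout are illustrative assumptions, not a particular published model):

```python
import torch
import torch.nn as nn

class SkeletonGCNLayer(nn.Module):
    """One graph convolution over J keypoints: X' = relu(A_hat @ X @ W)."""

    def __init__(self, in_dim, out_dim, adjacency):
        super().__init__()
        # adjacency: (J, J) binary skeleton graph, self-loops included
        deg = adjacency.sum(dim=-1, keepdim=True).clamp(min=1)
        self.register_buffer("a_hat", adjacency / deg)     # row-normalized
        self.proj = nn.Linear(in_dim, out_dim)

    def forward(self, x):
        # x: (batch, frames, J, in_dim), e.g. (x, y, confidence) per joint
        x = torch.einsum("ij,btjc->btic", self.a_hat, x)   # mix neighboring joints
        return torch.relu(self.proj(x))
```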

πŸ“ 3. Multimodal Alignment & Fusion

  • Transformer, ViLBERT

  • Aligning visual signals (hand motion, appearance) with text tokens relies on robust cross-attention mechanisms
  • Architectures employ encoder–decoder attention, co-attention blocks, and both early- and late-fusion strategies to integrate vision and language features
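
A minimal cross-attention block of the kind these architectures stack, sketched with PyTorch’s built-in multi-head attention; the text-as-query, vision-as-key/value arrangement shown here is one common choice, not the only one:

```python
import torch.nn as nn

class CrossAttentionBlock(nn.Module):
    """Text tokens attend over visual tokens (illustrative dimensions)."""

    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_tokens, visual_tokens):
        # query = text, key/value = vision: each word gathers visual evidence
        attended, _ = self.attn(query=text_tokens,
                                key=visual_tokens,
                                value=visual_tokens)
        return self.norm(text_tokens + attended)     # residual + norm
```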

4. Adapter & Distillation Methods

  • The Gemma family uses adapter modules to fine-tune large models with minimal overhead. Techniques like LoRA, prefix-tuning, and adapter fusion inject low-rank updates into fixed backbones
  • Cross-modal distillation (combining KL divergence on output distributions with hidden-state alignment) transfers knowledge from teacher to student efficiently
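
A compact sketch of the low-rank update behind LoRA-style adapters: the base linear layer stays frozen and only two small matrices are trained (the class name and hyperparameters are illustrative):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """y = W x + scale * B (A x), with W frozen and A, B trainable."""

    def __init__(self, base: nn.Linear, rank=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                  # backbone stays fixed
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + (x @ self.lora_a.T @ self.lora_b.T) * self.scale
```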

5. Quantization & On-Device Compression

  • SignGemma targets mobile and edge deployment, requiring post-training quantization (8-bit, 4-bit), mixed-precision (FP16/BF16), and optimized transformer kernels (e.g., FlashAttention)
  • Mastery of these methods ensures model size and compute footprint fit constrained hardware
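
The simplest of these steps, sketched below, is symmetric per-tensor 8-bit post-training quantization of a weight matrix; real toolchains add per-channel scales, calibration data, and fused int8 kernels:

```python
import torch

def quantize_int8(w: torch.Tensor):
    """Symmetric per-tensor post-training quantization to 8 bits."""
    scale = w.abs().max() / 127.0                    # map max |w| onto the int8 range
    q = torch.clamp(torch.round(w / scale), -127, 127).to(torch.int8)
    return q, scale

def dequantize(q: torch.Tensor, scale: torch.Tensor):
    return q.to(torch.float32) * scale

# Round-trip check: storage shrinks 4x vs FP32 at the cost of this error.
w = torch.randn(256, 256)
q, s = quantize_int8(w)
print((dequantize(q, s) - w).abs().max())
```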

6. Multilingual & Zero-Shot Transfer

  • mBERT (Multilingual BERT), Devlin et al. “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.” NAACL 2019.

  • Supporting multiple sign languages (e.g., ASL, BSL) demands domain adaptation strategies
  • Key approaches include multilingual pre-training, adapter-based language extension, and contrastive video–text learning to enable zero-shot transfer across language variants






CLIP

[Input Image (224×224×3)] → [Vision Encoder: ViT-B/32 or RN50] → [Image Features (512)]
         ↓                                                              ↓
[Resize + CenterCrop]                                        [Layer Normalization]
         ↓                                                              ↓
[ToTensor + Normalize]                                       [Linear Projection Matrix]
         ↓                                                              ↓
[CLIP Transform Pipeline]                                    [L2 Normalization]
                                                                       ↓
                                          [InfoNCE Loss with Learnable Temperature]
                                                    ↑           ↓
                                        [torch.matmul(I_f, T_f.T) * exp(τ)]
                                                    ↑
[Input Text (max 77 tokens)] → [Text Encoder: Transformer] → [Text Features (512)]
         ↓                                                              ↓
[ftfy.fix_text() + html.unescape()]                          [Layer Normalization]
         ↓                                                              ↓
[BPE Tokenization (49152 vocab)]                             [Linear Projection Matrix]
         ↓                                                              ↓
[Add [SOS]/[EOS] tokens + Pad]                               [L2 Normalization]
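
A minimal sketch of the objective in the diagram: symmetric InfoNCE over L2-normalized image and text features with a learnable temperature (it assumes the projected 512-d features have already been computed by the two encoders):

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_feats, text_feats, logit_scale):
    """Symmetric InfoNCE over a batch of paired image/text features.

    image_feats, text_feats: (B, 512) projected embeddings
    logit_scale: learnable scalar holding the log inverse temperature, as in CLIP
    """
    I_f = F.normalize(image_feats, dim=-1)                   # L2-normalize -> cosine similarity
    T_f = F.normalize(text_feats, dim=-1)
    logits = torch.matmul(I_f, T_f.T) * logit_scale.exp()    # (B, B) scaled similarities

    targets = torch.arange(I_f.size(0), device=I_f.device)   # matching pairs on the diagonal
    loss_i = F.cross_entropy(logits, targets)                # image -> text
    loss_t = F.cross_entropy(logits.T, targets)              # text  -> image
    return (loss_i + loss_t) / 2
```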



High-Level Topics

Cross-Modal Alignment
├── Core Architectures
│   ├── Dual-Encoder Models
│   │   └─ CLIP (Contrastive Language–Image Pre-training)
│   ├── Multimodal Transformers
│   │   └─ ViLBERT, UNITER, FLAVA, Florence
│   ├── Co-Attention Networks
│   │   └─ VisualBERT, LXMERT
│   └── Graph-Based Alignment
│       └─ MMG (Multimodal Graph), VLP-GNN
│
├── Pretraining & Losses
│   ├── Contrastive Learning
│   │   └─ InfoNCE, LoCL, SLiC (Hard Negative Mining)
│   ├── Masked Multimodal Modeling
│   │   └─ VideoMAE+Text, MM-BEiT
│   ├── Image–Text Matching (ITM)
│   │   └─ alignment head + triplet loss
│   └── Cross-Modal Distillation
│       └─ CLIP Distill, Florence→TinyFlorence
│
├── Attention Mechanisms
│   ├── Cross-Attention
│   ├── Co-Attention
│   ├── Hierarchical Attention
│   └── Efficient Attention
│       └─ Performer, Linformer, FlashAttention
│
├── Advanced Techniques
│   ├── Adapter & Prefix-Tuning
│   │   └─ LoRA, AdapterFusion
│   ├── Prompt-Based Alignment
│   │   └─ CoOp, MaPLe
│   ├── Multimodal Fusion Strategies
│   │   └─ Early-Fusion / Late-Fusion / Joint-Fusion
│   └── Self-Supervised Alignment
│       └─ SLIP, CLIP-style Vision–Speech, VideoCLIP
│
└── Optimization & Deployment
    ├── Model Compression
    │   └─ Quantization, Pruning, Knowledge Distillation
    ├── Hardware Acceleration
    │   └─ TensorRT, ONNX-Runtime, FlashAttention
    ├── Real-time Inference
    │   └─ Streamable Cross-Attention, Low-Latency Decoding
    └── Edge & On-Device
        └─ TFLite, PyTorch Mobile, CoreML


Spatiotemporal Modeling
├── Core Architectures
│   ├── 3D CNNs
│   ├── Video Transformers
│   ├── Graph Neural Networks
│   └── Hybrid Models
├── Attention Mechanisms
│   ├── Temporal Attention
│   ├── Spatial Attention
│   ├── Cross-Modal Attention
│   └── Efficient Attention
├── Advanced Techniques
│   ├── Neural Radiance Fields
│   ├── Flow-based Methods
│   ├── Memory Networks
│   └── State Space Models
└── Optimization & Deployment
    ├── Model Compression
    ├── Hardware Acceleration
    ├── Real-time Processing
    └── Edge Computing







