Generative Models - 26
Welcome!
Let's take a look at the history of Generative Models for Science.
1. References
1.1 2024 NeurIPS tutorial - Flow Matching for Generative Modeling, and the 2024 Flow Matching Guide and Code
1.2 2023 Generative Models: An Interdisciplinary Perspective
2. Optical Experiments Simulation by Diffusion Models
2.1 2025 Tunable Optimally-Coded Snapshot Hyperspectral Imaging for Scene Adaptation
2.2 2023 Binarized Spectral Compressive Imaging
2.3 2022 Mask-Guided Spectral-Wise Transformer for Efficient Hyperspectral Image Reconstruction
3. Technical History of Today's GenAI Models
4. A Coding Sample from NeurIPS 2023 - Binarized Spectral Compressive Imaging
4.1 The GitHub project: A Toolbox for Binarized Spectral Compressive Imaging
5. Others
Topics
1. Multimodal Alignment
- cross-attention
- contrastive learning
- CLIP - ViT + Transformer
2. Extreme Sparsity Self-Supervised Learning
- Sparse Autoencoder
- Masked Autoencoders (MAE) - see the masking sketch below
  - CVPR 2022 - Masked Autoencoders Are Scalable Vision Learners
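The masking step at the heart of MAE is easy to sketch: shuffle the patch tokens per sample, keep only the visible fraction (25% with the paper's default 75% mask ratio), and record the permutation so the decoder can later restore order. A minimal PyTorch sketch with illustrative shapes:

```python
import torch

def random_masking(patches, mask_ratio=0.75):
    """Keep a random subset of patch tokens, as in MAE (He et al., CVPR 2022).

    patches: (B, N, D) sequence of patch embeddings.
    Returns the visible tokens and the indices needed to restore order.
    """
    B, N, D = patches.shape
    n_keep = int(N * (1 - mask_ratio))
    noise = torch.rand(B, N)                    # uniform noise per token
    ids_shuffle = noise.argsort(dim=1)          # random permutation
    ids_restore = ids_shuffle.argsort(dim=1)    # inverse permutation
    ids_keep = ids_shuffle[:, :n_keep]          # first n_keep tokens survive
    visible = torch.gather(patches, 1, ids_keep.unsqueeze(-1).expand(-1, -1, D))
    return visible, ids_restore

x = torch.randn(2, 196, 768)                    # 14x14 patches, ViT-B width
vis, ids = random_masking(x)                    # vis: (2, 49, 768)
```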
Global Attention: O((H·W)²) - Every pixel talks to every pixel
Mamba Scans: O(H·W) - Linear recurrence per direction (toy comparison below)
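A toy PyTorch comparison of the two regimes: the (H·W)×(H·W) score matrix is the quadratic term in global attention, while a scan visits each token once per direction. The scalar decay here is a placeholder, not a faithful Mamba kernel, which uses learned, input-dependent state-space parameters:

```python
import torch

B, H, W, D = 1, 64, 64, 32
N = H * W                                    # 4096 tokens
x = torch.randn(B, N, D)

# Global attention: the N x N score matrix is the O((H*W)^2) term.
scores = x @ x.transpose(1, 2) / D ** 0.5    # (B, N, N) -- 4096 x 4096 here
y_attn = torch.softmax(scores, dim=-1) @ x

# Mamba-style scan, reduced to a scalar-decay linear recurrence for
# illustration: one pass over the sequence, O(H*W) per scan direction.
a = 0.9                                      # toy fixed decay
h = torch.zeros(B, D)
ys = []
for t in range(N):
    h = a * h + x[:, t]                      # h_t = a * h_{t-1} + x_t
    ys.append(h)
y_scan = torch.stack(ys, dim=1)              # (B, N, D)
```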
Flow Matching Loss
Purpose
Enforce temporal smoothness by aligning latent representations of adjacent frames
Definition
\[\mathcal{L}_{\mathrm{flow}} = \bigl\lVert z_{t+1} - \mathrm{warp}(z_t, f_{t\to t+1}) \bigr\rVert_{1}\]
- $z_t, z_{t+1}$: latent features of frame $t$ and frame $t+1$
- $\mathrm{warp}(z_t, f_{t \to t+1})$: features $z_t$ warped by the predicted flow field $f_{t \to t+1}$
- $f_{t \to t+1}$: optical flow field predicted by a lightweight network
- Domain
- Flow Matching applies to video frames (temporal consistency)
- Alignment Target
- Flow Matching aligns adjacent framesβ latent features
- Warping Operation
- Flow Matching includes a warp based on optical flow
- Goal
- Flow Matching improves frame-to-frame coherence in generated video (a PyTorch sketch of the loss follows)
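A minimal PyTorch sketch of this loss, assuming the common backward-warping convention with `F.grid_sample` and a pixel-space flow field; the mean-reduced L1 stands in for the norm above:

```python
import torch
import torch.nn.functional as F

def warp(z, flow):
    """Backward-warp features z (B, C, H, W) by a pixel-space flow (B, 2, H, W)."""
    B, _, H, W = z.shape
    ys, xs = torch.meshgrid(
        torch.arange(H, device=z.device, dtype=z.dtype),
        torch.arange(W, device=z.device, dtype=z.dtype),
        indexing="ij",
    )
    base = torch.stack((xs, ys))               # (2, H, W), (x, y) order
    coords = base.unsqueeze(0) + flow          # sampling location per pixel
    # Normalize coordinates to [-1, 1], as grid_sample expects.
    gx = 2.0 * coords[:, 0] / (W - 1) - 1.0
    gy = 2.0 * coords[:, 1] / (H - 1) - 1.0
    grid = torch.stack((gx, gy), dim=-1)       # (B, H, W, 2)
    return F.grid_sample(z, grid, align_corners=True)

def flow_consistency_loss(z_t, z_t1, flow_t_to_t1):
    # L_flow = || z_{t+1} - warp(z_t, f_{t->t+1}) ||_1 (mean-reduced here)
    return (z_t1 - warp(z_t, flow_t_to_t1)).abs().mean()

z_t, z_t1 = torch.randn(2, 64, 32, 32), torch.randn(2, 64, 32, 32)
loss = flow_consistency_loss(z_t, z_t1, torch.zeros(2, 2, 32, 32))
```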
References
1. Spatiotemporal Video Encoding
- Feichtenhofer, C., Fan, H., Malik, J., & He, K. "SlowFast Networks for Video Recognition." CVPR 2019.
- Capture both spatial detail (handshape, body posture) and temporal dynamics (movement trajectories)
- Common backbones include ViViT- or TimeSformer-style vision Transformers and 3D CNNs such as SlowFast or MViT
2. Pose & Keypoint Extraction
- Cao, Z., Hidalgo Martinez, G., Simon, T., Wei, S.-E., & Sheikh, Y. "OpenPose: Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields." CVPR 2017.
- Real-time sign recognition often begins by detecting hand and body keypoints with tools like MediaPipe or OpenPose
- These skeletal sequences are then encoded by Graph Convolutional Networks (GCNs) or lightweight Transformers, reducing raw-pixel complexity (see the sketch after this list)
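A hypothetical sketch of that encoding stage: per-frame keypoints are flattened into one token per frame and passed through a small Transformer encoder. The joint count (75, roughly MediaPipe Holistic body plus both hands) and the widths are illustrative choices:

```python
import torch
import torch.nn as nn

class KeypointSequenceEncoder(nn.Module):
    """Lightweight Transformer over per-frame skeleton keypoints.

    Input: (B, T, J, 2) -- T frames, J keypoints with (x, y) coordinates,
    e.g. from MediaPipe or OpenPose. Each frame's J*2 coordinates are
    flattened into one token, so the sequence length is T, not T*J.
    """
    def __init__(self, n_joints=75, d_model=128, n_layers=2):
        super().__init__()
        self.embed = nn.Linear(n_joints * 2, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, kp):                      # kp: (B, T, J, 2)
        B, T, J, _ = kp.shape
        tokens = self.embed(kp.reshape(B, T, J * 2))
        return self.encoder(tokens)             # (B, T, d_model)

feats = KeypointSequenceEncoder()(torch.randn(4, 32, 75, 2))
```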
3. Multimodal Alignment & Fusion
- Transformer (Vaswani et al., NeurIPS 2017); ViLBERT (Lu et al., NeurIPS 2019)
- Aligning visual signals (hand motion, appearance) with text tokens relies on robust cross-attention mechanisms
- Architectures employ encoder-decoder attention, co-attention blocks, and both early- and late-fusion strategies to integrate vision and language features (a minimal cross-attention block is sketched after this list)
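A minimal PyTorch cross-attention block in this spirit: text tokens act as queries over visual tokens, followed by a residual connection and layer norm. The dimensions and the single fusion layer are illustrative choices, not any specific paper's architecture:

```python
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    """One cross-attention layer: Query = text, Key/Value = vision."""
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, text_tokens, visual_tokens):
        # Each word token attends over all visual patch tokens.
        fused, _ = self.attn(text_tokens, visual_tokens, visual_tokens)
        return self.norm(text_tokens + fused)   # residual + layer norm

out = CrossAttentionFusion()(torch.randn(2, 16, 512), torch.randn(2, 49, 512))
```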
4. Adapter & Distillation Methods
- The Gemma family uses adapter modules to fine-tune large models with minimal overhead. Techniques like LoRA, prefix-tuning, and adapter fusion inject low-rank updates into fixed backbones (see the sketch after this list)
- Cross-modal distillation, combining KL divergence on output distributions with hidden-state alignment, transfers knowledge from teacher to student efficiently
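A sketch of the LoRA idea from the first bullet: freeze the backbone weight and train only a rank-r update scaled by alpha/r. Initializing B at zero makes the adapter start as a no-op:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """W x + (alpha/r) * B(A x), with the base weight W frozen."""
    def __init__(self, base: nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False             # freeze the backbone weight
        self.A = nn.Linear(base.in_features, r, bias=False)
        self.B = nn.Linear(r, base.out_features, bias=False)
        nn.init.zeros_(self.B.weight)           # start as an identity update
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * self.B(self.A(x))

layer = LoRALinear(nn.Linear(512, 512))
y = layer(torch.randn(4, 512))                  # only A and B receive gradients
```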
5. Quantization & On-Device Compression
- SignGemma targets mobile and edge deployment, requiring post-training quantization (8-bit, 4-bit), mixed-precision (FP16/BF16), and optimized transformer kernels (e.g., FlashAttention); a quantization example follows this list
- Mastery of these methods ensures model size and compute footprint fit constrained hardware
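As one concrete instance, PyTorch's dynamic post-training quantization converts nn.Linear weights to int8 with no calibration data. The toy model below stands in for the deployed network; this is an illustration of the technique, not the SignGemma pipeline:

```python
import torch
import torch.nn as nn

# A toy Transformer standing in for the deployed model.
model = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=256, nhead=4, batch_first=True),
    num_layers=2,
)

# Dynamic post-training quantization: nn.Linear weights become int8;
# activations are quantized on the fly, so no calibration set is needed.
qmodel = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 16, 256)
with torch.no_grad():
    y = qmodel(x)                   # same interface, smaller weights
```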
6. Multilingual & Zero-Shot Transfer
- mBERT (Multilingual BERT): Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." NAACL 2019.
- Supporting multiple sign languages (e.g., ASL, BSL) demands domain adaptation strategies
- Key approaches include multilingual pre-training, adapter-based language extension, and contrastive videoβtext learning to enable zero-shot transfer across language variants
- Text-to-Image Diffusion Models are Zero-Shot Video Generators
- Latent Memory Module
CLIP
[Input Image (224×224×3)]
  → [CLIP Transform Pipeline: Resize + CenterCrop → ToTensor + Normalize]
  → [Vision Encoder: ViT-B/32 or RN50]
  → [Layer Normalization] → [Linear Projection Matrix] → [L2 Normalization]
  → [Image Features (512)]

[Input Text (max 77 tokens)]
  → [ftfy.fix_text() + html.unescape()] → [BPE Tokenization (49,152 vocab)]
  → [Add [SOS]/[EOS] tokens + Pad]
  → [Text Encoder: Transformer]
  → [Layer Normalization] → [Linear Projection Matrix] → [L2 Normalization]
  → [Text Features (512)]

[Image Features (512)] + [Text Features (512)]
  → [torch.matmul(I_f, T_f.T) * exp(τ)]
  → [InfoNCE Loss with Learnable Temperature]
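The joint step from the diagram as a PyTorch sketch: L2-normalize both feature sets, scale the similarity matrix by the exponentiated learnable temperature (CLIP initializes it to log(1/0.07) ≈ 2.659), and average the cross-entropy over both directions:

```python
import torch
import torch.nn.functional as F

def clip_infonce(image_feats, text_feats, log_temp):
    """Symmetric InfoNCE over a batch of paired features, as in CLIP."""
    I_f = F.normalize(image_feats, dim=-1)              # L2 normalization
    T_f = F.normalize(text_feats, dim=-1)
    logits = torch.matmul(I_f, T_f.T) * log_temp.exp()  # (B, B) similarities
    labels = torch.arange(logits.size(0))               # positives on the diagonal
    # Average the image->text and text->image cross-entropies.
    return 0.5 * (F.cross_entropy(logits, labels)
                  + F.cross_entropy(logits.T, labels))

log_temp = torch.nn.Parameter(torch.tensor(2.659))      # log(1/0.07), CLIP's init
loss = clip_infonce(torch.randn(8, 512), torch.randn(8, 512), log_temp)
```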
High-Level Topics
Cross-Modal Alignment
├── Core Architectures
│   ├── Dual-Encoder Model
│   │   └─ CLIP (Contrastive Language–Image Pre-training)
│   ├── Multimodal Transformers
│   │   └─ ViLBERT, UNITER, FLAVA, Florence
│   ├── Co-Attention Networks
│   │   └─ VisualBERT, LXMERT
│   └── Graph-Based Alignment
│       └─ MMG (Multimodal Graph), VLP-GNN
│
├── Pretraining & Losses
│   ├── Contrastive Learning
│   │   └─ InfoNCE, LoCL, SLiC (Hard Negative Mining)
│   ├── Masked Multimodal Modeling
│   │   └─ VideoMAE+Text, MM-BEiT
│   ├── Image–Text Matching (ITM)
│   │   └─ alignment head + triplet loss
│   └── Cross-Modal Distillation
│       └─ CLIP Distill, Florence → TinyFlorence
│
├── Attention Mechanisms
│   ├── Cross-Attention
│   ├── Co-Attention
│   ├── Hierarchical Attention
│   └── Efficient Attention
│       └─ Performer, Linformer, FlashAttention
│
├── Advanced Techniques
│   ├── Adapter & Prefix-Tuning
│   │   └─ LoRA, AdapterFusion
│   ├── Prompt-Based Alignment
│   │   └─ CoOp, MaPLe
│   ├── Multimodal Fusion Strategies
│   │   └─ Early-Fusion / Late-Fusion / Joint-Fusion
│   └── Self-Supervised Alignment
│       └─ SLIP, CLIP-style Vision–Speech, VideoCLIP
│
└── Optimization & Deployment
    ├── Model Compression
    │   └─ Quantization, Pruning, Knowledge Distillation
    ├── Hardware Acceleration
    │   └─ TensorRT, ONNX-Runtime, FlashAttention
    ├── Real-time Inference
    │   └─ Streamable Cross-Attention, Low-Latency Decoding
    └── Edge & On-Device
        └─ TFLite, PyTorch Mobile, CoreML
Spatiotemporal Modeling
├── Core Architectures
│   ├── 3D CNNs
│   ├── Video Transformers
│   ├── Graph Neural Networks
│   └── Hybrid Models
├── Attention Mechanisms
│   ├── Temporal Attention
│   ├── Spatial Attention
│   ├── Cross-Modal Attention
│   └── Efficient Attention
├── Advanced Techniques
│   ├── Neural Radiance Fields
│   ├── Flow-based Methods
│   ├── Memory Networks
│   └── State Space Models
└── Optimization & Deployment
    ├── Model Compression
    ├── Hardware Acceleration
    ├── Real-time Processing
    └── Edge Computing