2025 - Master Thesis
Extreme Sparsity Self-Supervised Learning / Multimodal Alignment / xx
Foundation Model Improvements / Cross-Modal Alignment
Topics
- Multimodal Alignment
  - cross-attention
  - contrastive learning
- Extreme Sparsity Self-Supervised Learning
  - Sparse Autoencoder
- Masked Autoencoders (MAE)
  - CVPR 2022 - Masked Autoencoders Are Scalable Vision Learners (extreme sparsity)
References
1. Spatiotemporal Video Encoding
- Feichtenhofer, C., Fan, H., Malik, J., & He, K. "SlowFast Networks for Video Recognition." CVPR 2019.
- Capture both spatial detail (handshape, body posture) and temporal dynamics (movement trajectories)
- Common backbones include ViViT- or TimeSformer-style vision Transformers and 3D CNNs such as SlowFast or MViT
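As a concrete illustration of the input layout these backbones consume, here is a minimal, hypothetical 3D-CNN encoder sketch in PyTorch (not SlowFast/MViT themselves): spatiotemporal convolutions over a (batch, channels, time, height, width) clip, pooled to a clip-level embedding.

```python
import torch
import torch.nn as nn

class Tiny3DEncoder(nn.Module):
    """Toy 3D-CNN: spatiotemporal convolutions over (B, C, T, H, W) clips."""
    def __init__(self, dim=128):
        super().__init__()
        # Stem: large spatial kernel, small temporal kernel (loosely SlowFast-inspired)
        self.stem = nn.Conv3d(3, 32, kernel_size=(3, 7, 7), stride=(1, 2, 2), padding=(1, 3, 3))
        self.block = nn.Sequential(
            nn.BatchNorm3d(32), nn.ReLU(),
            nn.Conv3d(32, dim, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm3d(dim), nn.ReLU(),
        )
        self.pool = nn.AdaptiveAvgPool3d(1)       # global spatiotemporal pooling

    def forward(self, clip):                      # clip: (B, 3, T, H, W)
        feat = self.block(self.stem(clip))
        return self.pool(feat).flatten(1)         # (B, dim) clip-level embedding

# clip = torch.randn(2, 3, 16, 112, 112); emb = Tiny3DEncoder()(clip)  # emb: (2, 128)
```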
2. Pose & Keypoint Extraction
- Cao, Z., Hidalgo Martinez, G., Simon, T., Wei, S.-E., & Sheikh, Y. "OpenPose: Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields." CVPR 2017.
- Real-time sign recognition often begins by detecting hand and body keypoints with tools like MediaPipe or OpenPose
- These skeletal sequences are then encoded by Graph Convolutional Networks (GCNs) or lightweight Transformers, reducing raw‐pixel complexity
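A minimal sketch of the skeleton-encoding step, assuming (x, y, confidence) per keypoint (e.g., from MediaPipe/OpenPose) and a hypothetical normalized adjacency matrix; real ST-GCN variants add temporal convolutions and learned edge weights.

```python
import torch
import torch.nn as nn

class SkeletonGCNLayer(nn.Module):
    """One graph-convolution layer over keypoints; A is a (V, V) joint adjacency."""
    def __init__(self, in_dim, out_dim, adjacency):
        super().__init__()
        self.register_buffer("A", adjacency)          # fixed skeleton graph
        self.proj = nn.Linear(in_dim, out_dim)

    def forward(self, x):                             # x: (B, T, V, C) keypoint features
        x = torch.einsum("vw,btwc->btvc", self.A, x)  # aggregate neighboring joints
        return torch.relu(self.proj(x))

# Example: V=21 hand keypoints with (x, y, confidence) per joint
V = 21
A = torch.eye(V)                                      # placeholder adjacency (self-loops only)
layer = SkeletonGCNLayer(3, 64, A / A.sum(-1, keepdim=True))
feats = layer(torch.randn(2, 32, V, 3))               # (B=2, T=32 frames, V, 64)
```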
📍 3. Multimodal Alignment & Fusion
- Transformer, ViLBERT
- Aligning visual signals (hand motion, appearance) with text tokens relies on robust cross-attention mechanisms
- Architectures employ encoder–decoder attention, co-attention blocks, and both early- and late-fusion strategies to integrate vision and language features
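A minimal sketch of one such cross-attention (encoder-decoder attention) block in PyTorch, with text tokens querying visual tokens; the dimensions and layer arrangement are illustrative assumptions rather than a specific published architecture.

```python
import torch
import torch.nn as nn

class CrossAttentionBlock(nn.Module):
    """Text tokens attend to visual tokens (encoder-decoder style cross-attention)."""
    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        self.norm_ffn = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, text_tokens, visual_tokens):    # both: (B, N, dim)
        # Queries come from the text stream, keys/values from the visual stream
        fused, _ = self.attn(self.norm_q(text_tokens),
                             self.norm_kv(visual_tokens),
                             self.norm_kv(visual_tokens))
        x = text_tokens + fused                       # residual over the text stream
        return x + self.ffn(self.norm_ffn(x))
```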
4. Low-Latency Decoding
- 📍 CTC + RNN-Transducer
- Graves, A., Fernández, S., Gomez, F., & Schmidhuber, J. "Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks." ICML 2006. - CTC
- Graves, A. "Sequence Transduction with Recurrent Neural Networks." arXiv:1211.3711, 2012. - RNN-T
- For interactive translation, non-autoregressive decoders such as CTC or RNN-Transducer enable single-pass inference
- Understanding the trade-offs between greedy decoding and beam search is essential for minimizing latency under resource constraints
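A minimal sketch of greedy CTC decoding, the low-latency end of the greedy-vs-beam-search trade-off: argmax per frame, collapse repeats, drop blanks. Input shape and blank id are assumptions.

```python
import torch

def ctc_greedy_decode(log_probs, blank_id=0):
    """Greedy CTC decoding over per-frame log-probabilities of shape (T, V)."""
    best = log_probs.argmax(dim=-1).tolist()   # most likely symbol per frame
    decoded, prev = [], None
    for tok in best:
        if tok != prev and tok != blank_id:    # collapse repeats, skip blanks
            decoded.append(tok)
        prev = tok
    return decoded                             # gloss/token ids in a single pass

# log_probs = torch.randn(50, 30).log_softmax(-1); ids = ctc_greedy_decode(log_probs)
```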
5. Adapter & Distillation Methods
- Adapter modules allow large backbones such as the Gemma family to be fine-tuned with minimal overhead. Techniques like LoRA, prefix-tuning, and adapter fusion inject low-rank or lightweight trainable updates into a frozen backbone
- Cross-modal distillation—combining KL divergence on output distributions with hidden-state alignment—transfers knowledge from teacher to student efficiently
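Minimal sketches of both ideas, under assumed PyTorch shapes: a LoRA-style linear layer that freezes the base weights and adds a trainable low-rank update, and a distillation loss combining temperature-scaled KL on logits with hidden-state MSE; rank, alpha, temperature, and beta are illustrative defaults.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update: W x + (alpha/r) * B (A x)."""
    def __init__(self, base: nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                       # backbone stays frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at start
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * F.linear(F.linear(x, self.A), self.B)

def distill_loss(student_logits, teacher_logits, student_h, teacher_h, T=2.0, beta=0.5):
    """Temperature-scaled KL on output distributions plus hidden-state MSE alignment."""
    kl = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    return kl + beta * F.mse_loss(student_h, teacher_h)
```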
6. Quantization & On-Device Compression
- SignGemma targets mobile and edge deployment, requiring post-training quantization (8-bit, 4-bit), mixed-precision (FP16/BF16), and optimized transformer kernels (e.g., FlashAttention)
- Mastery of these methods ensures model size and compute footprint fit constrained hardware
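A minimal example of post-training dynamic quantization with PyTorch's built-in utility; the toy two-layer model stands in for the actual translation network, and production pipelines would add static/4-bit schemes and optimized kernels on top.

```python
import torch
import torch.nn as nn

# Toy stand-in for the trained translation model
model = nn.Sequential(nn.Linear(512, 2048), nn.GELU(), nn.Linear(2048, 512)).eval()

# Post-training dynamic quantization: Linear weights stored as int8,
# activations quantized on the fly at inference time
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

fp32_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
print(f"FP32 weights: {fp32_bytes / 1e6:.1f} MB; int8 linears cut weight storage roughly 4x")
```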
7. Multilingual & 📍 Zero-Shot Transfer
- mBERT (Multilingual BERT): Devlin et al. "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." NAACL 2019.
- Supporting multiple sign languages (e.g., ASL, BSL) demands domain adaptation strategies
- Key approaches include multilingual pre-training, adapter-based language extension, and contrastive video–text learning to enable zero-shot transfer across language variants
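A minimal sketch of the contrastive video-text objective (CLIP-style symmetric InfoNCE), assuming paired video and text embeddings in the same batch; the temperature is an illustrative default.

```python
import torch
import torch.nn.functional as F

def video_text_contrastive_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired (video_i, text_i) embeddings."""
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature                       # (B, B) cosine similarities
    targets = torch.arange(v.size(0), device=v.device)   # matching pairs on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))
```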
- Text-to-Image Diffusion Models are Zero-Shot Video Generators
- Latent Memory Module
- Veo3 - Deepmind
- Gen-4 - Runway
- Movie Gen - Meta
- Flow Loss
Flow Matching Loss
Purpose
Enforce temporal smoothness by aligning latent representations of adjacent frames
Definition
\[\mathcal{L}_{\mathrm{flow}} = \bigl\lVert z_{t+1} - \mathrm{warp}(z_t, f_{t\to t+1}) \bigr\rVert_{1}\]
- $z_t, z_{t+1}$: latent features of frame $t$ and frame $t+1$
- $\mathrm{warp}(z_t, f_{t \to t+1})$: features $z_t$ warped by the predicted flow field $f_{t \to t+1}$
- $f_{t \to t+1}$: optical flow field predicted by a lightweight network
- Domain
- Flow Matching applies to video frames (temporal consistency)
- Alignment Target
- Flow Matching aligns adjacent frames’ latent features
- Warping Operation
- Flow Matching includes a warp based on optical flow
- Goal
- Flow Matching improves frame-to-frame coherence in generated video
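A minimal PyTorch sketch of the warping loss defined above, assuming pixel-unit flow and bilinear warping via grid_sample; the lightweight flow-prediction network itself is omitted.

```python
import torch
import torch.nn.functional as F

def warp_latent(z_t, flow):
    """Warp latent features z_t (B, C, H, W) by a pixel-unit flow field (B, 2, H, W)."""
    B, C, H, W = z_t.shape
    ys, xs = torch.meshgrid(
        torch.arange(H, device=z_t.device, dtype=z_t.dtype),
        torch.arange(W, device=z_t.device, dtype=z_t.dtype),
        indexing="ij",
    )
    grid_x = 2.0 * (xs.unsqueeze(0) + flow[:, 0]) / max(W - 1, 1) - 1.0   # normalize to [-1, 1]
    grid_y = 2.0 * (ys.unsqueeze(0) + flow[:, 1]) / max(H - 1, 1) - 1.0
    grid = torch.stack((grid_x, grid_y), dim=-1)                          # (B, H, W, 2)
    return F.grid_sample(z_t, grid, mode="bilinear", padding_mode="border", align_corners=True)

def flow_loss(z_t, z_t1, flow):
    """L_flow = || z_{t+1} - warp(z_t, f_{t->t+1}) ||_1 (mean-reduced)."""
    return (z_t1 - warp_latent(z_t, flow)).abs().mean()
```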
Total Loss
\[\mathcal{L}_{\mathrm{flow}} = \bigl\lVert z_{t+1} - \mathrm{warp}(z_t, f_{t\to t+1}) \bigr\rVert_{1}\]
References
Frontiers in AI Research (2025)
1. Efficient Multimodal Alignment & Generation
- Key Results:
- CLIPDraw++ (NeurIPS ’24): unified vision–language alignment
- Video-LLaMA (ICLR ’25): zero-shot text-to-video generation
- Challenges: real-time deployment, fine-grained controllability, safety/robustness
📍 2. Long-Term Temporal & Structural Consistency
- Key Results:
- FlowFormer (CVPR ’25): flow-matching for video coherence
- MemoryNeRF (NeurIPS ’24): implicit scene memory across seconds
- Opportunities:
- scalable frame-level memory modules
- layered geometric+semantic caching
- dynamic scene understanding
📍 3. Self-Supervised Learning from Extreme Sparsity
- Key Results:
- SparseMAE (ICML ’25): masked autoencoding with <0.1 % tokens
- Contrastive-Sparse (ICLR ’24): adaptive masking focus on high-entropy regions
- Goals:
- near-fully-supervised performance with ‰-level labels
- unified multi-task pretraining (classification, detection, generation)
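A minimal sketch of the extreme-sparsity masking step (MAE-style random token selection), with the keep ratio set to the per-mille regime mentioned above; names and shapes are illustrative.

```python
import torch

def extreme_sparse_mask(tokens, keep_ratio=0.001):
    """Keep only a tiny random subset of patch/frame tokens (B, N, D) for the encoder."""
    B, N, D = tokens.shape
    n_keep = max(1, int(N * keep_ratio))                  # e.g. ~0.1% of tokens
    noise = torch.rand(B, N, device=tokens.device)        # random score per token
    keep_idx = noise.argsort(dim=1)[:, :n_keep]           # lowest scores survive
    visible = torch.gather(tokens, 1, keep_idx.unsqueeze(-1).expand(-1, -1, D))
    return visible, keep_idx                              # encoder sees only `visible`

# tokens = torch.randn(4, 16384, 768); vis, idx = extreme_sparse_mask(tokens)  # vis: (4, 16, 768)
```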
4. Differentiable Physics & Hybrid Simulation
- Key Results:
- DiffPhys (NeurIPS ’24): end-to-end differentiable physics engine
- FluidNeRF (CVPR ’25): fluid simulation within NeRF framework
- Directions:
- trainable raytracing and material modules
- learned+classical simulator hybrids
- transferable “physical basis” representations
5. Verifiable Robustness & Explainable Security
- Key Results:
- Certified Diffusion Robustness (ICLR ’25)
- Provable Transformer Defenses (NeurIPS ’24)
- Imperatives:
- certified adversarial bounds
- causal traceability in generation/decision chains
- end-to-end system-level trust guarantees
📍 1. DiT (Diffusion Transformer)
- Overview: Combines Transformer context modeling with diffusion denoising
- Examples
- KeyFace – speech-driven face animation via stepwise denoising
- DiffLocks – high-fidelity hair generation
- Pippo – multi-view rendering with geometric and texture coherence
- Benefit: Maintains character appearance/style across shots and supports conditional, coherent animation
📍 2. Dyadic Models
- Concept: Model both speaking and listening behaviors for interactive avatars
- Examples
- INFP / DualTalk: dual-branch networks for speaker lip sync and listener micro‐expressions
- Insight: Ensures consistent identity/style in extended dialogues by modeling two-way interaction
3. Priors
- Synthetic Priors (GASP, SynShot)
- Generate “pseudo-real” head avatars (poses, expressions, lighting) to enrich training data
- Improves generalization to extreme poses and rare expressions
- Diffusion-based Priors (CAP4D, GAF)
- Use pretrained diffusion models to produce high-quality 3D avatars or dynamic sequences
- Accelerates multi-view/multi-expression data generation and boosts video consistency
4. Implications
- Architecture: Adopt DiT's diffusion-Transformer design for cross-scene, realistic face rendering
- Interaction Consistency: Integrate dyadic modeling to handle speaking and listening coherently
- Memory Extension: Add a latent memory module to preserve character traits across sessions
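A minimal, hypothetical sketch of what such a latent memory module could look like: a small slot memory read by attention and refreshed per frame/scene; slot count, update rule, and dimensions are all assumptions rather than a published design.

```python
import torch
import torch.nn as nn

class LatentMemory(nn.Module):
    """Toy slot memory: attention read of character traits, FIFO write of new state."""
    def __init__(self, dim=256, slots=16):
        super().__init__()
        self.init_memory = nn.Parameter(torch.randn(slots, dim) * 0.02)  # learned initial slots
        self.read = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.update = nn.GRUCell(dim, dim)

    def forward(self, frame_latent, memory=None):
        # frame_latent: (B, D) per-shot summary; memory: (B, S, D) carried across shots/sessions
        B, _ = frame_latent.shape
        if memory is None:
            memory = self.init_memory.unsqueeze(0).expand(B, -1, -1)
        context, _ = self.read(frame_latent.unsqueeze(1), memory, memory)    # read traits
        new_slot = self.update(frame_latent, memory.mean(dim=1))             # coarse write
        memory = torch.cat([memory[:, 1:], new_slot.unsqueeze(1)], dim=1)    # FIFO slot update
        return context.squeeze(1), memory
```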
- 2025 - Runway Gen-4 solves AI video's biggest problem: character consistency across scenes
- 2025 - New York is a Zoo
- 2025 - The Retrieval