2025 - Master Thesis

Extreme Sparsity Self-Supervised Learning / Multimodal Alignment / xx


Foundation Model Improvements / Cross-Modal Alignment



Topics

  1. Multimodal Alignment

    • cross-attention
    • contrastive learning
  2. Extreme Sparsity Self-Supervised Learning


References

1. Spatiotemporal Video Encoding

  • Feichtenhofer, C., Fan, H., Malik, J., & He, K. “SlowFast Networks for Video Recognition.” ICCV 2019.

  • Capture both spatial detail (handshape, body posture) and temporal dynamics (movement trajectories)
  • Common backbones include ViViT- or TimeSformer-style vision Transformers and 3D CNNs such as SlowFast or MViT
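A minimal sketch of the two-pathway idea behind SlowFast, assuming PyTorch; the real model uses full ResNet-3D stages and lateral connections between pathways, which are omitted here.

```python
import torch
import torch.nn as nn

class TwoPathwayEncoder(nn.Module):
    """Toy SlowFast-style encoder: a slow pathway over subsampled frames and a
    fast, channel-light pathway over all frames, fused by concatenation."""
    def __init__(self, in_ch=3, slow_dim=64, fast_dim=8, alpha=4):
        super().__init__()
        self.alpha = alpha  # temporal subsampling ratio of the slow pathway
        self.slow = nn.Sequential(  # spatial semantics: few frames, more channels
            nn.Conv3d(in_ch, slow_dim, (1, 7, 7), stride=(1, 2, 2), padding=(0, 3, 3)),
            nn.BatchNorm3d(slow_dim), nn.ReLU(inplace=True))
        self.fast = nn.Sequential(  # motion: all frames, fewer channels
            nn.Conv3d(in_ch, fast_dim, (5, 7, 7), stride=(1, 2, 2), padding=(2, 3, 3)),
            nn.BatchNorm3d(fast_dim), nn.ReLU(inplace=True))

    def forward(self, clip):                       # clip: (B, C, T, H, W)
        z_slow = self.slow(clip[:, :, ::self.alpha]).mean(dim=(2, 3, 4))
        z_fast = self.fast(clip).mean(dim=(2, 3, 4))
        return torch.cat([z_slow, z_fast], dim=1)  # fused clip embedding

clip = torch.randn(2, 3, 16, 112, 112)             # two 16-frame RGB clips
print(TwoPathwayEncoder()(clip).shape)             # torch.Size([2, 72])
```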

2. Pose & Keypoint Extraction

  • Cao, Z., Simon, T., Wei, S.-E., & Sheikh, Y. “Realtime Multi-Person 2D Pose Estimation Using Part Affinity Fields.” CVPR 2017.

  • Real-time sign recognition often begins by detecting hand and body keypoints with tools like MediaPipe or OpenPose
  • These skeletal sequences are then encoded by Graph Convolutional Networks (GCNs) or lightweight Transformers, reducing raw‐pixel complexity
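A hedged sketch of one ST-GCN-style block for encoding keypoint sequences, assuming PyTorch; the adjacency matrix, channel sizes, and the identity placeholder graph are illustrative only.

```python
import torch
import torch.nn as nn

class STGCNBlock(nn.Module):
    """One spatial-temporal graph convolution block over keypoint sequences."""
    def __init__(self, in_ch, out_ch, adjacency):
        super().__init__()
        self.register_buffer("A", adjacency)        # (V, V) normalized joint adjacency
        self.spatial = nn.Conv2d(in_ch, out_ch, kernel_size=1)
        self.temporal = nn.Conv2d(out_ch, out_ch, kernel_size=(9, 1), padding=(4, 0))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):                             # x: (B, C, T, V)
        x = self.spatial(x)                           # per-joint feature transform
        x = torch.einsum("bctv,vw->bctw", x, self.A)  # aggregate over the joint graph
        return self.relu(self.temporal(x))            # mix features across time

V = 21                                      # e.g. 21 hand keypoints
A = torch.eye(V)                            # placeholder graph (identity = no edges)
poses = torch.randn(8, 2, 32, V)            # (batch, xy coords, frames, joints)
print(STGCNBlock(2, 64, A)(poses).shape)    # torch.Size([8, 64, 32, 21])
```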

📍 3. Multimodal Alignment & Fusion

  • Transformer, ViLBERT

  • Aligning visual signals (hand motion, appearance) with text tokens relies on robust cross-attention mechanisms
  • Architectures employ encoder–decoder attention, co-attention blocks, and both early- and late-fusion strategies to integrate vision and language features
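A minimal cross-attention fusion block, assuming PyTorch; here video tokens query text tokens, and a co-attention variant would simply add the mirrored direction.

```python
import torch
import torch.nn as nn

class CrossAttentionBlock(nn.Module):
    """Video tokens (queries) attend over text tokens (keys/values)."""
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_q, self.norm_kv = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, video_tokens, text_tokens):
        fused, _ = self.attn(self.norm_q(video_tokens),
                             self.norm_kv(text_tokens),
                             self.norm_kv(text_tokens))
        x = video_tokens + fused                      # residual cross-attention
        return x + self.ffn(x)

video, text = torch.randn(2, 64, 512), torch.randn(2, 20, 512)
print(CrossAttentionBlock()(video, text).shape)       # torch.Size([2, 64, 512])
```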

📍 4. Low-Latency Decoding - CTC + RNN-Transducer

  • Graves, A., Fernández, S., Gomez, F., & Schmidhuber, J. “Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks.” ICML 2006. - CTC

  • Graves, A. “Sequence Transduction with Recurrent Neural Networks.” arXiv:1211.3711, 2012. - RNN-T

  • For interactive translation, frame-synchronous decoders such as CTC (non-autoregressive) or the RNN-Transducer (streaming) enable low-latency, single-pass inference (a greedy CTC decoder is sketched below)
  • Understanding the trade-offs between greedy decoding and beam search is essential for minimizing latency under resource constraints
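A sketch of greedy CTC decoding under the standard collapse rule (merge repeats, then drop blanks), assuming PyTorch tensors and blank id 0; beam search would replace the per-frame argmax.

```python
import torch

def ctc_greedy_decode(log_probs, blank=0):
    """log_probs: (T, vocab) per-frame log-probabilities for one utterance."""
    best = log_probs.argmax(dim=-1).tolist()         # best symbol per frame
    decoded, prev = [], blank
    for token in best:
        if token != prev and token != blank:         # collapse repeats, drop blanks
            decoded.append(token)
        prev = token
    return decoded

log_probs = torch.randn(50, 30).log_softmax(dim=-1)  # 50 frames, 30-symbol vocab
print(ctc_greedy_decode(log_probs))                  # token ids, decoded in a single pass
```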

5. Adapter & Distillation Methods

  • Parameter-efficient fine-tuning, e.g., of the Gemma family, uses adapter modules to adapt large models with minimal overhead. Techniques such as LoRA, prefix-tuning, and adapter fusion inject low-rank updates into frozen backbones (see the LoRA sketch below)
  • Cross-modal distillation—combining KL divergence on output distributions with hidden-state alignment—transfers knowledge from teacher to student efficiently
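A hedged sketch of a LoRA adapter around a frozen linear layer, assuming PyTorch; the rank and alpha values are illustrative hyperparameters.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear plus a trainable low-rank update (alpha / r) * B A."""
    def __init__(self, base: nn.Linear, rank=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                  # backbone weights stay frozen
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))  # B = 0 at init
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.lora_a.T @ self.lora_b.T)

layer = LoRALinear(nn.Linear(768, 768))
print(layer(torch.randn(4, 768)).shape)              # torch.Size([4, 768]); only A, B train
```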

6. Quantization & On-Device Compression

  • SignGemma targets mobile and edge deployment, requiring post-training quantization (8-bit, 4-bit), mixed-precision (FP16/BF16), and optimized transformer kernels (e.g., FlashAttention)
  • Mastery of these methods ensures model size and compute footprint fit constrained hardware
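A sketch of post-training dynamic quantization with PyTorch's torch.ao.quantization API; this covers only the 8-bit weight case, while 4-bit formats and kernel-level optimizations such as FlashAttention need dedicated libraries.

```python
import torch
import torch.nn as nn

# Toy feed-forward block standing in for a Transformer MLP.
model = nn.Sequential(nn.Linear(768, 3072), nn.GELU(), nn.Linear(3072, 768))

# Weights of Linear layers become int8; activations remain floating point.
quantized = torch.ao.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

print(quantized(torch.randn(1, 768)).shape)   # same interface, smaller weight footprint
```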

7. Multilingual & 📍 Zero-Shot Transfer

  • mBERT (Multilingual BERT): Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.” NAACL 2019.

  • Supporting multiple sign languages (e.g., ASL, BSL) demands domain adaptation strategies
  • Key approaches include multilingual pre-training, adapter-based language extension, and contrastive video–text learning to enable zero-shot transfer across language variants
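A sketch of the symmetric contrastive (InfoNCE) objective used for video–text alignment, assuming embeddings already projected into a shared space; this is the generic CLIP-style loss, not a specific SignGemma recipe.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of matched video-text pairs."""
    video_emb = F.normalize(video_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = video_emb @ text_emb.T / temperature      # (B, B) similarity matrix
    targets = torch.arange(len(logits), device=logits.device)
    # Matched pairs lie on the diagonal; penalize both retrieval directions.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))

video, text = torch.randn(16, 512), torch.randn(16, 512)
print(contrastive_loss(video, text))
```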



  • Veo 3 - DeepMind
  • Gen-4 - Runway
  • Movie Gen - Meta
  • Flow Loss


Flow Matching Loss

Purpose
Enforce temporal smoothness by aligning latent representations of adjacent frames

Definition

\[\mathcal{L}_{\mathrm{flow}} = \bigl\lVert z_{t+1} - \mathrm{warp}(z_t, f_{t\to t+1}) \bigr\rVert_{1}\]
  • $z_t, z_{t+1}$: latent features of frame $t$ and frame $t+1$
  • $\mathrm{warp}(z_t, f_{t \to t+1})$: features $z_t$ warped by the predicted flow field $f_{t \to t+1}$
  • $f_{t \to t+1}$: optical flow field predicted by a lightweight network
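A hedged sketch of this loss, assuming PyTorch latents and a flow field given in pixel offsets; warping is done with backward sampling via grid_sample.

```python
import torch
import torch.nn.functional as F

def flow_loss(z_t, z_t1, flow):
    """z_t, z_t1: (B, C, H, W) latents; flow: (B, 2, H, W) pixel offsets t -> t+1."""
    B, _, H, W = z_t.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    base = torch.stack((xs, ys), dim=-1).to(z_t)            # (H, W, 2) pixel coordinates
    coords = base.unsqueeze(0) + flow.permute(0, 2, 3, 1)   # shifted sampling locations
    # Normalize sampling locations to [-1, 1] as grid_sample expects.
    norm = torch.stack((2 * coords[..., 0] / (W - 1) - 1,
                        2 * coords[..., 1] / (H - 1) - 1), dim=-1)
    warped = F.grid_sample(z_t, norm, align_corners=True)   # warp(z_t, f_{t->t+1})
    return (z_t1 - warped).abs().mean()                     # L1 penalty

z_t, z_t1 = torch.randn(2, 8, 32, 32), torch.randn(2, 8, 32, 32)
flow = torch.zeros(2, 2, 32, 32)        # zero flow reduces to a plain L1 between latents
print(flow_loss(z_t, z_t1, flow))
```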


  • Domain
    • Flow Matching applies to video frames (temporal consistency)
  • Alignment Target
    • Flow Matching aligns adjacent frames’ latent features
  • Warping Operation
    • Flow Matching includes a warp based on optical flow
  • Goal
    • Flow Matching improves frame-to-frame coherence in generated video





References


Frontiers in AI Research (2025)


  1. Efficient Multimodal Alignment & Generation
    • Key Results:
      • CLIPDraw++ (NeurIPS ’24): unified vision–language alignment
      • Video-LLaMA (ICLR ’25): zero-shot text-to-video generation
    • Challenges: real-time deployment, fine-grained controllability, safety/robustness

📍 2. Long-Term Temporal & Structural Consistency

  • Key Results:
    • FlowFormer (CVPR ’25): flow-matching for video coherence
    • MemoryNeRF (NeurIPS ’24): implicit scene memory across seconds
  • Opportunities:
    • scalable frame-level memory modules
    • layered geometric+semantic caching
    • dynamic scene understanding

📍 3. Self-Supervised Learning from Extreme Sparsity

  • Key Results:
    • SparseMAE (ICML ’25): masked autoencoding with <0.1 % tokens
    • Contrastive-Sparse (ICLR ’24): adaptive masking focus on high-entropy regions
  • Goals:
    • near-fully-supervised performance with per-mille-level (≈0.1%) label budgets
    • unified multi-task pretraining (classification, detection, generation); a minimal masking sketch follows after this list
  4. Differentiable Physics & Hybrid Simulation
    • Key Results:
      • DiffPhys (NeurIPS ’24): end-to-end differentiable physics engine
      • FluidNeRF (CVPR ’25): fluid simulation within NeRF framework
    • Directions:
      • trainable raytracing and material modules
      • learned+classical simulator hybrids
      • transferable “physical basis” representations
  5. Verifiable Robustness & Explainable Security
    • Key Results:
      • Certified Diffusion Robustness (ICLR ’25)
      • Provable Transformer Defenses (NeurIPS ’24)
    • Imperatives:
      • certified adversarial bounds
      • causal traceability in generation/decision chains
      • end-to-end system-level trust guarantees
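A minimal sketch of the extreme-sparsity masking regime referenced in item 3, assuming PyTorch; keep_ratio=0.001 mimics the "<0.1% visible tokens" setting, and random scores stand in for an adaptive (entropy-based) selection.

```python
import torch

def sparse_mask(tokens, keep_ratio=0.001):
    """tokens: (B, N, D) patch embeddings; returns visible tokens and kept indices."""
    B, N, D = tokens.shape
    n_keep = max(1, int(N * keep_ratio))                # keep at least one token
    scores = torch.rand(B, N, device=tokens.device)     # random masking; an adaptive
    keep_idx = scores.topk(n_keep, dim=1).indices       # variant would rank by entropy
    visible = torch.gather(tokens, 1, keep_idx.unsqueeze(-1).expand(-1, -1, D))
    return visible, keep_idx                            # the encoder sees only `visible`

tokens = torch.randn(4, 16384, 256)                     # e.g. spatiotemporal patch tokens
visible, idx = sparse_mask(tokens)
print(visible.shape)                                    # torch.Size([4, 16, 256])
```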


📍 1. DiT (Diffusion Transformer)

  • Overview: Combines Transformer context modeling with diffusion denoising
  • Examples
    1. KeyFace – speech-driven face animation via stepwise denoising
    2. DiffLocks – high-fidelity hair generation
    3. Pippo – multi-view rendering with geometric and texture coherence
  • Benefit: Maintains character appearance/style across shots and supports conditional, coherent animation
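A hedged sketch of a DiT-style block, assuming PyTorch: a Transformer block whose normalizations are modulated by the diffusion timestep embedding (adaLN); a real denoiser stacks many such blocks and zero-initializes the modulation.

```python
import torch
import torch.nn as nn

class DiTBlock(nn.Module):
    """Transformer block with timestep-conditioned (adaLN) modulation."""
    def __init__(self, dim=384, heads=6):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.ada = nn.Linear(dim, 6 * dim)   # timestep embedding -> scale/shift/gate x2

    def forward(self, x, t_emb):             # x: (B, N, D) noisy latents, t_emb: (B, D)
        s1, b1, g1, s2, b2, g2 = self.ada(t_emb).unsqueeze(1).chunk(6, dim=-1)
        h = self.norm1(x) * (1 + s1) + b1    # modulate pre-attention activations
        x = x + g1 * self.attn(h, h, h)[0]
        h = self.norm2(x) * (1 + s2) + b2    # modulate pre-MLP activations
        return x + g2 * self.mlp(h)

x, t_emb = torch.randn(2, 256, 384), torch.randn(2, 384)
print(DiTBlock()(x, t_emb).shape)             # torch.Size([2, 256, 384])
```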

📍 2. Dyadic Models

  • Concept: Model both speaking and listening behaviors for interactive avatars
  • Examples
    • INFP / DualTalk: dual-branch networks for speaker lip sync and listener micro-expressions
  • Insight: Ensures consistent identity/style in extended dialogues by modeling two-way interaction

3. Priors

  • Synthetic Priors (GASP, SynShot)
    • Generate “pseudo-real” head avatars (poses, expressions, lighting) to enrich training data
    • Improves generalization to extreme poses and rare expressions
  • Diffusion-based Priors (CAP4D, GAF)
    • Use pretrained diffusion models to produce high-quality 3D avatars or dynamic sequences
    • Accelerates multi-view/multi-expression data generation and boosts video consistency

4. Implications

  • Architecture: Adopt DiT’s diffusion-Transformer design for realistic face rendering across scenes
  • Interaction Consistency: Integrate dyadic modeling to handle speaking and listening coherently
  • Memory Extension: Add a latent memory module to preserve character traits across sessions





