2026 - Thesis - Liver Predictor

SSL, USZ


References


  • 📍 ViT: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale, ICLR 2021.
  • Large-scale pancreatic cancer detection via non-contrast CT and deep learning, Nature 2023.
  • BYOL: Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning, Google DeepMind, NeurIPS 2020.
  • CLIP: Learning Transferable Visual Models From Natural Language Supervision, ICML 2021.
  • AlexNet: ImageNet Classification with Deep Convolutional Neural Networks, NeurIPS 2012.
  • 📍 ResNet: Deep Residual Learning for Image Recognition, CVPR 2016.
  • 2026 - Scaling medical imaging report generation with multimodal reinforcement learning
        Representation Learning
                   ▲
                   │
Small / Limited Supervision ──── Tabular / Medical Data

Moderate clustering metrics across PCA, t-SNE, and UMAP indicate non-random latent structure but insufficient outcome separability, highlighting the need for representation learning beyond geometric proximity in raw tabular space.


Impact of Realistic Quantum Noise Modeling on System Capability

Noise Characteristic | Consequence If Ignored | Capability When Properly Modeled
---------------------|------------------------|---------------------------------
Non-Gaussian Noise | Rare but large noise events catastrophically break control loops and invalidate average-error assumptions | Control policies become robust to outliers and extreme events, enabling reliable operation under real hardware conditions
Temporal Drift | Continuous manual recalibration is required as device parameters slowly change over time | The system adapts online, tracking slow parameter drift automatically and maintaining performance without human intervention
Non-Markovian Memory Effects | Gate fidelity collapses because past operations influence future behavior in unmodeled ways | Long-horizon stability is achieved by learning history-dependent dynamics and compensating for hardware memory
Spatial and Temporal Correlations | Crosstalk accumulates across qubits, causing errors to scale with system size | Coordinated, chip-level control strategies emerge that actively suppress correlated errors
Physical Noise Origins | Noise is treated as an abstract nuisance, offering no guidance for improving hardware | Learned models expose actionable physical causes, directly informing materials, layout, and electronics design
High-Dimensional Structure | Simplified models fail to represent reality and break as system size grows | Scalable AI models capture latent structure, enabling control and optimization of large-scale quantum processors


The ‘Right Abstraction’ for a System

Category                          | Focus Area / Topics
----------------------------------|------------------------------------------------------------
Representation & Inductive Bias   | Network architecture design
                                  | Multimodal alignment (vision–language–action)
                                  | World models
                                  | 
                                  | Impact of good inductive bias:
                                  | - 10× less data required
                                  | - 100× lower training cost

Learning × Control                | Model-based reinforcement learning
                                  | Differentiable MPC
                                  | Latent dynamics models
                                  | Structured sim-to-real methods
                                  |
                                  | Key insight:
                                  | - Control background is a major advantage
                                  | - Not brute-force learning

System-level AI                   | ML compilers
                                  | Scheduling on heterogeneous hardware
                                  | Memory-aware training
                                  | Inference optimization
                                  |
                                  | These areas strongly reward intelligent
                                  | system and structure design

Robotics / Embodied AI            | Not actuators, SEA, or motors
                                  | (These are constrained by physics)
                                  |
                                  | Instead focus on:
                                  | - Contact representation
                                  | - Hybrid system abstraction
                                  | - Task decomposition
                                  | - Perception–control interfaces
                                  | - Failure-aware planning
                                  |
                                  | Goal:
                                  | - Replace large sets of heuristics with
                                  |   a unified abstraction


Liver Donor - 39 Real-world Patient Cases

# TabularEncoder, TransplantabilityHead, and predict_transplantability are
# assumed to be defined in the project codebase.
FEATURES = [
    "donor_age",
    "AST",
    "ALT",
    "bilirubin",
    "DCD",                  # donation after circulatory death (binary)
    "cold_ischemia_time",
    "warm_ischemia_time",
]

# Input dim is doubled, presumably to pair each feature with a
# missingness-mask channel (assumption).
encoder = TabularEncoder(input_dim=len(FEATURES) * 2)
classifier = TransplantabilityHead(z_dim=64)

# Inference mode: freezes dropout and batch-norm statistics.
encoder.eval()
classifier.eval()

p = predict_transplantability(
    "donor_012.json",
    encoder,
    classifier,
    FEATURES,
)

print(f"Predicted P(TX) = {p:.3f}")


References


Robotics

1. Series Elastic Actuators


2. Advances in self-supervised multimodal learning - Prof. Dr. Hilde Kuehne (Tuebingen AI Center)


3. Products



4D Gaussian Formulation

Property | 3D Gaussian Splatting | 4D Gaussian Fields
---------|-----------------------|-------------------
Temporal modeling | Static scene | Dynamic, time-dependent scene
Parameterization | Fixed $(\mu_i, \Sigma_i, c_i)$ | Functions $(\mu_i(t), \Sigma_i(t), c_i(t))$
Motion representation | None | Explicit velocity field $v(\mathbf{x}, t)$
Topology handling | Fixed structure | Supports appearance/disappearance
Continuity | Spatial smoothness | Spatiotemporal smoothness
Rendering | Per-frame splatting | Motion-compensated splatting


Metric-Scale 3D Reconstruction

Definition

  • “Metric-scale” means that the reconstructed 3D scene is expressed in real-world physical units (e.g., meters) rather than up-to-scale or normalized units.

  • Up-to-scale reconstruction recovers only the structure’s shape, not its real scale factor.
    Example: a room reconstructed as either 3 m or 30 m wide appears identical.
  • Metric-scale reconstruction estimates a global scale parameter that converts the up-to-scale 3D structure into real-world dimensions.
  • In MapAnything, a global metric scale token $m$ is predicted such that: \(X_i^{metric} = m \cdot X_i^{\sim}\) where $X_i^{\sim}$ is the up-to-scale reconstruction.
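The scale correction itself is a single multiplication; a minimal numpy sketch (the log-scale parameterization, which keeps $m$ strictly positive, is an assumption, as is the function name):

```python
import numpy as np

def apply_metric_scale(points_upto_scale, log_m):
    """Convert up-to-scale points to metric units: X_metric = m * X~.

    Predicting log(m) instead of m keeps the scale strictly positive.
    """
    return np.exp(log_m) * points_upto_scale
```

For example, a room reconstructed with unit width and $\log m = \log 3$ comes out 3 m wide.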

Pipeline

Objective

  • To perform feed-forward, metric-scale 4D reconstruction of dynamic scenes using a Time-Varying Generalized Camera model.

Concept

  • MapAnything models static multi-view geometry with a generalized camera (a fixed set of light rays); here the ray set is made time-varying:

\(\mathcal{C}(t) = \{ (p_i(t), d_i(t)) \}\), where each pixel corresponds to a ray with a time-dependent origin $p_i(t)$ and direction $d_i(t)$.


Input Design

Input Type | Description | Example
-----------|-------------|--------
Image sequence $I_t$ | Temporal image frames or asynchronous event accumulations | RGB / event frames
Geometric priors | Extrinsics, intrinsics, sparse depth, IMU | VICON, COLMAP, SLAM
Time label $t$ | Frame or event timestamp | μs or ms
Optional motion prior | Scene flow or optical flow initialization | RAFT3D, DynamicStereo

Model Architecture

Encoder

  • Vision Transformer backbone (e.g., DINOv2 / ViT-L)
  • Temporal Positional Encoding (TPE): \(\text{TPE}(t) = \sin(\omega t + \phi)\)
  • Token = image patch + geometric features + time embedding
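A minimal numpy sketch of this sinusoidal temporal encoding (the bank of geometrically spaced frequencies and the sine/cosine channel split mirror standard transformer positional encodings and are assumptions here):

```python
import numpy as np

def temporal_positional_encoding(t, dim=8, base=10000.0):
    """TPE(t) = sin(omega_k * t + phi_k) over a bank of frequencies.

    phi_k = 0 for the first half of the channels (sine) and pi/2 for the
    second half (cosine), with geometrically spaced omega_k.
    """
    k = np.arange(dim // 2)
    omega = base ** (-2.0 * k / dim)            # (dim/2,) frequencies
    angles = np.outer(np.atleast_1d(t), omega)  # (T, dim/2)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
```

Each timestamp thus maps to a fixed-length vector that can be summed with, or concatenated to, the patch and geometry tokens.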

Transformer Core

  • Based on MapAnything’s Alternating-Attention Transformer
  • Extended to cross-attention over (views × time)
  • Introduces a motion-aware attention block modeling $\partial p / \partial t$

Decoder Heads

  • Ray directions: $R_i(t)$
  • Depths along rays: $D_i(t)$
  • Camera poses: $P_i(t) = [R_i(t), T_i(t)]$
  • Global scale: $m(t)$
  • Scene flow: $F_i(t) = X_i(t+\Delta t) - X_i(t)$
  • Temporal clustering: cluster latent features by motion patterns
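Given the ray-direction and depth heads, the metric point map follows directly by lifting each ray: $X_i(t) = p_i(t) + D_i(t)\,d_i(t)$, with ray origins taken from the camera pose. A minimal numpy sketch (the function name is illustrative):

```python
import numpy as np

def backproject(origins, dirs, depths):
    """Lift per-pixel rays to 3D points: X = p + D * d.

    origins: (N, 3) ray origins p_i(t)
    dirs:    (N, 3) ray directions d_i(t), normalized internally
    depths:  (N,)   depths D_i(t) along each ray
    """
    dirs = dirs / np.linalg.norm(dirs, axis=-1, keepdims=True)
    return origins + depths[:, None] * dirs
```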

Loss Functions

Loss | Meaning | Expression
-----|---------|-----------
$L_{geom}$ | Geometric consistency (RDP structure) | As in MapAnything
$L_{metric}$ | Metric scale consistency | $\lvert \log m_{\text{pred}} - \log m_{\text{gt}} \rvert$
$L_{flow}$ | Temporal scene flow consistency | $\lvert X_t + F_t - X_{t+\Delta t} \rvert$
$L_{cluster}$ | Motion clustering | Contrastive or self-distillation
$L_{smooth}$ | Temporal smoothness | $\lvert X_{t+1} - 2X_t + X_{t-1} \rvert$
$L_{mask}$ | Dynamic mask supervision | BCE or uncertainty weighting
  • Adaptive Robust Loss is used to weight all residuals (see Section 8).
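The metric, flow, and smoothness terms are plain residual norms; a numpy sketch of the three (mean reduction over points is an assumption):

```python
import numpy as np

def metric_loss(m_pred, m_gt):
    # L_metric = |log m_pred - log m_gt|
    return abs(np.log(m_pred) - np.log(m_gt))

def flow_loss(X_t, F_t, X_next):
    # L_flow = |X_t + F_t - X_{t+dt}|: predicted flow should carry
    # the points at time t onto the points at time t + dt.
    return np.mean(np.linalg.norm(X_t + F_t - X_next, axis=-1))

def smooth_loss(X_prev, X_t, X_next):
    # L_smooth = |X_{t+1} - 2 X_t + X_{t-1}|: second temporal
    # difference, zero for constant-velocity motion.
    return np.mean(np.linalg.norm(X_next - 2 * X_t + X_prev, axis=-1))
```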

Output

  • The model outputs: \(\mathcal{O} = \{ X_i(t), P_i(t), F_i(t), m(t), C_i(t) \}\)

Where:

  • $X_i(t)$: metric 3D points
  • $P_i(t)$: camera poses
  • $F_i(t)$: scene flow
  • $m(t)$: metric scale
  • $C_i(t)$: motion clusters

Training Strategy

Stage | Goal | Data
------|------|-----
1. Static pretraining | Learn static geometry and scale | MapAnything datasets
2. Temporal alignment | Temporal consistency learning | Dynamic Replica / TartanAirV2
3. Spatio-temporal fine-tuning | Train flow and clustering heads | Synthetic dynamic datasets
4. Self-supervised fine-tuning | Real-data adaptation | Photometric + geometric consistency

Adaptive Robust Loss

Core Idea

  • Adaptive Robust Loss is a general parametric loss family that unifies and generalizes $L_2$, $L_1$, Cauchy, Geman–McClure, and other robust losses under a single formulation.

General form

\[L(x; \alpha, c) = \frac{|\alpha - 2|}{\alpha} \left( \left( \frac{(x/c)^2}{|\alpha - 2|} + 1 \right)^{\alpha/2} - 1 \right)\]

where:

  • $\alpha$: shape parameter controlling robustness
  • $c$: scale parameter controlling residual normalization

Special cases

$\alpha$ | Equivalent Loss | Behavior
---------|-----------------|---------
2 | L2 (Gaussian) | Sensitive, fast convergence
1 | L1 (Laplacian) | Moderately robust
0 | Cauchy | Heavy-tailed, robust
-2 | Geman–McClure | Very robust
$\to -\infty$ | Welsch / Tukey | Bounded, ignores outliers
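The family has removable singularities at $\alpha = 0$ and $\alpha = 2$, which must be treated as limits in code. A minimal numpy sketch of the general form (Barron, CVPR 2019):

```python
import numpy as np

def general_robust_loss(x, alpha, c):
    """Barron's general robust loss rho(x; alpha, c).

    Uses the closed form for generic alpha and the analytic limits
    at alpha = 2 (scaled L2) and alpha = 0 (Cauchy).
    """
    r = (x / c) ** 2
    if alpha == 2:
        return 0.5 * r                    # L2 limit
    if alpha == 0:
        return np.log(0.5 * r + 1.0)      # Cauchy limit
    b = abs(alpha - 2.0)
    return (b / alpha) * ((r / b + 1.0) ** (alpha / 2.0) - 1.0)
```

Setting $\alpha = -2$ reproduces Geman–McClure, e.g. $\rho(1; -2, 1) = 2/5$.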

Adaptive Mechanism

$\alpha$ and $c$ are learnable via backpropagation, allowing the model to tune its robustness automatically:

  • Early in training: smaller $\alpha$ → higher robustness to outliers
  • Later: $\alpha \to 2$ → smoother convergence
  • This adaptivity stabilizes training on the long-tailed error distributions common in visual geometry.

Benefits

  • Unifies all standard robust losses
  • Automatically adjusts to dataset noise level
  • Requires no manual tuning
  • Widely used in SLAM, SfM, VO, and 3D reconstruction tasks

Evaluation Metrics

Category | Metric
---------|-------
Geometry | Depth rel, τ, ATE RMSE
Temporal consistency | Flow EPE, Temporal Chamfer distance
Clustering | Adjusted Rand Index (ARI), mIoU
Scale | Relative Scale Error
Overall | Reconstruction quality over time


What Kinds of Computer Vision Tasks It Can Be Applied To

Task Category | Example Tasks | Why Transfer Learning Helps
--------------|---------------|----------------------------
Image Classification | Object or scene classification | Reuses low- and mid-level visual features learned from large datasets.
Object Detection | Bounding box localization and recognition | Transfers backbone representations to detection heads.
Semantic Segmentation | Pixel-level labeling of images | Leverages shared visual structure across tasks.
Depth Estimation | Predicting depth or geometry from images | Adapts learned visual cues to geometric inference.
Video Understanding | Action recognition, temporal perception | Transfers spatial features to temporal models.
Domain Adaptation | Cross-domain image understanding | Allows adaptation between different visual domains.
Robotics Perception | Object recognition for manipulation | Enables rapid adaptation to new environments.
Autonomous Systems | Road scene understanding, obstacle detection | Shares representations across perception subtasks.
Multi-task Learning Systems | Unified perception pipelines | Supports multiple vision tasks using a shared model backbone.



Multi-modal Inference

  • The capacity to interpret multimodal inputs accurately typically emerges only in large models with billions of parameters.

