2026 - Master Thesis - Space
LiDAR-Free, 4D Brain Mapping, PRS
Topics
- EU Open Research Repository, AiiDA.net
- CERN, PSI
- 2025 - ZapBench, PathFinder
- 2026 - How the brain's wiring changes
- 2026 - SpaceX
Coding
References
- CVPR
- If a team or mentor can tolerate you saying "This has no information" and still listens carefully to the rest of your sentence, then it is a very good peer / team. - DiffusionDrive, CVPR highlight 2025.
- Disentangling Monocular 3D Object Detection, ICCV 2019.
- A core method for 3D perception that does not rely on LiDAR; it laid the foundation for many subsequent 3D tracking and 3D MOT vision methods.
- Monocular 3D Object Detection Leveraging Accurate Proposals and Shape Reconstruction, CVPR 2019.
- Monocular Quasi-Dense 3D Object Tracking, 2021.
- Multi-Level Fusion based 3D Object Detection from Monocular Images, CVPR 2018.
- Development of the Nervous System, Prof. Dr. Stoeckli Esther
Multi-sensor Input Fusion From Space, Safety Detection
- 1960 - A New Approach to Linear Filtering and Prediction Problems, Kalman
- 2005 - Probabilistic Robotics, Multi-sensor Input Fusion
- 2025 - ACDC Dataset, training and testing semantic perception on adverse visual conditions
- 2019 - Calibration Wizard: A Guidance System for Camera Calibration Based on Modelling Geometric and Corner Uncertainty
Topics
0. Sensor Modalities and Data Types
| Modality | Sensor Type | Data Representation |
|---|---|---|
| Optical | Visible-light satellite camera | 3-channel RGB image (8-bit) |
| SAR | Synthetic Aperture Radar | 1-channel SAR image (32-bit float) |
1. Maritime Search and Rescue
Optical satellite images
+ SAR satellite images
→ Ship Detection
→ Ship Re-Identification (ReID)
→ Trajectory generation & route prediction
| Platform | Strength | Fundamental Limitation |
|---|---|---|
| GEO satellites | Wide coverage, high temporal resolution | Low spatial resolution |
| Video satellites | High spatial & temporal resolution | Short duration, small coverage |
| AIS-based systems | Accurate identity info | Only works for cooperative targets |
| Axis | Examples |
|---|---|
| Sensors | Optical, SAR, LiDAR, multispectral |
| Tasks | Detection, ReID, tracking, mapping |
| Scale | Local → Global |
| Time | Snapshot → Long-term monitoring |
2. Input Data Type
| Modality | Data Type | Format |
|---|---|---|
| Optical | RGB image | 3-channel, 8-bit TIF |
| SAR | Radar backscatter | 1-channel, 32-bit float TIF |
| Geometry | Ship size (derived) | Numeric vector (length, width, aspect ratio) |
3. Fusion Space
Optical image ──┐
                ├── Dual-head tokenizer → Shared Transformer Encoder → Unified embedding
SAR image ──────┘
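A minimal sketch of the dual-head tokenizer idea above, assuming PyTorch; the class name `DualHeadFusion`, the patch size, and the layer counts are illustrative choices, not a fixed design.

```python
# Sketch (assumption: PyTorch): modality-specific tokenizers feed one shared
# Transformer encoder that produces a unified embedding for optical + SAR input.
import torch
import torch.nn as nn

class DualHeadFusion(nn.Module):
    def __init__(self, d_model=256, patch=16):
        super().__init__()
        # Dual-head tokenizer: patchify + linear projection, one head per modality.
        self.opt_tokenizer = nn.Conv2d(3, d_model, kernel_size=patch, stride=patch)  # RGB: 3 channels
        self.sar_tokenizer = nn.Conv2d(1, d_model, kernel_size=patch, stride=patch)  # SAR: 1 channel
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)                    # shared encoder

    def forward(self, optical, sar):
        # (B, C, H, W) -> (B, N_tokens, d_model) for each modality
        t_opt = self.opt_tokenizer(optical).flatten(2).transpose(1, 2)
        t_sar = self.sar_tokenizer(sar).flatten(2).transpose(1, 2)
        tokens = torch.cat([t_opt, t_sar], dim=1)   # joint token sequence
        fused = self.encoder(tokens)                # shared weights see both modalities
        return fused.mean(dim=1)                    # unified embedding (B, d_model)

emb = DualHeadFusion()(torch.randn(2, 3, 256, 256), torch.randn(2, 1, 256, 256))
```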
4. Output Data
| Stage | Output Used |
|---|---|
| ReID | Feature distance matrix |
| Tracking | Identity association |
| Trajectory | Time-ordered identity matches |
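A small sketch of how the ReID, tracking, and trajectory stages chain together, assuming NumPy/SciPy and L2-normalized ReID embeddings; the function name `associate` and the 0.5 distance gate are hypothetical choices.

```python
# Sketch (assumption: NumPy/SciPy): ReID feature distance matrix -> identity
# association -> time-ordered identity matches for trajectory building.
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(prev_feats, curr_feats, max_dist=0.5):
    """prev_feats: (M, D), curr_feats: (N, D) L2-normalized ReID embeddings."""
    # ReID stage: cosine distance matrix between the two sets of detections.
    dist = 1.0 - prev_feats @ curr_feats.T
    # Tracking stage: minimum-cost identity association (Hungarian algorithm).
    rows, cols = linear_sum_assignment(dist)
    # Keep only confident matches; these become time-ordered identity links.
    return [(r, c) for r, c in zip(rows, cols) if dist[r, c] < max_dist]

prev = np.random.randn(4, 128); prev /= np.linalg.norm(prev, axis=1, keepdims=True)
curr = np.random.randn(5, 128); curr /= np.linalg.norm(curr, axis=1, keepdims=True)
matches = associate(prev, curr)   # list of (prev_id, curr_id) pairs for this time step
```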
A Dynamic Camera with Multi-modal Input Signal Fusion
Human perception
┌────────────────────┐
│  Vestibular        │
│  Vision            │
└────────────────────┘
          ▲
          │
┌─────────┴───────────────────┐
│ Wearable System Estimation  │
└─────────┬───────────────────┘
          │
┌────────┬────────┬─────────┬─────────┬────────┐
│ Camera │  IMU   │ Eye     │ Depth   │ Others │
│        │        │ tracker │ / ToF   │        │
└────────┴────────┴─────────┴─────────┴────────┘
Core Formulation: Bayesian Multi-Modal Sensor Fusion
Latent State Definition
- At time step $t$, the latent state is defined as $x_t = \{ T_t, \theta, \psi_t \}$,
- where $T_t$ denotes the device pose, $\theta$ represents the calibration parameters shared across time, and $\psi_t$ denotes user-centric latent variables.
Multi-Modal Observations
- Given heterogeneous sensor measurements at time $t$: $z_t = \{ z_t^{cam}, z_t^{imu}, z_t^{eye} \}$,
- where observations are obtained from the camera, IMU, and eye-tracking modalities.
Bayesian Fusion Objective
- Multi-modal fusion is defined as inference over the joint posterior $p(x_{1:T} \mid z_{1:T})$.
- Using the Markov assumption and conditional independence of observations, the posterior factorizes as $p(x_{1:T} \mid z_{1:T}) \propto \prod_{t=1}^{T} p(z_t \mid x_t)\, p(x_t \mid x_{t-1})$.
Multi-Modal Likelihood Factorization
- Assuming conditional independence between sensor modalities given the latent state: $p(z_t \mid x_t) = p(z_t^{cam} \mid x_t)\, p(z_t^{imu} \mid x_t)\, p(z_t^{eye} \mid x_t)$
State Transition Model
- The temporal evolution of the latent state is modeled as $p(x_t \mid x_{t-1}) = p(T_t \mid T_{t-1})\, p(\psi_t \mid \psi_{t-1})\, p(\theta)$,
- where $\theta$ is treated as a time-invariant latent variable and $p(\theta)$ enforces temporal consistency of the calibration parameters.
Interpretation
- Fusion thus corresponds to Bayesian state estimation under uncertainty, where heterogeneous sensor observations impose probabilistic constraints on a shared latent state evolving over time. Calibration parameters are inferred jointly with pose and user states, enabling online self-calibration.
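As a worked illustration of the factorization above, the sketch below (assuming NumPy and Gaussian sensor noise) evaluates the log of the factorized posterior as a sum of per-modality log-likelihoods plus Markov transition terms; `gaussian_loglik`, `log_posterior`, and the toy usage values are illustrative, not part of the formulation.

```python
# Sketch (assumption: NumPy, Gaussian noise): evaluate
# log p(x_{1:T} | z_{1:T}) = sum_t [ sum_m log p(z_t^m | x_t) + log p(x_t | x_{t-1}) ] + const.
import numpy as np

def gaussian_loglik(r, cov):
    """log N(r; 0, cov) for a residual vector r."""
    k = r.shape[0]
    return -0.5 * (r @ np.linalg.solve(cov, r)
                   + np.log(np.linalg.det(cov)) + k * np.log(2.0 * np.pi))

def log_posterior(states, observations, h, obs_cov, f, trans_cov):
    """states: list of x_t; observations: list of dicts {modality: z_t^m};
    h[m](x_t): per-modality measurement model; f(x_{t-1}): transition mean."""
    total = 0.0
    for t, (x_t, z_t) in enumerate(zip(states, observations)):
        for m, z in z_t.items():                        # conditional independence across modalities
            total += gaussian_loglik(z - h[m](x_t), obs_cov[m])
        if t > 0:                                       # Markov transition term p(x_t | x_{t-1})
            total += gaussian_loglik(x_t - f(states[t - 1]), trans_cov)
    return total

# Toy usage with scalar states and identity models (hypothetical numbers):
states = [np.array([0.0]), np.array([0.1])]
obs = [{"cam": np.array([0.05])}, {"cam": np.array([0.12])}]
lp = log_posterior(states, obs, {"cam": lambda x: x}, {"cam": np.eye(1)}, lambda x: x, np.eye(1))
```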
Sensor Models
- $z_t^{imu} = h_{imu}(T_{t-1}, T_t) + \epsilon_{imu}$
- $z_t^{cam} = h_{cam}(T_t, \theta) + \epsilon_{cam}$
- $z_t^{eye} = h_{eye}(T_t, \psi_t) + \epsilon_{eye}$
Filtering Approximation
For online inference, we approximate the posterior using Bayesian filtering.
- Prediction: $p(x_t \mid z_{1:t-1}) = \int p(x_t \mid x_{t-1}) p(x_{t-1} \mid z_{1:t-1}) dx_{t-1}$
- Update: $p(x_t \mid z_{1:t}) \propto p(z_t \mid x_t) p(x_t \mid z_{1:t-1})$
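A minimal sketch of this prediction/update recursion under a Gaussian (EKF-style) approximation, assuming NumPy; `f`, `h` and their Jacobians `F`, `H` stand in for the motion model and the per-modality sensor models $h_{cam}, h_{imu}, h_{eye}$ above and must be supplied by the user.

```python
# Sketch (assumption: NumPy): Gaussian filtering with user-supplied models.
import numpy as np

def predict(mu, P, f, F, Q):
    """p(x_t | z_{1:t-1}): propagate the Gaussian belief through the dynamics."""
    mu_pred = f(mu)
    P_pred = F(mu) @ P @ F(mu).T + Q
    return mu_pred, P_pred

def update(mu_pred, P_pred, z, h, H, R):
    """p(x_t | z_{1:t}): fold one sensor measurement into the predicted belief."""
    Hm = H(mu_pred)
    S = Hm @ P_pred @ Hm.T + R                  # innovation covariance
    K = P_pred @ Hm.T @ np.linalg.inv(S)        # Kalman gain
    mu = mu_pred + K @ (z - h(mu_pred))         # corrected mean
    P = (np.eye(len(mu)) - K @ Hm) @ P_pred     # corrected covariance
    return mu, P

# Multi-modal fusion: within one time step, call update() once per sensor
# (camera, IMU, eye tracker), reusing the running (mu, P).
```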
Practical Filtering Choices under Self-Calibrated Camera Constraints
| Method | Inference Principle | Handles High-Dimensional State | Real-Time / Online | Geometric Interpretability | Typical Failure Mode | Suitability for Your Pipeline |
|---|---|---|---|---|---|---|
| Full Bayesian Filtering | Exact posterior inference $p(x_t \mid z_{1:t})$ | No (intractable) | No | Theoretically yes | Intractable integrals | Not usable (theoretical reference only) |
| Particle Filter | Sampling-based Bayesian inference | Poor (curse of dimensionality) | No | Weak (implicit geometry) | Sample degeneracy | Not suitable |
| Kalman Filter (KF) | Linear-Gaussian Bayesian inference | Moderate | Yes | Strong (explicit states) | Model mismatch | Suitable (baseline) |
| Extended Kalman Filter (EKF) | Local linearization of nonlinear models | Moderate–High | Yes | Strong | Linearization error | Well suited |
| Unscented Kalman Filter (UKF) | Sigma-point approximation | Moderate | Borderline | Strong | Computational cost | Borderline |
| Information Filter | KF in information (precision) form | High | Yes | Strong | Numerical instability | Suitable |
| Factor Graph / Smoothing | MAP estimation over state graph | High | Semi-online | Very strong | Latency / memory | Well suited (geometry modules) |
| Continuous-Time Filters | Trajectory as continuous function | High | Yes | Strong | Model complexity | Well suited |
| Variational Bayesian Filters | Approximate posterior optimization | High | No | Weak–Moderate | Approximation bias | Not suitable |
| Neural / Learned Filters | Learned belief update | High | Yes | Weak (opaque) | Geometry drift | Not suitable as the core filter |
Method Selection Is Constraint-Driven, Not Aesthetic
| Hard Constraint | Practical Interpretation | Technical Implication |
|---|---|---|
| Online, real-time, low latency | The system runs on a wearable device worn by a human user. End-to-end latency above tens of milliseconds leads to motion sickness and an unacceptable user experience. | Any method that is offline, batch-only, or exhibits unstable latency is infeasible and must be excluded. |
| High-dimensional continuous state space | The system state includes not only camera pose but also velocity, IMU biases, camera intrinsics and extrinsics, and temporal offsets between sensors. | The resulting state space is high-dimensional, continuous, and strongly nonlinear, making general inference methods computationally intractable. |
| Geometric honesty and interpretability | Solutions must be physically and geometrically valid, not merely visually plausible. Calibration parameters must correspond to real camera models and be diagnosable when errors occur. | Methods that produce visually convincing but geometrically inconsistent results are unacceptable. Explicit state representation and interpretable uncertainty are required. |
Why Gaussian
- For Closure Under Bayesian Operations
- Bayesian filtering requires two fundamental operations that are applied recursively over time.
- Prediction
- The prediction step propagates the belief forward in time using the system dynamics:
$p(x_t \mid z_{1:t-1}) = \int p(x_t \mid x_{t-1}) p(x_{t-1} \mid z_{1:t-1}) dx_{t-1}$
- Update
- The update step incorporates the new observation into the predicted belief:
$p(x_t \mid z_{1:t}) \propto p(z_t \mid x_t)\, p(x_t \mid z_{1:t-1})$
- Gaussian distributions possess a crucial closure property under these Bayesian operations:
- The product of two Gaussian distributions is Gaussian.
- The marginalization of a joint Gaussian distribution is Gaussian.
- As a consequence:
- The prediction step preserves Gaussianity.
- The update step preserves Gaussianity.
- Without this closure property, the posterior distribution does not remain in a tractable functional family, and Bayesian filtering becomes analytically intractable.
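A tiny numeric illustration of the closure property (assuming NumPy): multiplying two 1-D Gaussian densities yields another Gaussian whose precision is the sum of the input precisions; the helper name and example numbers are illustrative.

```python
# Sketch (assumption: NumPy): closure of Gaussians under the Bayesian update step.
import numpy as np

def gaussian_product(mu1, var1, mu2, var2):
    """N(x; mu1, var1) * N(x; mu2, var2) is proportional to N(x; mu, var)."""
    prec = 1.0 / var1 + 1.0 / var2            # precisions add
    var = 1.0 / prec
    mu = var * (mu1 / var1 + mu2 / var2)      # precision-weighted mean
    return mu, var

# A Gaussian prior fused with a Gaussian likelihood stays Gaussian:
mu_post, var_post = gaussian_product(0.0, 4.0, 1.0, 1.0)   # -> (0.8, 0.8)
```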
Bayesian Filter, Kalman Filter, Gaussian Distribution
┌──────────────────────────────────────────────────────┐
│                  Bayesian Filtering                  │
│                                                      │
│  p(x_t | z_{1:t}) ∝ p(z_t | x_t) p(x_t | z_{1:t-1})  │
│                                                      │
│  • General probabilistic inference framework         │
│  • Arbitrary distributions                           │
│  • Arbitrary nonlinear dynamics                      │
│  • Arbitrary observation models                      │
│                                                      │
│  (Intractable in general)                            │
└──────────────────────────┬───────────────────────────┘
                           │
                           │ Gaussian assumption
                           ▼
┌──────────────────────────────────────────────────────┐
│                Kalman-style Filtering                │
│                                                      │
│  Assumption:                                         │
│    p(x_t | z_{1:t}) ≈ 𝒩(μ_t, Σ_t)                    │
│                                                      │
│  • Posterior represented only by mean + cov          │
│  • Recursive closed-form updates                     │
│  • Efficient and online                              │
│                                                      │
│  Includes:                                           │
│    - Kalman Filter (linear)                          │
│    - EKF (local linearization)                       │
│    - UKF (sigma-point)                               │
└──────────────────────────┬───────────────────────────┘
                           │
                           │ Linear model + Gaussian noise
                           ▼
┌──────────────────────────────────────────────────────┐
│                    Kalman Filter                     │
│                                                      │
│  x_t = A x_{t-1} + w_t ,   w_t ~ 𝒩(0, Q)             │
│  z_t = H x_t + v_t ,       v_t ~ 𝒩(0, R)             │
│                                                      │
│  • Exact Bayesian inference                          │
│  • Optimal under linear-Gaussian assumptions         │
└──────────────────────────────────────────────────────┘
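A minimal sketch of one step of the linear Kalman filter in the box above, assuming NumPy; the constant-velocity matrices at the bottom are a hypothetical toy example, not a model from this project.

```python
# Sketch (assumption: NumPy): one KF step for x_t = A x_{t-1} + w_t, z_t = H x_t + v_t,
# with w ~ N(0, Q) and v ~ N(0, R).
import numpy as np

def kf_step(mu, P, z, A, H, Q, R):
    # Prediction (motion model)
    mu_pred = A @ mu
    P_pred = A @ P @ A.T + Q
    # Update (sensor fusion)
    S = H @ P_pred @ H.T + R
    K = P_pred @ H.T @ np.linalg.inv(S)
    mu_new = mu_pred + K @ (z - H @ mu_pred)
    P_new = (np.eye(len(mu)) - K @ H) @ P_pred
    return mu_new, P_new

# Toy constant-velocity example: state [position, velocity], position-only measurement.
A = np.array([[1.0, 1.0], [0.0, 1.0]]); H = np.array([[1.0, 0.0]])
mu, P = np.zeros(2), np.eye(2)
mu, P = kf_step(mu, P, np.array([0.9]), A, H, Q=0.01 * np.eye(2), R=np.array([[0.1]]))
```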
For Each Time Step
  Time t-1 belief           Prediction              Update
 (posterior at t-1)       (motion model)        (sensor fusion)

 p(x_{t-1}|z_{1:t-1})     p(x_t|z_{1:t-1})       p(x_t|z_{1:t})
 ~ 𝒩(μ_{t-1}, Σ_{t-1}) →  ~ 𝒩(μ_t^-, Σ_t^-)  →   ~ 𝒩(μ_t, Σ_t)
          │                      │                     │
          ▼                      ▼                     ▼
  Gaussian belief         Gaussian prediction    Gaussian posterior
Multiple sensors = multiple Gaussian constraints on the same state
                      z_t^cam
                (camera likelihood)
                         │
                         ▼
                    ┌─────────┐
 z_t^imu ──────────▶│  x_t    │◀────────── z_t^eye
 (IMU likelihood)   │ latent  │   (eye-tracking likelihood)
                    │ state   │
                    └─────────┘
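A small sketch of this picture in information (precision) form, assuming NumPy and linear-Gaussian measurement models: each sensor contributes a precision term and an information-vector term to the same latent state. The function name `fuse_constraints` and the scalar example are illustrative.

```python
# Sketch (assumption: NumPy): multiple Gaussian measurement constraints on one state,
# fused in information form; precisions and information vectors simply add.
import numpy as np

def fuse_constraints(prior_mu, prior_P, measurements):
    """measurements: list of (z, H, R) tuples, one per sensor modality."""
    info = np.linalg.inv(prior_P)              # prior precision
    vec = info @ prior_mu                      # prior information vector
    for z, H, R in measurements:
        Rinv = np.linalg.inv(R)
        info += H.T @ Rinv @ H                 # each sensor adds precision
        vec += H.T @ Rinv @ z                  # and an information contribution
    P = np.linalg.inv(info)
    return P @ vec, P                          # fused posterior mean and covariance

# Two hypothetical scalar sensors observing the same 1-D state:
mu, P = fuse_constraints(np.zeros(1), np.eye(1),
                         [(np.array([1.0]), np.eye(1), 0.5 * np.eye(1)),
                          (np.array([0.8]), np.eye(1), 0.5 * np.eye(1))])
```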
Best Normalization
| Data Distribution Characteristics | Method | Formula | Core Assumption |
|---|---|---|---|
| Gaussian-like Distribution | Standard z-score normalization | $z = \dfrac{x - \mu}{\sigma}$ | Most data points are concentrated near the mean; few outliers exist. |
| Skewed or Heavy-Tailed Distribution | Robust z-score (Median + MAD) | $z = \dfrac{x - \mathrm{median}}{\mathrm{MAD}}$ | Extreme values exist; the median provides a more stable estimate. |
| Bounded Values (0–1, Ratio-type Data) | Min–Max normalization | $x' = \dfrac{x - x_{\min}}{x_{\max} - x_{\min}}$ | Data lies within a fixed range; preserving proportional relationships is important. |
| Log-Normal or Multiplicative Noise Data | Log transform + z-score | $\log(x)$ or $\log(1 + x)$, then z-score | Noise varies multiplicatively; log transformation linearizes it. |
| Mixed Noise or Asymmetric Distributions | Quantile normalization / Rank transform | $x \mapsto \mathrm{rank}(x)$ or quantile mapping | Exact values are less important; relative ordering matters. |
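A minimal sketch of the normalization options in the table, assuming NumPy; the `axis` argument allows per-channel application (relevant to the brain-signal case below), and the function names are illustrative.

```python
# Sketch (assumption: NumPy): normalization choices from the table.
import numpy as np

def zscore(x, axis=-1):
    return (x - x.mean(axis=axis, keepdims=True)) / x.std(axis=axis, keepdims=True)

def robust_zscore(x, axis=-1):
    med = np.median(x, axis=axis, keepdims=True)
    mad = np.median(np.abs(x - med), axis=axis, keepdims=True)  # median absolute deviation
    return (x - med) / mad   # a 1.4826 factor is often added to make MAD comparable to sigma

def minmax(x, axis=-1):
    lo = x.min(axis=axis, keepdims=True); hi = x.max(axis=axis, keepdims=True)
    return (x - lo) / (hi - lo)

# Heavy-tailed multichannel signal (channels x time): per-channel robust z-score.
signal = np.random.standard_t(df=2, size=(8, 1000))
norm = robust_zscore(signal, axis=1)
```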
Brain Signals (Why Median + MAD)
| Property | Meaning | Impact |
|---|---|---|
| Non-stationary | The mean varies across time and sessions | Mean and standard deviation become unstable |
| Heavy-tailed distribution | Strong artifacts or high-amplitude spikes | Standard deviation is inflated by outliers |
| Weak signal + mixed noise | High-frequency oscillations + low-frequency drift | Large mean variation, clear skewness |
| Inter-channel variation | Each sensor has different sensitivity | Requires independent per-channel normalization |
A. Unresolved Core Problems in Modern CVPR (Post-Deep Learning Era)
| Problem Area | What the Problem Really Is | Why It Is Still Unsolved | How CVPR Papers Currently Cope | Why Best Papers Still Miss It |
|---|---|---|---|---|
| Continuous-time modeling | Vision models are fundamentally discrete, but the world is continuous in time | Continuous-time inference is mathematically harder; requires differential equations and observability theory | Discretization, frame aggregation, splines, heuristic interpolation | Best papers optimize within discretized assumptions instead of fixing the time model |
| Temporal causality | Models confuse correlation across frames with causal structure | Causality requires intervention and counterfactual reasoning, not passive data | Self-supervision, temporal contrastive losses | These methods improve prediction, not causal understanding |
| Identifiability | Whether the true scene/state is uniquely recoverable from data | Identifiability depends on geometry, noise, and sensor configuration | Overparameterization hides non-identifiability | Best papers report accuracy, not whether the solution is meaningful |
| Geometryβlearning consistency | Learned representations often violate geometric invariants | Neural networks lack built-in structure preservation | Add geometry as loss terms or regularizers | Geometry is treated as decoration, not a first-class constraint |
| Probabilistic correctness | Most "uncertainty" estimates are not valid probabilities | Proper probabilistic modeling is expensive and restrictive | Softmax scores, Monte Carlo dropout | Best papers optimize calibration metrics without true probabilistic guarantees |
| Sensor modeling | Real sensors are nonlinear, asynchronous, and imperfect | Accurate sensor models complicate learning pipelines | Synthetic data, simplified sensor assumptions | Papers assume idealized sensors to keep benchmarks manageable |
| Scale vs. meaning | Scaling improves performance without improving understanding | Optimization rewards accuracy, not interpretability | Larger models, more data | Best papers often demonstrate scale, not conceptual progress |
| Benchmark validity | Benchmarks measure proxies, not the intended task | Ground truth is often ill-defined or biased | Dataset curation and metric tuning | Best papers win benchmarks without questioning what they measure |
| Failure characterization | Knowing when and why a model fails | Requires negative results and adversarial analysis | Ignore rare or hard cases | Best papers are structurally biased against failure analysis |
| Generalization guarantees | Performance outside training distribution | Distribution shift is unavoidable in vision | Domain adaptation, augmentation | These mitigate but do not solve the theoretical problem |
| Multi-sensor fusion theory | How heterogeneous sensors should be fused optimally | Requires unified state-space and noise models | Late fusion, learned fusion | Fusion is learned empirically, not derived |
| Inverse problems under learning | Whether learned inverses are stable and well-posed | Inverse problems are often ill-posed by nature | Implicit regularization via networks | Best papers rely on empirical stability, not proofs |
| Long-horizon reasoning | Understanding scenes over long time spans | Error accumulation and memory limits | Sliding windows, recurrent modules | Best papers focus on short-term tasks |
| Physical consistency | Ensuring predictions obey physical laws | Physics constraints are hard to encode differentiably | Physics-informed losses | Usually approximate and task-specific |
| Evaluation under ambiguity | Multiple valid interpretations of the same scene | Ground truth often assumes a single answer | Pick one label or average | Best papers collapse ambiguity instead of modeling it |
B. Key Meta-Observation (Critical)
| Observation | Explanation |
|---|---|
| These are not "missing tricks" | They are structural modeling problems |
| They predate deep learning | Many come from 1950β2000 math/physics |
| Best papers optimize within broken assumptions | They rarely question the assumptions themselves |
| Solving them reduces leaderboard gains | Which is why incentives avoid them |
| They require saying "this task is ill-posed" | CVPR culture discourages this |
Status Overview
| Company | Primary Motivation | What They Do Today | What They Explicitly Do NOT Do | Why They Stop There |
|---|---|---|---|---|
| Apple | Product reliability, AR UX | Factory calibration, tight hardware control, limited runtime correction (ISP, ARKit) | No general online re-calibration of intrinsics/extrinsics | System risk, cost, consumer tolerance, closed ecosystem |
| Google | Developer platform, ML-first vision | ARCore runtime estimation, ML-based geometric compensation | No metric-accurate, device-level self-calibration | Prioritizes ML robustness over geometric correctness |
| Meta | Social AR, avatar realism | Per-session tracking calibration for AR effects | No persistent, long-term calibration across time | Focus on perceptual realism, not physical accuracy |
| Microsoft | Enterprise AR, robotics | Device-specific calibration pipelines (HoloLens) | No general-purpose consumer-scale solution | Enterprise-only scale, controlled hardware |
| Amazon | Commerce, logistics | Robotics calibration in warehouses | No mobile-device-facing solution | Domain-specific, not platform-oriented |
| Qualcomm | Chip enablement | ISP tuning, sensor fusion hooks | No system-level calibration ownership | Sells silicon, not end-to-end systems |
What Is Fundamentally Missing
| Missing Capability | Status |
|---|---|
| Online intrinsic re-estimation | Not shipped by any |
| Target-free calibration | Research-only |
| Long-term temporal consistency | Not addressed |
| Cross-camera self-consistency | Partial hacks only |
| System-level ownership | No clear owner |
How ML Makes Camera Errors More Dangerous
| Stage | What Happens | Why It Is Dangerous |
|---|---|---|
| Geometry is wrong | Camera intrinsics or extrinsics drift | The physical reference frame is no longer correct |
| ML compensates | Neural networks adapt and mask errors | Errors are hidden instead of detected |
| System appears to work | Outputs look plausible to users and metrics | No obvious failure signal is triggered |
| Metrics pass | Task-level KPIs remain within tolerance | Validation does not detect geometric inconsistency |
| Lost signal | Geometric consistency is no longer enforced | The system loses its primary correctness alarm |
| Result | System does not know it is wrong | Errors become silent, global, and compounding |
Camera as the Global Reference Frame in Vision Systems
| Module | What It Depends On |
|---|---|
| SLAM | Camera intrinsics and extrinsics |
| Augmented Reality (AR) | Camera coordinate frame |
| Depth / Stereo | Multi-camera geometric consistency |
| Sensor Fusion | CameraβIMU extrinsic calibration |
| Robotics | Mapping between camera frame and world frame |
Camera Calibration Core Definition
| Concept | Meaning |
|---|---|
| Calibration | Estimating the mapping between 3D world coordinates and 2D image measurements |
| Intrinsics | Parameters internal to the camera (focal length, principal point, distortion) |
| Extrinsics | Rigid transformation between camera and world (or other sensors) |
| Camera Registration | Estimation of the rigid pose (rotation & translation) of a camera relative to another reference (e.g., another camera, a world frame, or a sensor) |
| Assumption (classical) | Camera parameters are static and known |
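To make the intrinsics/extrinsics split concrete, here is a minimal sketch of the classical pinhole mapping from 3D world coordinates to 2D pixel measurements, assuming NumPy and no lens distortion; the numeric values of K, R, and t are hypothetical.

```python
# Sketch (assumption: NumPy, ideal pinhole camera, no distortion):
# calibration = the mapping 3D world point -> 2D pixel via extrinsics (R, t) and intrinsics K.
import numpy as np

def project(X_world, K, R, t):
    """X_world: (N, 3) world points -> (N, 2) pixel coordinates."""
    X_cam = (R @ X_world.T + t[:, None]).T        # extrinsics: world frame -> camera frame
    x = X_cam[:, :2] / X_cam[:, 2:3]              # perspective division
    uv = (K[:2, :2] @ x.T + K[:2, 2:3]).T         # intrinsics: focal lengths + principal point
    return uv

K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])                   # fx, fy, cx, cy (hypothetical values)
R, t = np.eye(3), np.zeros(3)                     # extrinsics (hypothetical: camera at world origin)
uv = project(np.array([[0.1, -0.2, 2.0]]), K, R, t)   # pixel location of one 3D point
```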