2026 - Master Thesis - Space

LiDAR Free, 4D Brain Mapping, PRS


Topics


Coding


References


Multi-sensor Input Fusion From Space, Safety Detection


Topics

0. Sensor Modalities and Data Types

| Modality | Sensor Type | Data Representation |
| --- | --- | --- |
| Optical | Visible-light satellite camera | 3-channel RGB image (8-bit) |
| SAR | Synthetic Aperture Radar | 1-channel SAR image (32-bit float) |


1. Maritime Search and Rescue

Optical satellite images
+ SAR satellite images
β†’ Ship Detection
β†’ Ship Re-Identification (ReID)
β†’ Trajectory generation & route prediction
| Platform | Strength | Fundamental Limitation |
| --- | --- | --- |
| GEO satellites | Wide coverage, high temporal resolution | Low spatial resolution |
| Video satellites | High spatial & temporal resolution | Short duration, small coverage |
| AIS-based systems | Accurate identity info | Only works for cooperative targets |

| Axis | Examples |
| --- | --- |
| Sensors | Optical, SAR, LiDAR, multispectral |
| Tasks | Detection, ReID, tracking, mapping |
| Scale | Local → Global |
| Time | Snapshot → Long-term monitoring |


2. Input Data Type

| Modality | Data Type | Format |
| --- | --- | --- |
| Optical | RGB image | 3-channel, 8-bit TIF |
| SAR | Radar backscatter | 1-channel, 32-bit float TIF |
| Geometry | Ship size (derived) | Numeric vector (length, width, aspect ratio) |
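
A minimal sketch of loading one optical / SAR pair in the formats listed above. The tifffile dependency, file paths, and ship dimensions are illustrative assumptions, not part of the actual data pipeline.

```python
# Minimal sketch: load one optical / SAR image pair as described in the table.
# Assumes the tifffile package; file paths and ship dimensions are placeholders.
import numpy as np
import tifffile

optical = tifffile.imread("scene_optical.tif")   # (H, W, 3), uint8 RGB
sar = tifffile.imread("scene_sar.tif")           # (H, W), float32 backscatter

# Cast optical to float in [0, 1] so both modalities share a numeric range.
optical = optical.astype(np.float32) / 255.0

# Derived geometric features for a detected ship (length, width, aspect ratio).
length_m, width_m = 180.0, 28.0
geometry = np.array([length_m, width_m, length_m / width_m], dtype=np.float32)
```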


3. Fusion Space

Optical image ─┐
               β”œβ”€ Dual-head tokenizer β†’ Shared Transformer Encoder β†’ Unified embedding
SAR image     β”€β”˜
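
A minimal PyTorch sketch of the diagram above: two modality-specific tokenizers (one per input head) feeding a single shared Transformer encoder that produces a unified embedding. All layer sizes, class names, and the mean-pooling readout are illustrative assumptions, not a fixed architecture choice.

```python
# Sketch of the dual-head tokenizer + shared Transformer encoder fusion idea.
# Dimensions and names are illustrative assumptions.
import torch
import torch.nn as nn

class DualHeadFusion(nn.Module):
    def __init__(self, dim=256, patch=16, depth=4, heads=8):
        super().__init__()
        # Dual-head tokenizer: one patch-embedding head per modality.
        self.tok_opt = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)  # optical (RGB)
        self.tok_sar = nn.Conv2d(1, dim, kernel_size=patch, stride=patch)  # SAR
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, opt_img, sar_img):
        # Tokenize each modality, flatten the spatial grid into a token sequence.
        t_opt = self.tok_opt(opt_img).flatten(2).transpose(1, 2)  # (B, N, dim)
        t_sar = self.tok_sar(sar_img).flatten(2).transpose(1, 2)  # (B, M, dim)
        tokens = torch.cat([t_opt, t_sar], dim=1)                 # shared token sequence
        encoded = self.encoder(tokens)
        return encoded.mean(dim=1)                                # unified embedding

emb = DualHeadFusion()(torch.rand(2, 3, 224, 224), torch.rand(2, 1, 224, 224))
```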


4. Output Data

| Stage | Output Used |
| --- | --- |
| ReID | Feature distance matrix |
| Tracking | Identity association |
| Trajectory | Time-ordered identity matches |
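
A minimal sketch of the ReID → tracking step in the table above: a cosine-distance matrix between detection embeddings from two satellite passes, followed by optimal identity association. The embeddings are random placeholders, the 0.5 gating threshold is an arbitrary assumption, and scipy is assumed to be available.

```python
# ReID distance matrix + identity association; embeddings are placeholders.
import numpy as np
from scipy.optimize import linear_sum_assignment

def cosine_distance(a, b):
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return 1.0 - a @ b.T                      # (num_a, num_b) distance matrix

emb_pass1 = np.random.rand(5, 128)            # ships detected at pass t1
emb_pass2 = np.random.rand(6, 128)            # ships detected at pass t2

dist = cosine_distance(emb_pass1, emb_pass2)  # ReID: feature distance matrix
rows, cols = linear_sum_assignment(dist)      # Tracking: identity association

# Trajectory: time-ordered identity matches (pass1 index -> pass2 index),
# keeping only associations below an assumed distance threshold.
matches = [(int(i), int(j)) for i, j in zip(rows, cols) if dist[i, j] < 0.5]
```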


A Dynamic Camera with Multi-modal Input Signal Fusion

          Human perception
        β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
        β”‚  Vestibular      β”‚
        β”‚  Vision          β”‚
        β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                 β–²
                 β”‚
    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
    β”‚ Wearable System Estimation  β”‚
    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                 β”‚
 β”Œβ”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”
 β”‚ Cameraβ”‚  IMU   β”‚ Eye    β”‚ Depth  β”‚ Others β”‚
 β”‚       β”‚        β”‚ trackerβ”‚ / ToF  β”‚        β”‚
 β””β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”˜


Core Formulation: Bayesian Multi-Modal Sensor Fusion

Latent State Definition

  • At time step t, the latent state is defined as: $x_t = \{ T_t, \theta, \psi_t \}$

  • where $T_t$ denotes the device pose, $\theta$ represents the calibration parameters shared across time, and $\psi_t$ denotes user-centric latent variables.

Multi-Modal Observations

  • Given heterogeneous sensor measurements at time t: $z_t = \{ z_t^{cam}, z_t^{imu}, z_t^{eye} \}$

  • where observations are obtained from the camera, IMU, and eye-tracking modalities.

Bayesian Fusion Objective

  • Multi-modal fusion is defined as inference over the joint posterior: $p(x_{1:T} \mid z_{1:T})$

  • Using the Markov assumption and conditional independence of observations, the posterior factorizes as: $p(x_{1:T} \mid z_{1:T}) \propto \prod_{t=1}^{T} p(z_t \mid x_t)\, p(x_t \mid x_{t-1})$

Multi-Modal Likelihood Factorization

  • Assuming conditional independence between sensor modalities given the latent state: $p(z_t \mid x_t) = p(z_t^{cam} \mid x_t)\, p(z_t^{imu} \mid x_t)\, p(z_t^{eye} \mid x_t)$

State Transition Model

  • The temporal evolution of the latent state is modeled as: $p(x_t \mid x_{t-1}) = p(T_t \mid T_{t-1})\, p(\psi_t \mid \psi_{t-1})\, p(\theta)$

  • where $\theta$ is treated as a time-invariant latent variable, and $p(\theta)$ enforces temporal consistency of the calibration parameters across time steps.

Interpretation

  • Fusion thus corresponds to Bayesian state estimation under uncertainty, where heterogeneous sensor observations impose probabilistic constraints on a shared latent state evolving over time. Calibration parameters are inferred jointly with pose and user states, enabling online self-calibration.

Sensor Models

  • $z_t^{imu} = h_{imu}(T_{t-1}, T_t) + \epsilon_{imu}$
  • $z_t^{cam} = h_{cam}(T_t, \theta) + \epsilon_{cam}$
  • $z_t^{eye} = h_{eye}(T_t, \psi_t) + \epsilon_{eye}$

Filtering Approximation

For online inference, we approximate the posterior using Bayesian filtering.

  • Prediction: $p(x_t \mid z_{1:t-1}) = \int p(x_t \mid x_{t-1}) p(x_{t-1} \mid z_{1:t-1}) dx_{t-1}$
  • Update: $p(x_t \mid z_{1:t}) \propto p(z_t \mid x_t) p(x_t \mid z_{1:t-1})$


Practical Filtering Choices under Self-Calibrated Camera Constraints

| Method | Inference Principle | Handles High-Dimensional State | Real-Time / Online | Geometric Interpretability | Typical Failure Mode | Suitability for This Pipeline |
| --- | --- | --- | --- | --- | --- | --- |
| Full Bayesian Filtering | Exact posterior inference $p(x_t \mid z_{1:t})$ | No (intractable) | No | Theoretically yes | Intractable integrals | ✗ (theoretical only) |
| Particle Filter | Sampling-based Bayesian inference | Poor (curse of dimensionality) | No | Weak (implicit geometry) | Sample degeneracy | ✗ |
| Kalman Filter (KF) | Linear-Gaussian Bayesian inference | Moderate | Yes | Strong (explicit states) | Model mismatch | ✓ (baseline) |
| Extended Kalman Filter (EKF) | Local linearization of nonlinear models | Moderate–High | Yes | Strong | Linearization error | ✓✓ |
| Unscented Kalman Filter (UKF) | Sigma-point approximation | Moderate | Borderline | Strong | Computational cost | △ |
| Information Filter | KF in information (precision) form | High | Yes | Strong | Numerical instability | ✓ |
| Factor Graph / Smoothing | MAP estimation over state graph | High | Semi-online | Very strong | Latency / memory | ✓✓ (geometry modules) |
| Continuous-Time Filters | Trajectory as continuous function | High | Yes | Strong | Model complexity | ✓✓ |
| Variational Bayesian Filters | Approximate posterior optimization | High | No | Weak–Moderate | Approximation bias | ✗ |
| Neural / Learned Filters | Learned belief update | High | Yes | Weak (opaque) | Geometry drift | ✗ (as core filter) |


Method Selection Is Constraint-Driven, Not Aesthetic

| Hard Constraint | Practical Interpretation | Technical Implication |
| --- | --- | --- |
| Online, real-time, low latency | The system runs on a wearable device worn by a human user. End-to-end latency above tens of milliseconds leads to motion sickness and an unacceptable user experience. | Any method that is offline, batch-only, or exhibits unstable latency is infeasible and must be excluded. |
| High-dimensional continuous state space | The system state includes not only camera pose but also velocity, IMU biases, camera intrinsics and extrinsics, and temporal offsets between sensors. | The resulting state space is high-dimensional, continuous, and strongly nonlinear, making general inference methods computationally intractable. |
| Geometric honesty and interpretability | Solutions must be physically and geometrically valid, not merely visually plausible. Calibration parameters must correspond to real camera models and be diagnosable when errors occur. | Methods that produce visually convincing but geometrically inconsistent results are unacceptable. Explicit state representation and interpretable uncertainty are required. |


Why Gaussian

  • Gaussian distributions are chosen because they are closed under the Bayesian operations below.
  • Bayesian filtering requires two fundamental operations that are applied recursively over time.

  • Prediction
  • The prediction step propagates the belief forward in time using the system dynamics:

$p(x_t \mid z_{1:t-1}) = \int p(x_t \mid x_{t-1}) p(x_{t-1} \mid z_{1:t-1}) dx_{t-1}$

  • Update
  • The update step incorporates the new observation into the predicted belief:

$p(x_t \mid z_{1:t}) \propto p(z_t \mid x_t)\, p(x_t \mid z_{1:t-1})$


  • Gaussian distributions possess a crucial closure property under these Bayesian operations:
    • The product of two Gaussian distributions is Gaussian.
    • The marginalization of a joint Gaussian distribution is Gaussian.
  • As a consequence:
    • The prediction step preserves Gaussianity.
    • The update step preserves Gaussianity.
  • Without this closure property, the posterior distribution does not remain in a tractable functional family, and Bayesian filtering becomes analytically intractable.
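
As a concrete one-dimensional instance of this closure property (a standard identity, stated here for reference), the product of two Gaussians in $x$ is again Gaussian, with precision-weighted mean and reduced variance:

$\mathcal{N}(x \mid \mu_1, \sigma_1^2)\, \mathcal{N}(x \mid \mu_2, \sigma_2^2) \;\propto\; \mathcal{N}\!\left(x \;\middle|\; \dfrac{\sigma_2^2 \mu_1 + \sigma_1^2 \mu_2}{\sigma_1^2 + \sigma_2^2},\; \dfrac{\sigma_1^2 \sigma_2^2}{\sigma_1^2 + \sigma_2^2}\right)$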


Bayesian Filter, Kalman Filter, and Gaussian Distribution

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚               Bayesian Filtering              β”‚
β”‚                                               β”‚
β”‚  p(x_t | z_{1:t}) ∝ p(z_t | x_t) p(x_t | z_{1:t-1}) β”‚
β”‚                                               β”‚
β”‚  β€’ General probabilistic inference framework  β”‚
β”‚  β€’ Arbitrary distributions                    β”‚
β”‚  β€’ Arbitrary nonlinear dynamics               β”‚
β”‚  β€’ Arbitrary observation models               β”‚
β”‚                                               β”‚
β”‚        (Intractable in general)               β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                        β”‚
                        β”‚  Gaussian assumption
                        β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚            Kalman-style Filtering             β”‚
β”‚                                               β”‚
β”‚  Assumption:                                  β”‚
β”‚  p(x_t | z_{1:t}) β‰ˆ 𝒩(ΞΌ_t, Ξ£_t)              β”‚
β”‚                                               β”‚
β”‚  β€’ Posterior represented only by mean + cov   β”‚
β”‚  β€’ Recursive closed-form updates              β”‚
β”‚  β€’ Efficient and online                       β”‚
β”‚                                               β”‚
β”‚  Includes:                                    β”‚
β”‚   - Kalman Filter (linear)                    β”‚
β”‚   - EKF (local linearization)                 β”‚
β”‚   - UKF (sigma-point)                         β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                        β”‚
                        β”‚  Linear model + Gaussian noise
                        β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚               Kalman Filter                   β”‚
β”‚                                               β”‚
β”‚  x_t = A x_{t-1} + w_t ,   w_t ~ 𝒩(0, Q)     β”‚
β”‚  z_t = H x_t     + v_t ,   v_t ~ 𝒩(0, R)     β”‚
β”‚                                               β”‚
β”‚  β€’ Exact Bayesian inference                   β”‚
β”‚  β€’ Optimal under linear-Gaussian assumptions  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜


For Each Time Step

Time t-1 belief                    Prediction                    Update
(posterior at t-1)                 (motion model)               (sensor fusion)

   p(x_{t-1}|z_{1:t-1})             p(x_t|z_{1:t-1})             p(x_t|z_{1:t})
   ~ 𝒩(ΞΌ_{t-1}, Ξ£_{t-1})     β†’       ~ 𝒩(ΞΌ_t^-, Ξ£_t^-)     β†’       ~ 𝒩(ΞΌ_t, Ξ£_t)
                β”‚                              β”‚                              β”‚
                β”‚                              β”‚                              β”‚
                β–Ό                              β–Ό                              β–Ό
        Gaussian belief              Gaussian prediction            Gaussian posterior


Multiple sensors = multiple Gaussian constraints on the same state

                    z_t^cam
                 (camera likelihood)
                        β”‚
                        β–Ό
                   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”
z_t^imu ───────▢   β”‚  x_t   β”‚   ◀────── z_t^eye
(IMU likelihood)   β”‚ latent β”‚   (eye-tracking likelihood)
                   β”‚ state  β”‚
                   β””β”€β”€β”€β”€β”€β”€β”€β”€β”˜


Best Normalization

| Data Distribution Characteristics | Method | Formula | Core Assumption |
| --- | --- | --- | --- |
| Gaussian-like Distribution | Standard z-score normalization | $z = \dfrac{x - \mu}{\sigma}$ | Most data points are concentrated near the mean; few outliers exist. |
| Skewed or Heavy-Tailed Distribution | Robust z-score (Median + MAD) | $z = \dfrac{x - \mathrm{median}}{\mathrm{MAD}}$ | Extreme values exist; the median provides a more stable estimate. |
| Bounded Values (0–1, Ratio-type Data) | Min–Max normalization | $x' = \dfrac{x - x_{\min}}{x_{\max} - x_{\min}}$ | Data lies within a fixed range; preserving proportional relationships is important. |
| Log-Normal or Multiplicative Noise Data | Log transform + z-score | $\log(x)$ or $\log(1 + x)$ $\rightarrow$ z-score | Noise varies multiplicatively; log transformation linearizes it. |
| Mixed Noise or Asymmetric Distributions | Quantile normalization / Rank transform | $x \mapsto \mathrm{rank}(x)$ or quantile mapping | Exact values are less important; relative ordering matters. |


Brain Signals (Why Median + MAD)

| Property | Meaning | Impact |
| --- | --- | --- |
| Non-stationary | The mean varies across time and sessions | Mean and standard deviation become unstable |
| Heavy-tailed distribution | Strong artifacts or high-amplitude spikes | Standard deviation is inflated by outliers |
| Weak signal + mixed noise | High-frequency oscillations + low-frequency drift | Large mean variation, clear skewness |
| Inter-channel variation | Each sensor has different sensitivity | Requires independent per-channel normalization |
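
A minimal sketch of per-channel robust z-scoring (median + MAD), as motivated by the table above. The heavy-tailed toy data, the channel layout, and the epsilon guard are illustrative assumptions; the 1.4826 factor is the standard scaling that makes the MAD consistent with the standard deviation under Gaussian data.

```python
# Per-channel robust z-score normalization (median + MAD).
import numpy as np

def robust_zscore(x, axis=-1, eps=1e-9):
    """x: array of shape (channels, samples); each channel is normalized independently."""
    med = np.median(x, axis=axis, keepdims=True)
    mad = np.median(np.abs(x - med), axis=axis, keepdims=True)
    return (x - med) / (1.4826 * mad + eps)

signal = np.random.standard_t(df=3, size=(64, 5000))  # heavy-tailed toy channels
normalized = robust_zscore(signal)
```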


A. Unresolved Core Problems in Modern CVPR (Post-Deep Learning Era)

| Problem Area | What the Problem Really Is | Why It Is Still Unsolved | How CVPR Papers Currently Cope | Why Best Papers Still Miss It |
| --- | --- | --- | --- | --- |
| Continuous-time modeling | Vision models are fundamentally discrete, but the world is continuous in time | Continuous-time inference is mathematically harder; requires differential equations and observability theory | Discretization, frame aggregation, splines, heuristic interpolation | Best papers optimize within discretized assumptions instead of fixing the time model |
| Temporal causality | Models confuse correlation across frames with causal structure | Causality requires intervention and counterfactual reasoning, not passive data | Self-supervision, temporal contrastive losses | These methods improve prediction, not causal understanding |
| Identifiability | Whether the true scene/state is uniquely recoverable from data | Identifiability depends on geometry, noise, and sensor configuration | Overparameterization hides non-identifiability | Best papers report accuracy, not whether the solution is meaningful |
| Geometry–learning consistency | Learned representations often violate geometric invariants | Neural networks lack built-in structure preservation | Add geometry as loss terms or regularizers | Geometry is treated as decoration, not a first-class constraint |
| Probabilistic correctness | Most “uncertainty” estimates are not valid probabilities | Proper probabilistic modeling is expensive and restrictive | Softmax scores, Monte Carlo dropout | Best papers optimize calibration metrics without true probabilistic guarantees |
| Sensor modeling | Real sensors are nonlinear, asynchronous, and imperfect | Accurate sensor models complicate learning pipelines | Synthetic data, simplified sensor assumptions | Papers assume idealized sensors to keep benchmarks manageable |
| Scale vs. meaning | Scaling improves performance without improving understanding | Optimization rewards accuracy, not interpretability | Larger models, more data | Best papers often demonstrate scale, not conceptual progress |
| Benchmark validity | Benchmarks measure proxies, not the intended task | Ground truth is often ill-defined or biased | Dataset curation and metric tuning | Best papers win benchmarks without questioning what they measure |
| Failure characterization | Knowing when and why a model fails | Requires negative results and adversarial analysis | Ignore rare or hard cases | Best papers are structurally biased against failure analysis |
| Generalization guarantees | Performance outside the training distribution | Distribution shift is unavoidable in vision | Domain adaptation, augmentation | These mitigate but do not solve the theoretical problem |
| Multi-sensor fusion theory | How heterogeneous sensors should be fused optimally | Requires unified state-space and noise models | Late fusion, learned fusion | Fusion is learned empirically, not derived |
| Inverse problems under learning | Whether learned inverses are stable and well-posed | Inverse problems are often ill-posed by nature | Implicit regularization via networks | Best papers rely on empirical stability, not proofs |
| Long-horizon reasoning | Understanding scenes over long time spans | Error accumulation and memory limits | Sliding windows, recurrent modules | Best papers focus on short-term tasks |
| Physical consistency | Ensuring predictions obey physical laws | Physics constraints are hard to encode differentiably | Physics-informed losses | Usually approximate and task-specific |
| Evaluation under ambiguity | Multiple valid interpretations of the same scene | Ground truth often assumes a single answer | Pick one label or average | Best papers collapse ambiguity instead of modeling it |


B. Key Meta-Observation (Critical)

| Observation | Explanation |
| --- | --- |
| These are not “missing tricks” | They are structural modeling problems |
| They predate deep learning | Many come from 1950–2000 math/physics |
| Best papers optimize within broken assumptions | They rarely question the assumptions themselves |
| Solving them reduces leaderboard gains | Which is why incentives avoid them |
| They require saying “this task is ill-posed” | CVPR culture discourages this |


Status Overview

| Company | Primary Motivation | What They Do Today | What They Explicitly Do NOT Do | Why They Stop There |
| --- | --- | --- | --- | --- |
| Apple | Product reliability, AR UX | Factory calibration, tight hardware control, limited runtime correction (ISP, ARKit) | No general online re-calibration of intrinsics/extrinsics | System risk, cost, consumer tolerance, closed ecosystem |
| Google | Developer platform, ML-first vision | ARCore runtime estimation, ML-based geometric compensation | No metric-accurate, device-level self-calibration | Prioritizes ML robustness over geometric correctness |
| Meta | Social AR, avatar realism | Per-session tracking calibration for AR effects | No persistent, long-term calibration across time | Focus on perceptual realism, not physical accuracy |
| Microsoft | Enterprise AR, robotics | Device-specific calibration pipelines (HoloLens) | No general-purpose consumer-scale solution | Enterprise-only scale, controlled hardware |
| Amazon | Commerce, logistics | Robotics calibration in warehouses | No mobile-device-facing solution | Domain-specific, not platform-oriented |
| Qualcomm | Chip enablement | ISP tuning, sensor fusion hooks | No system-level calibration ownership | Sells silicon, not end-to-end systems |


What Is Fundamentally Missing

| Missing Capability | Status |
| --- | --- |
| Online intrinsic re-estimation | Not shipped by any |
| Target-free calibration | Research-only |
| Long-term temporal consistency | Not addressed |
| Cross-camera self-consistency | Partial hacks only |
| System-level ownership | No clear owner |


How ML Makes Camera Errors More Dangerous

| Stage | What Happens | Why It Is Dangerous |
| --- | --- | --- |
| Geometry is wrong | Camera intrinsics or extrinsics drift | The physical reference frame is no longer correct |
| ML compensates | Neural networks adapt and mask errors | Errors are hidden instead of detected |
| System appears to work | Outputs look plausible to users and metrics | No obvious failure signal is triggered |
| Metrics pass | Task-level KPIs remain within tolerance | Validation does not detect geometric inconsistency |
| Lost signal | Geometric consistency is no longer enforced | The system loses its primary correctness alarm |
| Result | The system does not know it is wrong | Errors become silent, global, and compounding |


Camera as the Global Reference Frame in Vision Systems

| Module | What It Depends On |
| --- | --- |
| SLAM | Camera intrinsics and extrinsics |
| Augmented Reality (AR) | Camera coordinate frame |
| Depth / Stereo | Multi-camera geometric consistency |
| Sensor Fusion | Camera–IMU extrinsic calibration |
| Robotics | Mapping between camera frame and world frame |


Camera Calibration Core Definition

| Concept | Meaning |
| --- | --- |
| Calibration | Estimating the mapping between 3D world coordinates and 2D image measurements |
| Intrinsics | Parameters internal to the camera (focal length, principal point, distortion) |
| Extrinsics | Rigid transformation between camera and world (or other sensors) |
| Camera Registration | Estimation of the rigid pose (rotation & translation) of a camera relative to another reference (e.g., another camera, a world frame, or a sensor) |
| Assumption (classical) | Camera parameters are static and known |
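
For reference, the classical pinhole projection model these terms refer to (a standard formulation, stated here rather than taken from the source), with $K$ the intrinsic matrix, $[R \mid t]$ the extrinsics, and $\lambda$ the projective depth:

$\lambda \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = K\,[R \mid t] \begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix}, \qquad K = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}$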




References