2026 - Master's Thesis - Diffusion, Robotics, Security
Prototype, UZH AI, PRS
- 2026, it's perfect
- 2023 - Seeing a Rose in Five Thousand Ways
- 2026 - VGGT-SLAM 2.0: Real-time Dense Feed-forward Scene Reconstruction
- 2026 -
Topics
- LiDAR-free 4D semantic indoor mapping, from home robots to airlines: privacy, accuracy, and low latency under dark and adverse conditions
- Identity re-identification (Re-ID)
Cute Products
- Matic Robots: level-5 home autonomy
- Rovex Technologies: hospital robotics
- Flow: city navigation
- Taalas Inc.
Task Definition
The essence of LiDAR-free technology can be summarized as: transforming sparse measurements of the physical world into dense geometric inference.
- 2020 - Convolutional Occupancy Networks
- 2022 - Scalable Diffusion Models with Transformers, William Peebles, DiT
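The "sparse measurements to dense geometric inference" idea can be illustrated with a toy depth-completion step: given a handful of metric depth samples on an image grid, densify them by nearest-neighbor propagation. This is a deliberately minimal stand-in for a learned completion network; the function name and data layout are illustrative, not from any cited system.

```python
import numpy as np

def densify_depth(h, w, samples):
    """Fill an h x w depth map from sparse (row, col, depth) samples
    by assigning each pixel the depth of its nearest sample (L2)."""
    ys, xs, ds = (np.array(v, dtype=float) for v in zip(*samples))
    gy, gx = np.mgrid[0:h, 0:w]
    # Squared distance from every pixel to every sparse sample.
    d2 = (gy[..., None] - ys) ** 2 + (gx[..., None] - xs) ** 2
    return ds[np.argmin(d2, axis=-1)]

# Two sparse measurements become a dense 4x4 depth field.
dense = densify_depth(4, 4, [(0, 0, 1.0), (3, 3, 3.0)])
```

A real system replaces the nearest-neighbor rule with a network that also respects image edges and uncertainty, but the input/output contract (sparse metric anchors in, dense geometry out) is the same.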
References
- 2026 - High-Dimensional Probability
- 2025 - Visual Chronicles: Using Multimodal LLMs to Analyze Massive Collections of Images
Missing Capabilities in Classical SLAM
| Capability | Classical SLAM / VIO | Neural Mapping (NeRF, GS-SLAM) | Action-Conditioned World Models |
|---|---|---|---|
| Estimate robot pose ("Where am I?") | ✅ | ✅ | ✅ (implicitly) |
| Represent scene geometry ("What does the world look like?") | ✅ | ✅ | ❌ |
| Model long-term spatial consistency | ✅ | ⚠️ | ❌ |
| Predict future robot motion | ✅ (locally, via IMU) | ❌ | ✅ |
| Predict how actions change the world | ❌ | ❌ | ✅ |
| Handle non-rigid, contact-rich rearrangements | ❌ | ❌ | ✅ |
| Support action-level foresight and planning | ❌ | ❌ | ✅ |
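The capabilities in the last three rows all reduce to one interface: an action-conditioned transition function s_next = f(s, a) that can be rolled out without acting. A minimal sketch, assuming a linear latent dynamics model purely for illustration (the matrices A and B are random placeholders, not a trained model):

```python
import numpy as np

# Minimal action-conditioned world model: s_next = f(s, a).
# A (state transition) and B (action effect) are illustrative placeholders.
rng = np.random.default_rng(0)
A = 0.1 * rng.normal(size=(8, 8))
B = 0.1 * rng.normal(size=(8, 2))

def step(s, a):
    """One latent transition: how an action changes the world state."""
    return A @ s + B @ a

def rollout(s0, actions):
    """Action-level foresight: imagine future states without executing them."""
    states = [s0]
    for a in actions:
        states.append(step(states[-1], a))
    return states

traj = rollout(np.zeros(8), [np.ones(2)] * 3)
```

Classical SLAM and neural mapping have no analogue of `step`; that missing function is exactly what the table's right-hand column supplies.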
Key Evolution of LiDAR-Free 3D Perception
| Stage | Core Papers | Key Innovation | Problem Solved | Technical Essence |
|---|---|---|---|---|
| Stage 1: From 2D to 3D (BEV Revolution) | LSS (Lift, Splat, Shoot), ECCV 2020; BEVDet | Lift 2D image features into 3D space and project them into Bird's-Eye View (BEV) | Vision systems could not unify multi-camera 2D pixels into a consistent 3D coordinate system for navigation | Establishes the foundational BEV-based architecture used in modern camera-only autonomy systems |
| Stage 2: Global Association with Transformers | BEVFormer, ECCV 2022 | Introduces temporal modeling with Transformer-based spatial-temporal attention | Handles occlusion and improves scene consistency by aggregating information across multiple frames | Enables memory over previous frames, improving robustness and dynamic scene understanding |
| Stage 3: Occupancy-Based Scene Understanding (Occupancy Era) | TPVFormer; Tesla Occupancy Network (2022–2023) | Predicts dense 3D occupancy instead of object bounding boxes | Moves beyond object detection to full spatial understanding of free space and obstacles | Represents the world as a semantic voxel grid, enabling fine-grained geometry and material-level reasoning |
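The Stage 1 "lift" and "splat" operations can be sketched in a few lines of numpy: each pixel is back-projected along its ray at several candidate depths, weighted by a per-pixel depth distribution, and the weighted features are sum-pooled into a BEV grid. This is a toy single-camera version for intuition only, not the LSS reference implementation; all parameter names are my own.

```python
import numpy as np

def lift_splat(feats, depth_probs, depth_bins, K_inv, bev_shape, cell):
    """Toy LSS step for one camera.
    feats: (H, W, C) image features; depth_probs: (H, W, D) depth distribution;
    depth_bins: (D,) candidate depths in meters; K_inv: inverse intrinsics;
    bev_shape: (X, Y) BEV grid; cell: meters per BEV cell."""
    H, W, C = feats.shape
    bev = np.zeros(bev_shape + (C,))
    vs, us = np.mgrid[0:H, 0:W]
    pix = np.stack([us, vs, np.ones_like(us)], axis=-1).astype(float)
    rays = pix @ K_inv.T                                  # back-projected rays
    for d, depth in enumerate(depth_bins):
        pts = rays * depth                                # "lift": pixel -> 3D point
        w = depth_probs[..., d:d + 1]                     # per-pixel depth confidence
        ix = (pts[..., 0] / cell).astype(int) + bev_shape[0] // 2  # lateral axis
        iy = (pts[..., 2] / cell).astype(int)                      # forward axis
        ok = (ix >= 0) & (ix < bev_shape[0]) & (iy >= 0) & (iy < bev_shape[1])
        np.add.at(bev, (ix[ok], iy[ok]), (w * feats)[ok])  # "splat": sum-pool into BEV
    return bev

feats = np.ones((2, 2, 1))
depth_probs = np.full((2, 2, 2), 0.5)   # uniform over two depth bins
bev = lift_splat(feats, depth_probs, np.array([1.0, 2.0]),
                 np.eye(3), (4, 4), cell=1.0)
```

In the real multi-camera setting the same splat runs once per camera into a shared grid, and the depth distribution is predicted by the network rather than fixed.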
Toolkits for LiDAR-Free Perception
| Layer | Technical Components | Functional Role and Research Application |
|---|---|---|
| Distributed Orchestration | Ray (Ray Core, Ray Data) | Acts as the distributed runtime engine for asynchronous multi-camera stream ingestion. Enables zero-copy data sharing via the Plasma object store and manages cross-node task scheduling for real-time multi-object tracking. |
| Computational Framework | JAX / XLA | Provides the functional programming foundation for high-performance numerical computing. Leverages the XLA compiler to optimize 4D trajectory estimation, uncertainty-aware bundle adjustment, and spatiotemporal manifold operations on GPU/TPU clusters. |
| Model Composition | dm-haiku | Serves as the neural network library for JAX. Used to implement the Cross-Modal Transformer, degradation-aware encoders, and memory modules for long-term Re-Identification with explicit parameter management and state handling. |
| Distributed Sharding | Mezzanine | Enables fine-grained tensor partitioning across heterogeneous compute nodes. Critical for scaling large Bundle Adjustment Hessian matrices and handling dynamic sharding when the number of tracked targets varies over time. |
| Structural Inspection | Penzai | Provides model inspection and structural modification tools for large foundation models. Used to analyze latent spatiotemporal representations and selectively modify attention heads within the transformer backbone. |
| 3D Vision and Geometry | PyTorch3D / COLMAP | Supports differentiable 3D geometry operations including PnP solvers, triangulation, reprojection error computation, and camera pose refinement. COLMAP supplies baseline structure-from-motion pose initialization for multi-view geometry. |
| Robotic Middleware | ROS2 / C++ / Rust | Handles low-latency message passing between drone hardware and compute clusters. Rust and C++ are used for safety-critical and high-concurrency modules such as temporal synchronization, RocSync integration, and real-time control loops. |
| Simulation and Synthesis | Unreal Engine / SUMO / Blender | Generates high-fidelity digital twin environments with synchronized multi-modal ground truth including RGB, depth, trajectories, and timestamps. Supports training and evaluation under adverse weather and long-tail navigation scenarios. |
| Hardware Acceleration | CUDA / Linux | Provides low-level GPU acceleration and kernel-level resource management for O(T) transformer inference, real-time backend optimization, and parallelized geometric solvers. |
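Several rows above (JAX for bundle adjustment, PyTorch3D/COLMAP for reprojection error) revolve around one quantity: the reprojection residual. A minimal numpy sketch, assuming a pinhole camera with pose (R, t); real pipelines add robust losses, analytic Jacobians, and distortion models:

```python
import numpy as np

def project(K, R, t, X):
    """Pinhole projection of a world point X into a camera with pose (R, t)."""
    x = K @ (R @ X + t)
    return x[:2] / x[2]

def reprojection_error(K, R, t, X, uv):
    """The per-observation residual that bundle adjustment minimizes
    jointly over camera poses and 3D points (toy version, no robust loss)."""
    return project(K, R, t, X) - np.asarray(uv, dtype=float)

# A point on the optical axis at 2 m projects to the principal point.
K = np.array([[100.0, 0.0, 64.0],
              [0.0, 100.0, 48.0],
              [0.0, 0.0, 1.0]])
r = reprojection_error(K, np.eye(3), np.zeros(3),
                       np.array([0.0, 0.0, 2.0]), (64.0, 48.0))
```

Stacking this residual over all observations and differentiating it with respect to poses and points is exactly the computation that benefits from XLA compilation and from sharding the resulting Hessian across nodes.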
Detection and Tracking Algorithms
| Layer | Technical Components | Application |
|---|---|---|
| Detection Backbone | RT-DETR | Transformer-based object detection producing spatially consistent bounding boxes and feature embeddings for downstream multi-view association. |
| Temporal Association | Query-Based Tracking (e.g., MOTR-style) | Maintains persistent object identities across frames using learnable spatiotemporal queries instead of heuristic matching. |
| Multi-View Correspondence | Epipolar Geometry | Filters cross-camera matches using the Fundamental Matrix to enforce geometric consistency before triangulation. |
| 3D Reconstruction | Triangulation + PnP | Recovers metric 3D positions from validated multi-view 2D detections and refines camera pose estimates. |
| Global Optimization | Bundle Adjustment | Minimizes reprojection error jointly over camera poses and object trajectories to achieve globally consistent 4D reconstruction. |
| Dynamic Motion Modeling | Motion Decomposition | Separates object motion from camera motion to stabilize optimization under dynamic scenes. |
| Spatiotemporal Refinement | Uncertainty-Aware Optimization | Weighs correspondences by confidence scores to improve robustness under occlusion, noise, and adverse weather conditions. |
| Identity Persistence | Cross-Camera Re-Identification | Uses learned feature embeddings to maintain consistent object identities across disjoint camera views. |
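The multi-view correspondence row can be made concrete: a candidate match (x1, x2) across two cameras is kept only if it satisfies the epipolar constraint x2ᵀ F x1 ≈ 0. A minimal sketch, assuming an idealized rectified stereo pair (pure x-translation, identity intrinsics) so the constraint reduces to "same image row"; function names and the threshold are illustrative:

```python
import numpy as np

def epipolar_residual(F, x1, x2):
    """Algebraic epipolar error x2^T F x1 for homogeneous point pairs."""
    return float(x2 @ F @ x1)

def filter_matches(F, pts1, pts2, thresh=1e-3):
    """Keep only cross-camera matches geometrically consistent with F,
    before they are passed on to triangulation."""
    keep = []
    for p1, p2 in zip(pts1, pts2):
        x1, x2 = np.append(p1, 1.0), np.append(p2, 1.0)
        if abs(epipolar_residual(F, x1, x2)) < thresh:
            keep.append((tuple(p1), tuple(p2)))
    return keep

# F for a rectified pair: constraint is v1 == v2 (matching rows).
F = np.array([[0.0, 0.0, 0.0],
              [0.0, 0.0, -1.0],
              [0.0, 1.0, 0.0]])
kept = filter_matches(F, [(0.2, 0.5), (0.1, 0.3)],
                         [(0.4, 0.5), (0.2, 0.9)])
```

In practice a Sampson or symmetric-epipolar distance in pixels replaces the raw algebraic residual, but the pipeline position is the same: geometric gating before triangulation and Re-ID.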