| PointOdyssey v1.2 (Stanford, ICCV 2023) | Synthetic, long-term deformable human/animal scenes, 30 fps, 540×960 | Dense depth maps and surface normals per frame | Instance segmentation and visibility mask | Full camera intrinsics and extrinsics | ~2,000 frames per video × 159 videos (≈ 1 h 30 min) | ≈ 300 GB |
| Dynamic Replica v2 (Meta AI, CVPR 2023 DynamicStereo) | Synthetic RGB-D videos of human–object interactions | Dense depth maps | Semantic segmentation masks | Full camera poses | ~300 frames (≈ 10 s) per video × 524 videos (≈ 1 h 15 min training split) | ≈ 1.7–1.8 TB |
| Kubric (Google Research 2022) | Procedural synthetic multi-object MVS generator | Per-pixel depth and normals | Segmentation and instance masks | Intrinsics and extrinsics available | Short clips (≈ 24 frames on average) | ≈ 1 TB |
| MPI-Sintel (Complete) | Blender-rendered movie frames with optical-flow ground truth | Depth and optical-flow maps | Render layers (albedo, shading) rather than semantic masks | Camera poses provided | 23 training + 12 test sequences (≈ 1 min total) | ≈ 5.3 GB |
| TUM-Dynamics (RGB-D SLAM Benchmark 2012) | Real RGB-D video sequences with moving objects | Depth maps from Kinect sensor | No semantic mask provided | Ground-truth camera poses | 2–3 min per sequence × 15–20 sequences (≈ 1 h total) | ≈ 60 GB |
| ETH3D (Schöps et al., CVPR 2017) | Real multi-view stereo benchmark (static scenes) | Ground-truth depth maps | No semantic mask provided | Calibrated multi-view poses | Static scenes (~10–50 images per scene) | ≈ 25 GB |
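For readers planning to mix several of these sources, the table's rough footprint can be tallied programmatically. The sketch below is a minimal, illustrative catalog in Python: the `DatasetSpec` dataclass, the `CATALOG` list, and `summarize` are hypothetical helpers introduced here (not part of any dataset's release tooling), and the numbers simply mirror the approximate figures quoted in the table above.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class DatasetSpec:
    """Rough per-dataset figures copied from the comparison table above."""
    name: str
    synthetic: bool                 # rendered data (True) vs. real capture (False)
    semantic_masks: bool            # any segmentation / instance / visibility masks
    duration_min: Optional[float]   # approx. total footage in minutes (None if not quoted)
    storage_gb: float               # approx. on-disk size in GB

# Estimates only; these mirror the "~"/"≈" values in the table, not official release manifests.
CATALOG = [
    DatasetSpec("PointOdyssey v1.2",     True,  True,  90.0,   300.0),
    DatasetSpec("Dynamic Replica v2",    True,  True,  75.0,  1750.0),
    DatasetSpec("Kubric",                True,  True,  None,  1000.0),
    DatasetSpec("MPI-Sintel (Complete)", True,  False,  1.0,     5.3),
    DatasetSpec("TUM-Dynamics",          False, False, 60.0,    60.0),
    DatasetSpec("ETH3D",                 False, False, None,    25.0),
]

def summarize(catalog: list[DatasetSpec]) -> None:
    """Print the aggregate footprint of the catalog."""
    total_gb = sum(d.storage_gb for d in catalog)
    total_min = sum(d.duration_min for d in catalog if d.duration_min is not None)
    synthetic = [d.name for d in catalog if d.synthetic]
    print(f"Total storage:  ~{total_gb / 1000:.1f} TB")
    print(f"Quoted footage: ~{total_min / 60:.1f} h (where a duration is given)")
    print(f"Synthetic sets: {', '.join(synthetic)}")

if __name__ == "__main__":
    summarize(CATALOG)
```

Run as-is, this reports a combined footprint of roughly 3.1 TB, dominated by Dynamic Replica and Kubric, with only about 3–4 hours of footage across the entries that quote a duration.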