Point tracking aims to follow visual points through complex motion, occlusion, and viewpoint changes, and has advanced rapidly with modern foundation models. Yet progress toward general point tracking remains constrained by limited high-quality data, as existing datasets often provide insufficient diversity and imperfect trajectory annotations.
To this end, we introduce SynthVerse, a large-scale, diverse synthetic dataset specifically designed for point tracking. SynthVerse includes several new domains and object types missing from existing synthetic datasets, such as animated-film-style content, embodied manipulation, scene navigation, and articulated objects.
SynthVerse substantially expands dataset diversity by covering a broader range of object categories and providing high-quality dynamic motions and interactions, enabling more robust training and evaluation for general point tracking. In addition, we establish a highly diverse point tracking benchmark to systematically evaluate state-of-the-art methods under broader domain shifts.
Extensive experiments and analyses demonstrate that training with SynthVerse yields consistent improvements in generalization and reveal limitations of existing trackers under diverse settings.
✓ indicates that the original dataset already provides the corresponding signal. ✗ indicates that the original dataset does not include this information, and we do not provide it either. 🙂 indicates that the original dataset lacks this information, but we additionally annotate and incorporate it in SynthVerse.
| SynthVerse Domain | Raw Data | RGB | Depth | Instance Masks | Camera Poses | Point Trajectories | Point Visibility |
|---|---|---|---|---|---|---|---|
| Embodied | GenManip | ✓ | ✓ | ✓ | ✓ | 🙂 | 🙂 |
| Human | TCHCDR | ✓ | ✓ | ✓ | 🙂 | 🙂 | 🙂 |
| Animal | Truebones | ✓ | 🙂 | 🙂 | 🙂 | 🙂 | 🙂 |
| Animal | AnyTop | ✓ | 🙂 | 🙂 | 🙂 | 🙂 | 🙂 |
| Objects | OmniObject3D | ✓ | 🙂 | 🙂 | 🙂 | 🙂 | 🙂 |
| Objects | PartNet-M | ✓ | 🙂 | 🙂 | 🙂 | 🙂 | 🙂 |
| Objects | Infinite-M | ✓ | 🙂 | 🙂 | 🙂 | 🙂 | 🙂 |
| Objects | BlendSwap | ✓ | 🙂 | 🙂 | ✓ | 🙂 | 🙂 |
| Objects | Blender-Demo | ✓ | 🙂 | 🙂 | ✓ | 🙂 | 🙂 |
| Film | Blender-Studio | ✓ | 🙂 | ✗ | ✓ | 🙂 | 🙂 |
| Navigation | InternScenes | ✓ | ✓ | ✗ | ✓ | 🙂 | 🙂 |
| Navigation | Mixamo | ✓ | 🙂 | ✗ | 🙂 | 🙂 | 🙂 |
| Interaction | Hot3D | ✓ | 🙂 | ✓ | ✓ | ✓ | 🙂 |
| Interaction | HTML | ✓ | 🙂 | 🙂 | 🙂 | 🙂 | 🙂 |
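Concretely, each SynthVerse sequence pairs rendered frames with the per-frame and per-point signals listed above. The sketch below shows one plausible way such a sequence could be packaged and loaded; the `.npz` layout, the `load_sequence` name, and all field names and shapes are illustrative assumptions, not the official release format.

```python
# Hypothetical loader for one SynthVerse-style sequence.
# The .npz packaging and field names below are assumptions for illustration.
import numpy as np

def load_sequence(path: str) -> dict:
    """Load one synthetic sequence carrying the signals from the table above."""
    data = np.load(path)  # assumed: one compressed .npz archive per sequence
    return {
        "rgb":        data["rgb"],          # (T, H, W, 3) uint8 frames
        "depth":      data["depth"],        # (T, H, W) float32 metric depth
        "masks":      data["masks"],        # (T, H, W) int32 instance ids
        "intrinsics": data["intrinsics"],   # (T, 3, 3) camera intrinsics K
        "extrinsics": data["extrinsics"],   # (T, 4, 4) camera-to-world poses
        "tracks_2d":  data["tracks_2d"],    # (N, T, 2) pixel trajectories
        "tracks_3d":  data["tracks_3d"],    # (N, T, 3) world-space trajectories
        "visibility": data["visibility"],   # (N, T) bool per-frame visibility
    }
```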
- **Film**: Shot-level production content with diverse assets and complex dynamics
- **Embodied**: VLA-driven robot interactions with diverse objects in realistic environments
- **Navigation**: Indoor scene traversal with dynamic actors and egocentric viewpoints
- **Human**: 20K characters with 2K motion sequences covering diverse activities
- **Animal**: 75 species with 20+ motion patterns each (attacking, jumping, etc.)
- **Interaction**: High-fidelity hand-object interactions with the MANO hand model
- **Objects (articulated)**: Multi-joint structures with physics-based simulation
- **Objects (non-rigid)**: Flowers, garments, and other non-rigid objects
For 2D trackers (the CoTracker series), we lift predicted 2D points to 3D using ground-truth depth and camera poses; a code sketch of this lifting follows the tables below. All metrics follow the higher-is-better convention: AJ is average Jaccard, APD is average position accuracy across distance thresholds, and OA is occlusion accuracy (TAPVid-style conventions), with the 2D/3D suffixes denoting evaluation in the image plane and in 3D space. Columns are grouped by SynthVerse evaluation domain; mAvg is the mean over the seven domains.
| Methods | Nav AJ3D | Nav APD3D | Nav AJ2D | Nav APD2D | Nav OA | Human AJ3D | Human APD3D | Human AJ2D | Human APD2D | Human OA | Animal AJ3D | Animal APD3D | Animal AJ2D | Animal APD2D | Animal OA | Objects AJ3D | Objects APD3D | Objects AJ2D | Objects APD2D | Objects OA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CoTracker | 19.6 | 24.1 | 38.0 | 43.7 | 62.3 | 25.2 | 24.8 | 51.0 | 68.0 | 84.4 | 12.6 | 18.3 | 49.7 | 66.2 | 76.3 | 28.9 | 38.3 | 46.1 | 60.2 | 82.4 |
| CoTracker3 | 18.7 | 23.4 | 37.0 | 45.4 | 64.8 | 27.1 | 39.2 | 53.1 | 74.8 | 83.3 | 14.0 | 21.0 | 53.6 | 73.5 | 76.8 | 26.7 | 39.4 | 39.2 | 57.8 | 63.4 |
| SpatialTrackerV2-offline | 19.4 | 21.1 | 37.5 | 45.5 | 74.3 | 17.7 | 25.5 | 30.6 | 43.7 | 62.1 | 21.2 | 30.1 | 42.5 | 55.5 | 69.3 | 30.4 | 43.4 | 25.7 | 34.2 | 55.0 |
| SpatialTrackerV2-online | 20.5 | 23.5 | 38.0 | 46.6 | 78.8 | 22.8 | 30.1 | 40.2 | 53.2 | 81.7 | 23.3 | 31.8 | 48.7 | 59.4 | 86.4 | 30.1 | 37.0 | 30.9 | 40.3 | 59.1 |
| TAPIP3D-camera | 12.3 | 22.5 | 31.2 | 52.4 | 82.8 | 28.4 | 42.4 | 45.3 | 66.7 | 81.0 | 43.0 | 56.8 | 54.4 | 72.9 | 82.0 | 21.4 | 38.8 | 33.2 | 61.8 | 65.4 |
| TAPIP3D-world | 13.6 | 25.1 | 33.0 | 54.9 | 82.9 | 28.1 | 42.5 | 44.7 | 66.9 | 81.2 | 43.2 | 56.6 | 54.3 | 72.5 | 82.2 | 21.8 | 38.7 | 33.3 | 61.4 | 66.3 |
| Methods | Embodied AJ3D | Embodied APD3D | Embodied AJ2D | Embodied APD2D | Embodied OA | Film AJ3D | Film APD3D | Film AJ2D | Film APD2D | Film OA | Interaction AJ3D | Interaction APD3D | Interaction AJ2D | Interaction APD2D | Interaction OA | mAvg AJ3D | mAvg APD3D | mAvg AJ2D | mAvg APD2D | mAvg OA |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CoTracker | 22.9 | 27.3 | 35.8 | 41.3 | 56.9 | 4.7 | 7.3 | 51.4 | 65.3 | 83.2 | 22.6 | 43.6 | 45.0 | 81.8 | 58.5 | 19.5 | 26.2 | 45.3 | 60.9 | 72.0 |
| CoTracker3 | 37.1 | 44.7 | 50.6 | 58.3 | 58.7 | 5.0 | 7.8 | 50.1 | 63.2 | 83.5 | 20.4 | 38.6 | 46.5 | 82.5 | 59.4 | 21.3 | 30.6 | 47.2 | 65.1 | 70.0 |
| SpatialTrackerV2-offline | 25.5 | 30.2 | 42.9 | 47.9 | 78.0 | 9.8 | 15.0 | 48.3 | 59.2 | 78.7 | 13.3 | 24.6 | 31.3 | 49.5 | 55.6 | 19.6 | 27.1 | 37.0 | 47.9 | 67.6 |
| SpatialTrackerV2-online | 25.2 | 31.8 | 43.5 | 51.4 | 82.3 | 10.2 | 15.3 | 56.0 | 66.6 | 83.1 | 23.5 | 31.7 | 42.6 | 54.9 | 59.1 | 22.2 | 28.7 | 42.8 | 53.2 | 75.8 |
| TAPIP3D-camera | 24.8 | 32.2 | 36.9 | 52.2 | 79.0 | 41.9 | 53.1 | 56.9 | 71.7 | 81.5 | 61.5 | 77.8 | 71.7 | 90.4 | 84.2 | 33.3 | 46.2 | 47.1 | 66.9 | 79.4 |
| TAPIP3D-world | 24.5 | 31.5 | 37.4 | 52.2 | 79.4 | 41.9 | 52.7 | 57.6 | 72.3 | 82.3 | 60.1 | 77.5 | 71.4 | 90.2 | 84.0 | 33.3 | 46.4 | 47.4 | 67.2 | 79.8 |
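As a concrete reference for the protocol in the caption above, the sketch below unprojects one frame's predicted 2D points into world space using ground-truth depth, intrinsics, and a camera-to-world pose, then computes an APD-style score. `lift_to_3d`, `apd_3d`, and the threshold values are illustrative assumptions, not the paper's released evaluation code.

```python
# Minimal sketch of lifting 2D tracks to 3D with ground-truth geometry,
# plus an APD-style score. Names and thresholds are illustrative assumptions.
import numpy as np

def lift_to_3d(pts_2d, depth, K, cam_to_world):
    """Unproject (N, 2) pixel points of one frame to (N, 3) world coordinates.

    Assumes a metric depth map and points inside the image bounds.
    """
    x, y = pts_2d[:, 0], pts_2d[:, 1]
    z = depth[y.round().astype(int), x.round().astype(int)]       # sample GT depth at each point
    rays = np.linalg.inv(K) @ np.stack([x, y, np.ones_like(x)])   # (3, N) rays at unit depth
    cam_pts = rays * z                                            # scale rays to metric depth
    homo = np.vstack([cam_pts, np.ones_like(z)])                  # (4, N) homogeneous camera points
    return (cam_to_world @ homo)[:3].T                            # (N, 3) points in world frame

def apd_3d(pred, gt, visible, thresholds=(0.05, 0.1, 0.2, 0.4, 0.8)):
    """APD-style score: mean fraction of visible points within each threshold (meters)."""
    err = np.linalg.norm(pred - gt, axis=-1)                      # (N, T) Euclidean errors
    return float(np.mean([(err[visible] < t).mean() for t in thresholds]))
```

The TAPIP3D variants predict 3D tracks directly (in camera or world coordinates), so only the CoTracker rows rely on this lifting; the 2D metrics are computed on the original pixel trajectories.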
```bibtex
@article{zhao2026SynthVerse,
  title={SynthVerse: A Large-Scale Diverse Synthetic Dataset for Point Tracking},
  author={Weiguang Zhao and Haoran Xu and Xingyu Miao and Qin Zhao and Rui Zhang and Kaizhu Huang and Ning Gao and Peizhou Cao and Mingze Sun and Mulin Yu and Tao Lu and Linning Xu and Junting Dong and Jiangmiao Pang},
  journal={arXiv preprint arXiv:2602.04441},
  year={2026}
}
```