SynthVerse: A Large-Scale Diverse Synthetic Dataset for Point Tracking

Weiguang Zhao1,4,*, Haoran Xu2,*, Xingyu Miao3, Qin Zhao2, Rui Zhang4,†, Kaizhu Huang5,†, Ning Gao6,7, Peizhou Cao6,8, Mingze Sun9, Mulin Yu6, Tao Lu6, Linning Xu6,10, Junting Dong6,†,✨, Jiangmiao Pang6

1University of Liverpool, 2Zhejiang University, 3Durham University, 4Xi'an Jiaotong-Liverpool University,

5Duke Kunshan University, 6Shanghai AI Laboratory, 7Xi'an Jiaotong University,

8Beihang University, 9Tsinghua University, 10The Chinese University of Hong Kong

*Equal contribution, †Corresponding Author, ✨Project Leader


Interactive Demo


Note: Distant scene points have been removed to speed up loading. Full depth information is retained in the SynthVerse dataset.
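For readers building a similar viewer, the pruning step amounts to thresholding the depth map before backprojecting it into a point cloud. A minimal sketch in Python follows; the cutoff value, array shapes, and function name are our own illustrative assumptions, not part of the release.

import numpy as np

def prune_distant_points(depth, rgb, K, max_depth=10.0):
    """Backproject a depth map, keeping only points nearer than max_depth.

    depth: (H, W) metric depth; rgb: (H, W, 3) colors; K: (3, 3) pinhole
    intrinsics. max_depth is a hypothetical cutoff chosen for viewer speed.
    """
    u, v = np.meshgrid(np.arange(depth.shape[1]), np.arange(depth.shape[0]))
    keep = (depth > 0) & (depth < max_depth)        # drop invalid and distant pixels
    z = depth[keep]
    x = (u[keep] - K[0, 2]) * z / K[0, 0]           # X = (u - cx) * Z / fx
    y = (v[keep] - K[1, 2]) * z / K[1, 1]           # Y = (v - cy) * Z / fy
    return np.stack([x, y, z], axis=-1), rgb[keep]  # (N, 3) points and colors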

Demo scenes: Embodied, Flower, Film, Hand

Visualization Results

Abstract

Point tracking aims to follow visual points through complex motion, occlusion, and viewpoint changes, and has advanced rapidly with modern foundation models. Yet progress toward general point tracking remains constrained by limited high-quality data, as existing datasets often provide insufficient diversity and imperfect trajectory annotations.

To this end, we introduce SynthVerse, a large-scale, diverse synthetic dataset specifically designed for point tracking. SynthVerse includes several new domains and object types missing from existing synthetic datasets, such as animated-film-style content, embodied manipulation, scene navigation, and articulated objects.

SynthVerse substantially expands dataset diversity by covering a broader range of object categories and providing high-quality dynamic motions and interactions, enabling more robust training and evaluation for general point tracking. In addition, we establish a highly diverse point tracking benchmark to systematically evaluate state-of-the-art methods under broader domain shifts.

Extensive experiments and analyses demonstrate that training with SynthVerse yields consistent improvements in generalization and reveal limitations of existing trackers under diverse settings.

Dataset Overview

Dataset Statistics

5,816K training frames
59K test frames
48K sequences
7+ scene types

Source Data

For each signal, the original source dataset either already provides it, does not include it (in which case we do not provide it either), or lacks it but we additionally annotate and incorporate it in SynthVerse; 🙂 marks this last case.

SynthVerse Raw Data RGB Depth Instance Masks Camera Poses Point Trajectories Point Visibility
Embodied GenManip 🙂 🙂
Human TCHCDR 🙂 🙂 🙂
Animal Truebones 🙂 🙂 🙂 🙂 🙂
AnyTop 🙂 🙂 🙂 🙂 🙂
Objects OmniObject3D 🙂 🙂 🙂 🙂 🙂
PartNet-M 🙂 🙂 🙂 🙂 🙂
Infinite-M 🙂 🙂 🙂 🙂 🙂
BlendSwap 🙂 🙂 🙂 🙂
Blender-Demo 🙂 🙂 🙂 🙂
Film Blender-Studio 🙂 🙂 🙂
Navigation InternScenes 🙂 🙂
Mixamo 🙂 🙂 🙂 🙂
Interaction Hot3D 🙂 🙂
HTML 🙂 🙂 🙂 🙂 🙂
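To make the signal columns above concrete, a per-sequence sample could be organized as below. This layout is purely hypothetical: the field names and shapes are our own illustration, and the released format may differ.

from dataclasses import dataclass
import numpy as np

@dataclass
class SynthVerseSequence:
    """Hypothetical per-sequence record mirroring the six annotated signals.

    Shapes are illustrative assumptions: T frames at H x W resolution,
    N tracked points per sequence.
    """
    rgb: np.ndarray             # (T, H, W, 3) uint8 color frames
    depth: np.ndarray           # (T, H, W) float32 metric depth
    instance_masks: np.ndarray  # (T, H, W) int32 per-pixel instance ids
    camera_poses: np.ndarray    # (T, 4, 4) world-from-camera extrinsics
    trajectories: np.ndarray    # (T, N, 2) 2D point positions in pixels
    visibility: np.ndarray      # (T, N) bool, True where a point is visible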

Data Generation Pipeline

[Figure: overview of the SynthVerse data generation pipeline]

Scene Types

🎬 Animated Films

Shot-level production content with diverse assets and complex dynamics

🤖 Embodied Manipulation

Vision-language-action (VLA) driven robot interactions with diverse objects in realistic environments

🚶 Navigation

Indoor scene traversal with dynamic actors and egocentric viewpoints

👤 Human Motion

20K characters with 2K motion sequences covering diverse activities

🦊 Animal Motion

75 species with 20+ motion patterns each (attacking, jumping, etc.)

✋ Hand Interaction

High-fidelity hand-object interactions built on the MANO hand model

🔧 Articulated Objects

Multi-joint structures with physics-based simulation

🌸 Deformable Objects

Flowers, garments, and other non-rigid objects

Benchmark Results


For the 2D trackers (the CoTracker series), we lift the predicted 2D points to 3D using ground-truth depth and camera poses. All metrics follow the higher-is-better convention; the Average column reports the mean over the seven benchmark splits.
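A minimal sketch of this lifting step, assuming pinhole intrinsics K and world-from-camera poses (the variable names and the nearest-neighbor depth lookup are our simplifications):

import numpy as np

def lift_2d_to_world(uv, depth, K, world_from_cam):
    """Lift one frame's predicted 2D points to 3D world coordinates.

    uv: (N, 2) predicted pixels; depth: (H, W) GT depth map;
    K: (3, 3) intrinsics; world_from_cam: (4, 4) GT camera pose.
    """
    u = np.clip(np.round(uv[:, 0]).astype(int), 0, depth.shape[1] - 1)
    v = np.clip(np.round(uv[:, 1]).astype(int), 0, depth.shape[0] - 1)
    z = depth[v, u]                           # sample GT depth at each track point
    x = (uv[:, 0] - K[0, 2]) * z / K[0, 0]    # backproject to the camera frame
    y = (uv[:, 1] - K[1, 2]) * z / K[1, 1]
    cam = np.stack([x, y, z, np.ones_like(z)], axis=-1)  # homogeneous (N, 4)
    return (world_from_cam @ cam.T).T[:, :3]  # apply the GT pose

Bilinear interpolation of the depth map at sub-pixel coordinates is a common refinement over the rounding used here.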

Methods | SynthVerse-Nav | SynthVerse-Human | SynthVerse-Animal | SynthVerse-Objects
(each group: AJ3D, APD3D, AJ2D, APD2D, OA)
CoTracker | 19.6 24.1 38.0 43.7 62.3 | 25.2 24.8 51.0 68.0 84.4 | 12.6 18.3 49.7 66.2 76.3 | 28.9 38.3 46.1 60.2 82.4
CoTracker3 | 18.7 23.4 37.0 45.4 64.8 | 27.1 39.2 53.1 74.8 83.3 | 14.0 21.0 53.6 73.5 76.8 | 26.7 39.4 39.2 57.8 63.4
SpatialTrackerV2-offline | 19.4 21.1 37.5 45.5 74.3 | 17.7 25.5 30.6 43.7 62.1 | 21.2 30.1 42.5 55.5 69.3 | 30.4 43.4 25.7 34.2 55.0
SpatialTrackerV2-online | 20.5 23.5 38.0 46.6 78.8 | 22.8 30.1 40.2 53.2 81.7 | 23.3 31.8 48.7 59.4 86.4 | 30.1 37.0 30.9 40.3 59.1
TAPIP3D-camera | 12.3 22.5 31.2 52.4 82.8 | 28.4 42.4 45.3 66.7 81.0 | 43.0 56.8 54.4 72.9 82.0 | 21.4 38.8 33.2 61.8 65.4
TAPIP3D-world | 13.6 25.1 33.0 54.9 82.9 | 28.1 42.5 44.7 66.9 81.2 | 43.2 56.6 54.3 72.5 82.2 | 21.8 38.7 33.3 61.4 66.3

Methods | SynthVerse-Embodied | SynthVerse-Film | SynthVerse-Interaction | Average
(each group: AJ3D, APD3D, AJ2D, APD2D, OA)
CoTracker | 22.9 27.3 35.8 41.3 56.9 | 4.7 7.3 51.4 65.3 83.2 | 22.6 43.6 45.0 81.8 58.5 | 19.5 26.2 45.3 60.9 72.0
CoTracker3 | 37.1 44.7 50.6 58.3 58.7 | 5.0 7.8 50.1 63.2 83.5 | 20.4 38.6 46.5 82.5 59.4 | 21.3 30.6 47.2 65.1 70.0
SpatialTrackerV2-offline | 25.5 30.2 42.9 47.9 78.0 | 9.8 15.0 48.3 59.2 78.7 | 13.3 24.6 31.3 49.5 55.6 | 19.6 27.1 37.0 47.9 67.6
SpatialTrackerV2-online | 25.2 31.8 43.5 51.4 82.3 | 10.2 15.3 56.0 66.6 83.1 | 23.5 31.7 42.6 54.9 59.1 | 22.2 28.7 42.8 53.2 75.8
TAPIP3D-camera | 24.8 32.2 36.9 52.2 79.0 | 41.9 53.1 56.9 71.7 81.5 | 61.5 77.8 71.7 90.4 84.2 | 33.3 46.2 47.1 66.9 79.4
TAPIP3D-world | 24.5 31.5 37.4 52.2 79.4 | 41.9 52.7 57.6 72.3 82.3 | 60.1 77.5 71.4 90.2 84.0 | 33.3 46.4 47.4 67.2 79.8
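For reference, assuming the standard TAP-Vid-style definitions (APD as position accuracy averaged over distance thresholds, OA as binary visibility classification accuracy, and AJ as a Jaccard score that jointly penalizes position and visibility errors), the APD and OA terms can be sketched as follows. The pixel thresholds {1, 2, 4, 8, 16} are the usual 2D choice; the exact thresholds for the 3D variants are not specified here.

import numpy as np

def apd_and_oa(pred_xy, gt_xy, pred_vis, gt_vis, thresholds=(1, 2, 4, 8, 16)):
    """Threshold-averaged position accuracy (APD) and occlusion accuracy (OA).

    pred_xy, gt_xy: (T, N, 2) trajectories in pixels;
    pred_vis, gt_vis: (T, N) boolean visibility flags.
    Thresholds follow the common TAP-Vid 2D convention (an assumption here).
    """
    err = np.linalg.norm(pred_xy - gt_xy, axis=-1)   # (T, N) endpoint errors
    vis = gt_vis.astype(bool)
    # Fraction of ground-truth-visible points within each threshold, averaged.
    apd = float(np.mean([(err[vis] < t).mean() for t in thresholds]))
    # Occlusion accuracy: visibility classification over all points.
    oa = float((pred_vis.astype(bool) == vis).mean())
    return apd, oa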

Downloads & Resources

📄 Paper: read the full paper with detailed methodology and experiments. [Download PDF]

💾 Dataset: the complete SynthVerse dataset with annotations. Coming Soon.

💻 Code: training and evaluation scripts for point tracking methods. Coming Soon.

Citation

@article{zhao2026synthverse,
  title={SynthVerse: A Large-Scale Diverse Synthetic Dataset for Point Tracking},
  author={Weiguang Zhao and Haoran Xu and Xingyu Miao and Qin Zhao and Rui Zhang and Kaizhu Huang and Ning Gao and Peizhou Cao and Mingze Sun and Mulin Yu and Tao Lu and Linning Xu and Junting Dong and Jiangmiao Pang},
  journal={arXiv preprint arXiv:2602.04441},
  year={2026}
}