DL3DV-10K is a large-scale real-world dataset containing over 10,000 high-quality videos. Each video is manually annotated with key scene points and complexity, and also provides camera pose, NeRF depth estimation, point clouds, and 3D meshes. The dataset can be used for general NeRF research, scene consistency tracking, visual language models, and other computer vision studies.