In video processing, efficiently tracking three-dimensional motion from monocular video has long been a challenging task, especially when long sequences must be tracked with pixel-level precision. Traditional methods can often follow only a sparse set of key points, so they fail to capture a detailed picture of the entire scene.

Moreover, existing techniques are computationally expensive, which makes it difficult to process long videos efficiently. Long-term tracking is also affected by camera movement and object occlusion, both of which lead to tracking errors.

Current methods for motion estimation in video sequences each have strengths and weaknesses. Optical flow provides dense pixel tracking, but it is not robust in complex scenes, especially over long sequences.

Scene flow extends optical flow by estimating dense three-dimensional motion from RGB-D data or point clouds, but it remains difficult to apply efficiently to long sequences. Point tracking methods capture motion trajectories and use spatial and temporal attention for smoother tracking, but their high computational cost still keeps them from tracking densely. Reconstruction-based tracking methods estimate motion with deformation fields, but they are not practical for real-time applications.

Recently, a research team from the University of Massachusetts Amherst, MIT-IBM Watson AI Lab, and Snap Inc. proposed DELTA (Dense Efficient Long-range 3D Tracking for Any video), a method designed to efficiently track every pixel in three-dimensional space. DELTA first tracks at low resolution with a spatio-temporal attention mechanism, then applies an attention-based upsampler to recover high-resolution predictions. Its key innovations are an upsampler that preserves sharp motion boundaries, an efficient spatial attention architecture, and a logarithmic depth representation that improves tracking performance.
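To see why a logarithmic depth representation can help, here is a minimal sketch in plain Python. It is not DELTA's implementation: the function names and the idea of applying an additive residual in log space are assumptions made for illustration, and the exact formulation used by the authors is not given here. The point is only that a fixed step in log depth corresponds to the same relative change for near and far points.

```python
import math

def to_log_depth(depth, eps=1e-6):
    """Map a positive depth to log space (eps guards against log(0))."""
    return math.log(max(depth, eps))

def from_log_depth(log_depth):
    """Map a log-depth value back to a depth."""
    return math.exp(log_depth)

def apply_log_residual(depth, residual):
    """Hypothetical update step: add a residual in log space.

    This is equivalent to multiplying the depth by exp(residual),
    so the change is relative rather than absolute.
    """
    return from_log_depth(to_log_depth(depth) + residual)

# The same residual applied to a near point and a far point.
near = apply_log_residual(1.0, 0.1)
far = apply_log_residual(50.0, 0.1)

# Both depths grow by the same factor, exp(0.1).
print(near / 1.0, far / 50.0)
```

In a linear depth space, the same absolute error would matter far more for the near point than for the far one; working in log space puts both on a comparable scale.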

DELTA achieves state-of-the-art results on the CVO and Kubric3D datasets, improving by over 10% on metrics such as Average Jaccard (AJ) and 3D Average Position Difference (APD3D), and it also performs well on 3D point tracking benchmarks such as TAP-Vid3D and LSFOdyssey. Unlike existing methods, DELTA achieves dense 3D tracking at scale, running over eight times faster than previous methods while maintaining state-of-the-art accuracy.
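For readers unfamiliar with Average Jaccard, the sketch below computes a 2D version in the style used by the TAP-Vid point tracking benchmark: a Jaccard score that penalizes both position and visibility errors, averaged over several distance thresholds. The pixel thresholds (1, 2, 4, 8, 16), the function names, and the toy data are assumptions for illustration; the 3D benchmarks mentioned above use their own threshold conventions, and this is not the evaluation code used for DELTA.

```python
import math

def jaccard_at(pred_xy, pred_vis, gt_xy, gt_vis, thresh):
    """Jaccard score at one distance threshold."""
    tp = fp = fn = 0
    for (px, py), pv, (gx, gy), gv in zip(pred_xy, pred_vis, gt_xy, gt_vis):
        within = math.hypot(px - gx, py - gy) < thresh
        if pv and gv and within:
            tp += 1  # visible in both and close enough
        else:
            if pv:
                fp += 1  # predicted visible, but occluded or too far
            if gv:
                fn += 1  # truly visible, but predicted occluded or too far
    denom = tp + fp + fn
    # No visible points on either side: scored as perfect (a convention chosen here).
    return tp / denom if denom else 1.0

def average_jaccard(pred_xy, pred_vis, gt_xy, gt_vis, thresholds=(1, 2, 4, 8, 16)):
    """Mean of the Jaccard scores over all thresholds."""
    scores = [jaccard_at(pred_xy, pred_vis, gt_xy, gt_vis, t) for t in thresholds]
    return sum(scores) / len(scores)

# Toy example: three points in one frame.
pred_xy = [(0.0, 0.0), (10.0, 10.0), (5.0, 5.0)]
gt_xy = [(0.0, 0.5), (10.0, 13.0), (5.0, 5.0)]
pred_vis = [True, True, False]
gt_vis = [True, True, True]

aj = average_jaccard(pred_xy, pred_vis, gt_xy, gt_vis)
print(aj)
```

Note that a point predicted visible but located outside the threshold counts as both a false positive and a false negative, which is why the metric is stricter than position accuracy alone.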

Experiments show that DELTA excels at 3D tracking, surpassing previous methods in both speed and accuracy. DELTA was trained on the Kubric dataset, which includes more than 5,600 videos, with a loss function that combines 2D coordinate, depth, and visibility terms.
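A loss that combines these three terms could look roughly like the sketch below. Everything specific here is an assumption: the use of L1 distance for coordinates and depth, binary cross-entropy for visibility, the equal default weights, and the names `tracking_loss`, `w_2d`, `w_depth`, and `w_vis` are hypothetical, since the exact terms and weights are not stated here.

```python
import math

def bce(prob, label, eps=1e-7):
    """Binary cross-entropy for one predicted visibility probability."""
    p = min(max(prob, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))

def tracking_loss(pred, target, w_2d=1.0, w_depth=1.0, w_vis=1.0):
    """Weighted sum of 2D, depth, and visibility terms, averaged over points.

    pred:   list of (u, v, depth, visibility_probability)
    target: list of (u, v, depth, visibility_label in {0, 1})
    """
    loss_2d = loss_depth = loss_vis = 0.0
    for (pu, pv, pd, pvis), (tu, tv, td, tvis) in zip(pred, target):
        loss_2d += abs(pu - tu) + abs(pv - tv)  # L1 on pixel coordinates
        loss_depth += abs(pd - td)              # L1 on depth
        loss_vis += bce(pvis, tvis)             # visibility classification
    total = w_2d * loss_2d + w_depth * loss_depth + w_vis * loss_vis
    return total / len(pred)

# One ground-truth point, with a good and a poor prediction.
target = [(10.0, 20.0, 2.0, 1)]
good = tracking_loss([(10.0, 20.0, 2.0, 0.99)], target)
bad = tracking_loss([(12.0, 20.0, 2.5, 0.5)], target)
print(good, bad)
```

The weights let one term be emphasized over the others, for example to keep a large depth error from dominating the pixel-coordinate term.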

In benchmark tests, DELTA achieved the highest scores in long-range 2D tracking and dense 3D tracking on CVO and Kubric3D, completing tasks much faster than other methods. Design choices such as logarithmic depth representation, spatial attention, and attention-based upsampling significantly improve its accuracy and efficiency in various tracking scenarios.

DELTA is an efficient method that can track every pixel in a video, achieving state-of-the-art accuracy with a faster runtime in dense 2D and 3D tracking. The method may struggle with points that stay occluded for a long time, and it performs best on shorter videos of a few hundred frames. The accuracy of DELTA's 3D tracking also depends on the precision and temporal stability of the underlying monocular depth estimation, so advances in monocular depth estimation research are expected to further improve its performance.

Project link: https://snap-research.github.io/DELTA/

Key Points:

🌟 DELTA is a novel method designed to efficiently track every pixel in monocular videos.

⚡ DELTA achieves leading results on the CVO and Kubric3D datasets, running over eight times faster than previous methods.

🔍 The method may struggle with points under long-term occlusion and performs best on shorter videos of a few hundred frames.