Understanding Video Transformers

Concept discovery for explaining the decision-making process of video Transformers

This paper investigates concept-based interpretability for video Transformer representations. Specifically, it aims to explain the decision-making process of video Transformers in terms of high-level spatio-temporal concepts that are discovered automatically. Previous research on concept-based interpretability has focused primarily on image-level tasks. Video models must also handle the time dimension, which increases complexity and makes it harder to identify dynamic concepts that evolve over time. The paper addresses these challenges systematically by introducing the first Video Transformer Concept Discovery (VTCD) algorithm: an unsupervised method that identifies units of video Transformer representations (concepts) and ranks their importance to the model's output. The resulting concepts are highly interpretable and reveal spatio-temporal reasoning mechanisms and object-centric representations inside otherwise black-box video models. A joint analysis of diverse supervised and self-supervised representations shows that some of these mechanisms are shared across video Transformers. Finally, the authors demonstrate that VTCD can be used to improve model performance on fine-grained tasks.
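As a rough illustration of the two steps named in the abstract (unsupervised discovery of concepts, then ranking them by importance to the output), the following minimal Python sketch may help. It is not the VTCD algorithm itself: plain k-means stands in for concept discovery, zero-masking stands in for importance estimation, and the token features, the linear scoring head, and all function names (`kmeans`, `rank_concepts`, `score_fn`) are hypothetical placeholders chosen for this example.

```python
import numpy as np


def kmeans(x, k, iters=20, seed=0):
    """Plain k-means over token features; returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=k, replace=False)].copy()
    labels = np.zeros(len(x), dtype=int)
    for _ in range(iters):
        # Squared Euclidean distance from every token to every centroid.
        dists = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(1)
        for j in range(k):
            members = x[labels == j]
            if len(members):
                centroids[j] = members.mean(0)
    return labels, centroids


def rank_concepts(tokens, labels, k, score_fn):
    """Importance of concept j = drop in score when its tokens are zeroed."""
    base = score_fn(tokens)
    drops = np.zeros(k)
    for j in range(k):
        masked = tokens.copy()
        masked[labels == j] = 0.0
        drops[j] = base - score_fn(masked)
    # Most important concept (largest score drop) first.
    return np.argsort(-drops), drops


# Toy stand-in for one layer's spatio-temporal tokens: T*H*W tokens of width D.
rng = np.random.default_rng(0)
T, H, W, D = 4, 6, 6, 16
tokens = rng.normal(size=(T * H * W, D))

# Hypothetical model head: mean-pool the tokens, then apply a linear score.
w = rng.normal(size=D)


def score_fn(feats):
    return float(feats.mean(0) @ w)


labels, _ = kmeans(tokens, k=5)
order, drops = rank_concepts(tokens, labels, 5, score_fn)
print("concepts ranked by importance:", order.tolist())
```

Because each token carries a (time, height, width) position, reshaping `labels` to `(T, H, W)` would give a per-frame map of where each toy concept appears. A real analysis would use features extracted from an actual video Transformer rather than random data.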

Understanding Video Transformers Visits Over Time

Monthly Visits: 17,788,201
Bounce Rate: 44.87%
Pages per Visit: 5.4
Visit Duration: 00:05:32
