PPLLaVA
An efficient video large language model for video sequence understanding
Video · Video Understanding · Large Language Model
PPLLaVA is an efficient video large language model that combines fine-grained vision-prompt alignment, a convolution-style pooling mechanism that compresses visual tokens based on the user's instruction, and CLIP context extension. Using only 1024 visual tokens, it achieves new state-of-the-art results on benchmarks such as VideoMME, MVBench, VideoChatGPT Bench, and VideoQA Bench, while delivering an 8-fold improvement in throughput.
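The prompt-guided pooling idea can be sketched as follows: each visual patch token is weighted by its similarity to the user instruction in CLIP embedding space, and the weighted tokens are then pooled convolution-style down to a fixed visual-token budget. The snippet below is a minimal illustration under assumed tensor shapes and function names; it is not the project's actual implementation.

```python
# Minimal PyTorch sketch of prompt-guided, convolution-style pooling.
# All names, shapes, and the weighting scheme are illustrative assumptions.
import torch
import torch.nn.functional as F

def prompt_guided_pool(vision_tokens, text_embed, out_t=16, out_hw=8):
    """Compress video tokens conditioned on the user prompt.

    vision_tokens: (T, H, W, D) CLIP patch features for T frames.
    text_embed:    (D,)        pooled CLIP embedding of the user instruction.
    Returns:       (out_t * out_hw * out_hw, D) compressed visual context.
    """
    T, H, W, D = vision_tokens.shape

    # 1. Vision-prompt alignment: cosine similarity between each patch token
    #    and the instruction embedding acts as a relevance weight.
    v = F.normalize(vision_tokens, dim=-1)
    t = F.normalize(text_embed, dim=-1)
    relevance = (v @ t).flatten().softmax(dim=0).reshape(T, H, W)
    relevance = relevance * (T * H * W)  # rescale so the average weight is 1

    # 2. Weight tokens by relevance, then pool with a convolution-style
    #    (adaptive average) kernel over time and space to a fixed budget.
    weighted = vision_tokens * relevance.unsqueeze(-1)         # (T, H, W, D)
    x = weighted.permute(3, 0, 1, 2).unsqueeze(0)              # (1, D, T, H, W)
    pooled = F.adaptive_avg_pool3d(x, (out_t, out_hw, out_hw))
    return pooled.squeeze(0).permute(1, 2, 3, 0).reshape(-1, D)

# Example: 64 frames of 16x16 patches compressed to 16 * 8 * 8 = 1024 tokens.
tokens = torch.randn(64, 16, 16, 768)
prompt = torch.randn(768)
print(prompt_guided_pool(tokens, prompt).shape)  # torch.Size([1024, 768])
```

The adaptive average pool here stands in for the pooling step; the key point it illustrates is that the output token count (1024 in the example) is fixed regardless of how many frames the video contains.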
PPLLaVA Visits Over Time
Monthly Visits: 515,580,771
Bounce Rate: 37.20%
Pages per Visit: 5.8
Visit Duration: 00:06:42