The Beijing Academy of Artificial Intelligence (BAAI), in collaboration with Shanghai Jiao Tong University, Renmin University of China, Peking University, and Beijing University of Posts and Telecommunications, has introduced Video-XL, a large-scale model for long video understanding. Long video understanding is a concentrated demonstration of the core capabilities of multimodal large models and a key step toward general artificial intelligence (AGI). Compared with existing multimodal large models, Video-XL delivers superior performance and efficiency on videos longer than ten minutes.
Video-XL leverages the native capabilities of large language models (LLMs) to compress long visual sequences, retains the ability to understand short videos, and generalizes exceptionally well to long video understanding. The model ranks first on multiple tasks across several mainstream long video understanding benchmarks. It also strikes a good balance between efficiency and performance: a single 80GB GPU suffices to process 2048 frames of input, enough to sample an hour-long video, and the model achieves close to 95% accuracy on the video "needle-in-a-haystack" task.
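The core idea, using the LLM itself to condense a long visual token sequence into a much shorter context, can be illustrated with a rough sketch. The sketch below assumes a chunk-and-summarize scheme: a learnable summary token is inserted after every chunk of visual tokens, the interleaved sequence is run through the model, and only the states at the summary positions are kept as the compressed context. This is a minimal toy illustration, not the official implementation; the names (`VST_RATIO`, `summarize_visual_tokens`, the stand-in transformer) are hypothetical, and the real system would retain KV-cache entries rather than final hidden states.

```python
# Illustrative sketch (not the official Video-XL code) of LLM-driven visual
# context compression: interleave summary tokens into a long visual token
# sequence and keep only the summary positions' states as compressed context.
import torch

VST_RATIO = 16  # assumed ratio: one summary token per 16 visual tokens

def summarize_visual_tokens(visual_tokens: torch.Tensor,
                            summary_embedding: torch.Tensor,
                            llm) -> torch.Tensor:
    """Compress a (seq_len, dim) visual sequence to roughly seq_len / VST_RATIO."""
    interleaved, summary_positions, pos = [], [], 0
    for chunk in visual_tokens.split(VST_RATIO, dim=0):
        interleaved.append(chunk)                          # raw visual tokens
        interleaved.append(summary_embedding.unsqueeze(0))  # one summary token
        pos += chunk.shape[0]
        summary_positions.append(pos)                      # index of summary token
        pos += 1
    sequence = torch.cat(interleaved, dim=0)
    hidden = llm(sequence.unsqueeze(0))                    # (1, seq, dim)
    # Keep only the summary-token states; drop the raw visual tokens.
    return hidden[0, summary_positions]

# Toy usage with a stand-in "LLM" and random visual tokens.
dim = 64
toy_llm = torch.nn.TransformerEncoder(
    torch.nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True),
    num_layers=1,
)
visual_tokens = torch.randn(2048, dim)     # stand-in for tokens from sampled frames
summary_token = torch.randn(dim)           # stand-in for a learnable embedding
with torch.no_grad():
    compressed = summarize_visual_tokens(visual_tokens, summary_token, toy_llm)
print(compressed.shape)                    # (128, 64): a 16x shorter context
```

Because only the compact summary states are carried forward, the memory footprint grows with the compressed length rather than the raw frame count, which is consistent with the reported ability to fit 2048 frames on a single 80GB GPU.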
Video-XL shows promise in applications such as movie summarization, video anomaly detection, and advertisement placement detection, making it a powerful assistant for long video understanding. Its release marks a clear step forward in the efficiency and accuracy of long video understanding technology and provides solid technical support for the automated analysis of long video content.
The Video-XL model code has been open-sourced to promote collaboration and technical sharing within the global multimodal video understanding research community.
Paper Title: Video-XL: Extra-Long Vision Language Model for Hour-Scale Video Understanding
Paper Link: https://arxiv.org/abs/2409.14485
Model Link: https://huggingface.co/sy1998/Video_XL
Project Link: https://github.com/VectorSpaceLab/Video-XL
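The repository ships its own inference pipeline, so rather than guess at its API, the sketch below illustrates a preprocessing step that hour-scale video models of this kind depend on: uniformly sampling frames from a long video. The 2048-frame budget mirrors the capacity reported above; the use of `decord` and the function name `sample_frames` are illustrative choices, not the project's prescribed pipeline.

```python
# Minimal sketch of uniform frame sampling for hour-scale videos. The frame
# budget and library choice are illustrative, not Video-XL's required setup.
import numpy as np
from decord import VideoReader, cpu

def sample_frames(video_path: str, num_frames: int = 2048) -> np.ndarray:
    """Uniformly sample up to `num_frames` RGB frames from a video file."""
    vr = VideoReader(video_path, ctx=cpu(0))
    total = len(vr)
    indices = np.linspace(0, total - 1, num=min(num_frames, total)).astype(int)
    return vr.get_batch(indices).asnumpy()  # (N, H, W, 3) uint8

frames = sample_frames("movie.mp4")  # placeholder path
print(frames.shape)
```

Uniform sampling keeps the input length bounded regardless of video duration; the sampled frames would then be encoded into visual tokens and compressed as outlined earlier.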