Recently, the VITA-MLLM team announced the launch of VITA-1.5, an upgraded version of VITA-1.0 aimed at improving the real-time performance and accuracy of multimodal interaction. VITA-1.5 supports both English and Chinese and delivers significant improvements across a range of performance metrics, giving users a smoother interactive experience.


In VITA-1.5, interaction latency has been cut from the original 4 seconds to just 1.5 seconds, making the delay nearly imperceptible during voice interactions. Multimodal performance has also improved markedly: evaluations show that the average score across several benchmarks, including MME, MMBench, and MathVista, rises from 59.8 in the previous version to 70.8, demonstrating its strong capabilities.

VITA-1.5 has also undergone extensive optimization of its speech processing. The error rate of its Automatic Speech Recognition (ASR) system has been reduced significantly, from 18.4 to 7.5, enabling it to understand and respond to voice commands more accurately. In addition, VITA-1.5 introduces an end-to-end Text-to-Speech (TTS) module that can directly accept embeddings from the large language model (LLM) as input, improving the naturalness and coherence of the synthesized speech.
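To make the end-to-end design concrete, the sketch below shows one way a TTS head could consume LLM hidden-state embeddings directly instead of decoded text. This is a minimal illustration, not VITA-1.5's actual implementation: the module name `EmbeddingTTSHead`, the dimensions, and the speech-token output are all assumptions.

```python
# Minimal sketch (not the VITA-1.5 implementation): a TTS head that consumes
# LLM hidden-state embeddings instead of decoded text.
# All module names and dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class EmbeddingTTSHead(nn.Module):
    def __init__(self, llm_dim=4096, tts_dim=1024, n_speech_tokens=1024, n_layers=4):
        super().__init__()
        # Project LLM embeddings into the TTS decoder's space.
        self.proj = nn.Linear(llm_dim, tts_dim)
        layer = nn.TransformerEncoderLayer(d_model=tts_dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        # Predict discrete speech tokens (e.g., codec codes) for a vocoder.
        self.to_speech_tokens = nn.Linear(tts_dim, n_speech_tokens)

    def forward(self, llm_hidden_states):
        # llm_hidden_states: (batch, seq_len, llm_dim) taken from the language model.
        x = self.proj(llm_hidden_states)
        x = self.decoder(x)
        return self.to_speech_tokens(x)  # logits over a speech-token vocabulary

# Usage: feed the LLM's last hidden states straight into the TTS head,
# skipping a text -> external-TTS round trip.
llm_hidden = torch.randn(1, 32, 4096)           # placeholder LLM output
speech_logits = EmbeddingTTSHead()(llm_hidden)  # shape (1, 32, 1024)
```

Feeding embeddings rather than text lets the speech synthesizer be conditioned directly on the language model's internal representation, which is one way to pursue the naturalness and latency gains the release describes.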

To keep its multimodal capabilities balanced, VITA-1.5 adopts a progressive training strategy that minimizes the impact of the newly added speech modules on vision-language performance; image understanding dips only slightly, from 71.3 to 70.8. Through these technical innovations, the team has further pushed the boundaries of real-time visual and voice interaction, laying the groundwork for future intelligent interactive applications.
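A progressive strategy of this kind can be pictured as staged training in which only some components are trainable at a time, so that adding speech does not erase vision-language skills. The following is a rough sketch under assumed module names (`vision_encoder`, `audio_encoder`, `llm`) and an invented `train_one_stage` helper; the actual stage order and training data are described in the VITA-1.5 paper and repository.

```python
# Rough sketch of progressive (staged) training. Module names, stage contents,
# and the `train_one_stage` helper are hypothetical illustrations.
def set_trainable(module, flag):
    for p in module.parameters():
        p.requires_grad = flag

def progressive_training(model, stages):
    for stage in stages:
        # Unfreeze only this stage's target modules, so newly added speech
        # components don't disturb already-learned vision-language behavior.
        for name in ("vision_encoder", "audio_encoder", "llm"):
            set_trainable(getattr(model, name), name in stage["train"])
        train_one_stage(model, data=stage["data"])  # hypothetical training helper

stages = [
    {"train": {"vision_encoder", "llm"}, "data": "vision_language"},    # stage 1
    {"train": {"audio_encoder"},         "data": "speech_asr"},         # stage 2
    {"train": {"audio_encoder", "llm"},  "data": "speech_interaction"}, # stage 3
]
```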


On the usage side, developers can get started with VITA-1.5 through a few simple command-line operations, and the repository provides both a basic demo and a real-time interactive demo. For real-time interaction, users need to prepare a few supporting modules, such as a Voice Activity Detection (VAD) module. The VITA-1.5 code is open source, making it easy for a wide range of developers to participate and contribute.
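As an example of what "preparing a VAD module" can look like, the snippet below gates a recorded audio clip with the open-source Silero VAD (loaded via torch.hub) before handing it on to the model. This is a hedged illustration of the idea, not VITA-1.5's prescribed setup; the exact VAD component and demo commands are documented in the project's README, and the audio file name here is hypothetical.

```python
# Hedged sketch: use a VAD to check for speech before sending audio to the model.
# Silero VAD is used here as a commonly available example; consult the VITA-1.5
# README for the component its real-time demo actually expects.
import torch

model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
(get_speech_timestamps, _, read_audio, _, _) = utils

wav = read_audio("question.wav", sampling_rate=16000)          # hypothetical file
speech = get_speech_timestamps(wav, model, sampling_rate=16000)

if speech:
    print("Speech detected, forwarding audio to VITA-1.5 for a response...")
else:
    print("No speech detected, skipping the request.")
```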

The launch of VITA-1.5 marks another significant advancement in the field of interactive multimodal large language models, reflecting the team's relentless pursuit of technological innovation and user experience.

Project link: https://github.com/VITA-MLLM/VITA?tab=readme-ov-file

Key Highlights:

🌟 VITA-1.5 significantly reduces interaction delay from 4 seconds to 1.5 seconds, greatly enhancing user experience.

📈 Improved multimodal performance, with the average score across multiple benchmarks rising from 59.8 to 70.8.

🔊 Enhanced speech processing capabilities, with ASR error rate decreasing from 18.4 to 7.5, resulting in more accurate speech recognition.