Recently, NVIDIA, in collaboration with research teams from Georgia Tech, UMD, and HKPU, has introduced a cutting-edge vision-language model — NVEagle. This model can interpret images and engage in conversation, functioning as a super assistant that can both see and speak.

For instance, in the example below, when the NVEagle model is asked who the person in the image is, it interprets the picture and answers: Jensen Huang. Remarkably accurate.

[Image: NVEagle identifying Jensen Huang in a photo]

This multimodal large language model (MLLM) takes a significant step forward in integrating visual and linguistic information. NVEagle can understand complex real-world scenes and grounds its interpretations and responses in visual input. At the core of its design, images are converted into visual tokens that are combined with text embeddings, improving the model's grasp of visual information.
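To make the visual-token idea concrete, here is a minimal PyTorch sketch of how image features might be projected into a language model's embedding space and concatenated with text embeddings. The class name, layer choices, and dimensions are illustrative assumptions, not NVEagle's actual implementation.

```python
import torch
import torch.nn as nn

# Minimal sketch of the visual-token idea (illustrative only; the layer
# choices and sizes are assumptions, not NVEagle's real architecture).

class VisualTokenFusion(nn.Module):
    def __init__(self, vision_dim=1024, llm_dim=4096, vocab_size=32000):
        super().__init__()
        # Projects vision-encoder patch features into the LLM's embedding space.
        self.projector = nn.Linear(vision_dim, llm_dim)
        self.text_embedding = nn.Embedding(vocab_size, llm_dim)

    def forward(self, patch_features, text_token_ids):
        # patch_features: (batch, num_patches, vision_dim) from a vision encoder
        visual_tokens = self.projector(patch_features)         # (B, P, llm_dim)
        text_embeds = self.text_embedding(text_token_ids)      # (B, T, llm_dim)
        # Visual tokens are prepended to the text embeddings; the combined
        # sequence is what the language model would consume.
        return torch.cat([visual_tokens, text_embeds], dim=1)  # (B, P+T, llm_dim)

# Example with dummy inputs
fusion = VisualTokenFusion()
patches = torch.randn(1, 576, 1024)        # e.g. a 24x24 grid of image patches
tokens = torch.randint(0, 32000, (1, 16))  # a short text prompt
sequence = fusion(patches, tokens)
print(sequence.shape)  # torch.Size([1, 592, 4096])
```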

However, building such a powerful model comes with numerous challenges, particularly in strengthening visual perception. Research shows that many existing models suffer from "hallucinations," producing inaccurate or meaningless outputs, especially when handling high-resolution images. This is particularly evident in tasks requiring detailed analysis, such as optical character recognition (OCR) and document understanding. To overcome these difficulties, the research team explored a range of methods, including testing different visual encoders and fusion strategies.

The launch of NVEagle represents the culmination of this research, encompassing three versions: Eagle-X5-7B, Eagle-X5-13B, and Eagle-X5-13B-Chat. The 7B and 13B versions are primarily used for general vision-language tasks, while the 13B-Chat version is fine-tuned specifically for conversational AI, enabling better interaction based on visual inputs.


A notable feature of NVEagle is its mixture-of-experts (MoE) approach to visual encoding: the model draws on multiple visual experts so that the encoder best suited to a given task can be brought to bear, significantly enhancing its ability to process complex visual information. The model has been released on Hugging Face, making it accessible to researchers and developers.
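The sketch below illustrates the general mixture-of-visual-experts idea: features from several vision encoders are fused (here by simple channel concatenation) and projected into the language model's embedding space. The encoder stand-ins, the fusion strategy, and all dimensions are assumptions for illustration and are not taken from NVEagle's code.

```python
import torch
import torch.nn as nn

# Illustrative sketch of a mixture of visual experts. The encoders are
# simplified stand-ins and the fusion choice is an assumption, not
# NVEagle's exact implementation.

class MixtureOfVisualExperts(nn.Module):
    def __init__(self, expert_dims=(1024, 768, 512), llm_dim=4096):
        super().__init__()
        # Stand-ins for heterogeneous vision encoders (real systems might
        # use CLIP-, ConvNeXt-, or SAM-style backbones here).
        self.experts = nn.ModuleList(nn.Linear(3 * 16 * 16, d) for d in expert_dims)
        # Concatenate expert features along the channel axis, then project
        # them into the LLM embedding space.
        self.projector = nn.Linear(sum(expert_dims), llm_dim)

    def forward(self, patches):
        # patches: (batch, num_patches, 3*16*16) flattened image patches
        expert_features = [expert(patches) for expert in self.experts]
        fused = torch.cat(expert_features, dim=-1)  # channel concatenation
        return self.projector(fused)                # (batch, num_patches, llm_dim)

moe = MixtureOfVisualExperts()
patches = torch.randn(1, 576, 3 * 16 * 16)
visual_tokens = moe(patches)
print(visual_tokens.shape)  # torch.Size([1, 576, 4096])
```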


NVEagle has performed exceptionally well in various benchmark tests. For example, in OCR tasks, the Eagle model achieved an average score of 85.9 on OCRBench, surpassing other leading models like InternVL and LLaVA-HR. In the TextVQA test, it scored 88.8, and in complex visual question-answering tasks, it scored 65.7 on the GQA test. Additionally, the model's performance continues to improve with the addition of extra visual experts.

Through systematic design exploration and optimization, the NVEagle series has successfully addressed several key challenges in visual perception, paving the way for further development of vision-language models.

Demo: https://huggingface.co/spaces/NVEagle/Eagle-X5-13B-Chat

Key Points:

🌟 NVEagle is NVIDIA's new-generation vision-language model aimed at enhancing understanding of complex visual information.

📈 The model includes three versions, each suited for different tasks, with the 13B-Chat version focusing on conversational AI.

🏆 NVEagle outperforms many leading models in multiple benchmark tests, demonstrating outstanding performance.