DriveVLM

Fusion of Autonomous Driving and Visual Language Models

DriveVLM is an autonomous driving system that uses visual language models (VLMs) to improve scene understanding and planning. It combines three reasoning modules (scene description, scene analysis, and hierarchical planning) to better handle complex and long-tail scenarios. Because VLMs are weak at spatial reasoning and computationally demanding, DriveVLM-Dual was developed as a hybrid system that integrates DriveVLM with a traditional autonomous driving pipeline. Experiments on the nuScenes and SUP-AD datasets show that both DriveVLM and DriveVLM-Dual are effective in complex and unpredictable driving conditions. DriveVLM-Dual has also been deployed in production vehicles, validating it in real-world driving environments.
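The three-module flow described above can be pictured as a simple chain in which each stage feeds the next. The sketch below is illustrative only and is not DriveVLM's actual code: the `VLM` callable, the prompt strings, the `Plan` fields, and the `refine` step are all hypothetical names chosen for this example. The `refine` callable stands in for whatever turns a coarse plan into a trajectory, for instance a traditional planner in a Dual-style hybrid.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical stand-in for a VLM call: takes a text prompt, returns text.
VLM = Callable[[str], str]
Waypoint = Tuple[float, float]


@dataclass
class Plan:
    description: str            # output of the scene description stage
    analysis: str               # output of the scene analysis stage
    meta_actions: List[str]     # coarse level of the hierarchical plan
    waypoints: List[Waypoint]   # fine level of the hierarchical plan


def run_pipeline(vlm: VLM, frame_summary: str,
                 refine: Callable[[List[str]], List[Waypoint]]) -> Plan:
    """Chain description -> analysis -> hierarchical planning (assumed layout)."""
    description = vlm(f"Describe the driving scene: {frame_summary}")
    analysis = vlm(f"Analyze the critical objects given: {description}")
    actions_text = vlm(f"Propose meta-actions given: {analysis}")
    # Coarse plan: a comma-separated list of high-level actions.
    meta_actions = [a.strip() for a in actions_text.split(",") if a.strip()]
    # Fine plan: delegate to a separate refinement step.
    waypoints = refine(meta_actions)
    return Plan(description, analysis, meta_actions, waypoints)


def fake_vlm(prompt: str) -> str:
    """Canned responses so the sketch runs without any model."""
    if prompt.startswith("Describe"):
        return "rainy night, construction zone ahead"
    if prompt.startswith("Analyze"):
        return "cones block the right lane"
    return "slow down, change lane left"


def fake_refine(meta_actions: List[str]) -> List[Waypoint]:
    """Toy refinement: one waypoint per meta-action, spaced 5 m apart."""
    return [(5.0 * (i + 1), 0.0) for i, _ in enumerate(meta_actions)]


plan = run_pipeline(fake_vlm, "front camera frame", fake_refine)
print(plan.meta_actions)
print(plan.waypoints)
```

Keeping the refinement step as a separate callable mirrors the hybrid idea in the description: the language model handles the reasoning, while a conventional component can take over the spatially precise part.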

DriveVLM Alternatives

Large World Models — Large World Models: Understanding Video and Language

We, Robot — Tesla's Vision for Autonomous Driving Technology and Robotics

GenAD — A large-scale video generation model for autonomous driving

MiniGPT-4 — An advanced large language model enhanced for visual language understanding.

Visual Sketchpad — A visual reasoning tool for multimodal large language models (LLMs)

Vary — Visual Vocabulary Expansion for Large-Scale Visual Language Models

GAIA-2 — GAIA-2 is an advanced video generation model for creating safe autonomous driving scenarios.

POINTS-Qwen-2-5-7B-Chat — Latest advancements in visual language models

DeepSeek-VL2 — An advanced multimodal understanding model that integrates visual and linguistic capabilities.

Qwen2-VL-2B — A state-of-the-art visual language model that supports multimodal understanding and text generation.

VSP-LLM — A framework that combines Visual Speech Processing with Large Language Models

OpenEMMA — An open-source end-to-end multimodal model for autonomous driving.

BlockFusion — Expands 3D scene generation models

ColPali — Efficient document retrieval tool based on visual language models

UniTok — UniTok is a unified visual tokenizer for visual generation and understanding.

DiffusionDrive — A truncated diffusion model for real-time end-to-end autonomous driving.

vision-parse — Utilizes visual language models to parse PDFs into Markdown.

InternLM-XComposer2 — A large visual language model specializing in free-form text-to-image synthesis and understanding.

LaVi-Bridge — Connects different language models and generative visual models for text-to-image generation

Models Table — A comprehensive list and information about large language models

Qwen2-VL-72B — The latest visual language model supporting multilingual and multimodal understanding

MMStar — An elite benchmark dataset for evaluating large visual language models

MM1.5 — Optimization and analysis of multimodal large language models

Florence-VL — Enhancement tool for visual language models, combining a generative visual encoder with depth-breadth fusion.

VLM-R1 — VLM-R1 is a stable and versatile reinforcement learning-enhanced visual-language model focused on visual understanding tasks.

InternLM-XComposer-2.5 — A Multifunctional Large Visual Language Model

Qwen2-VL-7B — Qwen2-VL-7B is the latest visual language model that supports multimodal understanding and text generation.

MiniGemini — A multimodal large language model capable of understanding and generating images

InternVL2_5-1B-MPO — A multimodal large language model that enhances integrated understanding of visual and language data.