Waymo recently announced a significant breakthrough: a new model built on Google's multimodal large language model (MLLM) Gemini for its autonomous taxi service. The model, dubbed EMMA (End-to-End Multimodal Model for Autonomous Driving), processes sensor data to generate future trajectories for autonomous vehicles, helping them decide where to go and how to avoid obstacles.
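To make the "trajectories" concrete: EMMA-style models reportedly represent both inputs and outputs as text, so a predicted path can be returned as a string of waypoints and parsed downstream. The sketch below is a hypothetical illustration of that decoding step; the waypoint format and function name are assumptions, not Waymo's actual interface.

```python
import re

def parse_waypoints(trajectory_text: str) -> list[tuple[float, float]]:
    """Parse '(x, y)' waypoint pairs out of a model's text response.

    The '(x, y)' comma-separated format is an assumed example format,
    not the one EMMA actually emits.
    """
    pairs = re.findall(r"\(\s*(-?\d+\.?\d*)\s*,\s*(-?\d+\.?\d*)\s*\)", trajectory_text)
    return [(float(x), float(y)) for x, y in pairs]

# A fabricated model response listing future ego positions (e.g. in meters):
response = "Future trajectory: (0.0, 0.0), (1.2, 0.1), (2.5, 0.3)"
print(parse_waypoints(response))  # [(0.0, 0.0), (1.2, 0.1), (2.5, 0.3)]
```

Treating trajectories as plain text is what lets a language model produce them at all: the planner's output becomes just another sequence for the MLLM to generate.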
EMMA is one of the first signs that leading autonomous-driving companies plan to put MLLMs to work in their operations, suggesting these models can move beyond their current roles as chatbots, email assistants, and image generators into entirely new environments such as public roads.
Waymo's research team stated that MLLMs like Gemini offer intriguing possibilities for autonomous systems for two reasons: they are "generalists" trained on vast amounts of data scraped from the internet, "capable of providing rich 'world knowledge' beyond what is contained in ordinary driving logs"; and they demonstrate "superior" reasoning ability through techniques such as "chain-of-thought reasoning," which mimics human reasoning by breaking complex tasks into a series of logical steps.
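Chain-of-thought prompting here means asking the model to spell out intermediate reasoning (identify objects, assess their influence, decide, then plan) before committing to a trajectory. The sketch below is a hypothetical illustration of such a prompt; the wording and step structure are assumptions, not Waymo's actual prompts.

```python
def build_cot_prompt(scene_description: str) -> str:
    """Assemble a chain-of-thought style prompt for a driving decision.

    The step list is an illustrative assumption of how a complex driving
    task might be broken into a series of logical steps.
    """
    steps = [
        "1. List the critical objects in the scene and their motion.",
        "2. Describe how each object could affect the ego vehicle.",
        "3. State the high-level driving decision (e.g. yield, proceed).",
        "4. Output the planned trajectory as waypoints.",
    ]
    return (
        f"Scene: {scene_description}\n"
        "Reason step by step before answering:\n" + "\n".join(steps)
    )

print(build_cot_prompt("Pedestrian waiting at a crosswalk ahead; light is green."))
```

The point of the intermediate steps is interpretability as much as accuracy: the model's stated reasoning can be inspected when its final plan looks wrong.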
Waymo's EMMA model excels at trajectory prediction, object detection, and road-map understanding, but it has limitations: it cannot incorporate 3D sensor inputs from lidar or radar, and it can process only a small number of image frames at a time. Using MLLMs to train autonomous taxis also carries risks, such as the model hallucinating or failing at simple tasks.
Waymo therefore says that further research is needed to address these issues and to advance the state of the art in autonomous-driving model architectures.