FLAME

[AAAI-25 Oral] Official Implementation of "FLAME: Learning to Navigate with Multimodal LLM in Urban Environments"

embodied-agent large-multimodal-models multimodal-large-language-models streetview vision-and-language-navigation vision-language-model

Created: 2024-08-20T22:43:06

Updated: 2025-03-27T03:29:16

https://flame-sjtu.github.io

Related projects

MobileAgent

agent

Mobile-Agent: The Powerful Mobile Device Operation Assistant Family

4132

1 month ago

+14 today

Star Vector

llm

StarVector is a foundation model for SVG generation that transforms vectorization into a code generation task. Using a vision-language modeling architecture, StarVector processes both visual and textual inputs to produce high-quality SVG code with remarkable precision.

3700

1 month ago

+18 today

RPG DiffusionMaster

image-editing

[ICML 2024] Mastering Text-to-Image Diffusion: Recaptioning, Planning, and Generating with Multimodal LLMs (RPG)

1796

1 month ago

+1 today

Awesome Embodied Robotics And Agent

agent

This is a curated list of "Embodied AI or robot with Large Language Models" research. Watch this repository for the latest updates!

1311

1 month ago

+3 today

ShareGPT4Video

chatgpt

[NeurIPS 2024] An official implementation of ShareGPT4Video: Improving Video Understanding and Generation with Better Captions

1053

1 month ago

Ovis

chatbot

A novel Multimodal Large Language Model (MLLM) architecture, designed to structurally align visual and textual embeddings.

896

1 month ago

+2 today

VideoChat

asr

Real-time voice-interactive digital human, supporting an end-to-end voice solution (GLM-4-Voice - THG) and a cascaded solution (ASR-LLM-TTS-THG). Appearance and voice are customizable with no training required, voice cloning is supported, and first-packet latency is as low as 3s.

894

1 month ago

+2 today