ByteDance's Latest Large-Model Advance: Visual Localization Introduced for the First Time for Fine-Grained Multimodal Joint Understanding, Now Open Source with a Demo Available
新智元
ByteDance has launched BuboGPT, a model that supports joint multimodal understanding of text, images, and audio and, for the first time, incorporates visual localization to pinpoint objects within images. The researchers report good results on multimodal tasks using a training scheme based on multimodal instruction tuning. The model has been open-sourced, and a playable demo page is available.
© Copyright AIbase 2024. Source: https://www.aibase.com/news/508