The article introduces BuboGPT, a multi-modal large language model released by ByteDance that jointly understands text, images, and audio. According to the article, it is the first such model to incorporate visual grounding, allowing it to accurately locate within an image the objects its responses refer to. Using a training scheme centered on multi-modal instruction tuning, the researchers report strong results across multi-modal tasks. The model has been open-sourced, and a playable demo page is provided.
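For readers curious how joint text-image-audio understanding is typically wired together, the sketch below illustrates one common pattern behind multi-modal LLMs of this kind: frozen modality encoders produce feature sequences, small learned projections map them into the LLM's token-embedding space, and the projected "modality tokens" are concatenated with the embedded text prompt. This is a minimal, illustrative sketch only; the module names, dimensions, and simple linear projections are assumptions, not BuboGPT's published architecture.

```python
# Minimal sketch (assumptions, not BuboGPT's actual configuration) of the
# common pattern for multi-modal LLMs: project encoder features into the
# LLM's embedding space, then feed modality tokens alongside text tokens.

import torch
import torch.nn as nn


class ModalityProjector(nn.Module):
    """Learned linear map aligning encoder features with the LLM embedding size."""

    def __init__(self, enc_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Linear(enc_dim, llm_dim)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return self.proj(feats)


# Hypothetical dimensions chosen for illustration.
VISION_DIM, AUDIO_DIM, LLM_DIM = 1024, 768, 4096

vision_proj = ModalityProjector(VISION_DIM, LLM_DIM)
audio_proj = ModalityProjector(AUDIO_DIM, LLM_DIM)

# Stand-ins for the outputs of frozen image/audio encoders,
# shaped (batch, num_patches_or_frames, encoder_dim).
image_feats = torch.randn(1, 32, VISION_DIM)
audio_feats = torch.randn(1, 16, AUDIO_DIM)

# Stand-in for the LLM's embedded text prompt, shaped (batch, seq_len, llm_dim).
text_embeds = torch.randn(1, 24, LLM_DIM)

# The concatenated sequence is what the LLM would consume during
# multi-modal instruction tuning: modality tokens followed by text tokens.
multimodal_input = torch.cat(
    [vision_proj(image_feats), audio_proj(audio_feats), text_embeds], dim=1
)
print(multimodal_input.shape)  # torch.Size([1, 72, 4096])
```

In this setup, only the lightweight projectors need to be trained during instruction tuning, which keeps the large encoders and the LLM frozen and the overall training cost modest.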