Tarsier is a family of large-scale video-language models developed by the ByteDance research team, designed to generate high-quality video descriptions and to provide strong general video comprehension. The models are trained in two stages, multi-task pre-training followed by multi-granularity instruction fine-tuning, a strategy that markedly improves the accuracy and level of detail of the generated descriptions. Tarsier's main strengths are precise video description, understanding of complex video content, and state-of-the-art (SOTA) results on multiple video comprehension benchmarks. It was developed to address the limited detail and accuracy of existing video-language models, relying on extensive training over high-quality data together with the two-stage strategy above. No pricing has been announced; the models target academic research and commercial applications and suit scenarios that require high-quality understanding and generation of video content.
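To make the description-generation use case concrete, below is a minimal sketch of the frame-sampling preprocessing that video-language models of this kind typically consume. The OpenCV calls are standard; the model invocation itself is left as a comment because this entry does not document Tarsier's API, and the file name `example.mp4` and `num_frames=8` are illustrative assumptions (see the official repository, github.com/bytedance/tarsier, for the actual inference code).

```python
# Hypothetical preprocessing sketch: uniformly sample frames from a video
# clip as input for a video description model such as Tarsier. Only the
# frame sampling is shown; the model call is not part of this sketch.
import cv2  # pip install opencv-python

def sample_frames(video_path: str, num_frames: int = 8):
    """Uniformly sample `num_frames` RGB frames from a video file."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    if total <= 0:
        raise ValueError(f"Could not read frames from {video_path}")
    # Pick evenly spaced frame indices across the whole clip.
    indices = [int(i * total / num_frames) for i in range(num_frames)]
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, idx)
        ok, frame_bgr = cap.read()
        if ok:
            # OpenCV decodes to BGR; convert to RGB for model input.
            frames.append(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))
    cap.release()
    return frames

frames = sample_frames("example.mp4", num_frames=8)
# The sampled frames would then be passed to the model together with a
# text prompt such as "Describe the video in detail." -- refer to the
# official repo for how Tarsier actually consumes frames and prompts.
print(f"Sampled {len(frames)} frames of shape {frames[0].shape}")
```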