LongVA is a long context transformer model capable of processing over 2000 frames or 200K visual tokens. It achieves leading performance in Video-MME among 7B models. The model is tested on CUDA 11.8 and A100-SXM-80G and can be quickly deployed and used through the Hugging Face platform.