SlowFast-LLaVA is a training-free multimodal large language model for video understanding and reasoning. Without fine-tuning on any data, it achieves performance comparable to, or even better than, state-of-the-art video large language models on a range of video question-answering tasks and benchmarks.