VILA is a pre-trained visual language model (VLM) that achieves video and multi-image understanding capabilities through pre-training with large-scale interleaved image-text data. VILA can be deployed on edge devices using the AWQ 4bit quantization and TinyChat framework. Key advantages include: 1) Interleaved image-text data is crucial for performance enhancement; 2) Not freezing the large language model (LLM) during interleaved image-text pre-training promotes context learning; 3) Re-mixing text instruction data is critical for boosting VLM and plain text performance; 4) Token compression can expand the number of video frames. VILA demonstrates captivating capabilities including video inference, context learning, visual reasoning chains, and better world knowledge.