MEGVII Technology has released Vary-toy, a compact large vision-language model with an improved visual vocabulary, designed for standard GPUs. By optimizing how the visual vocabulary is generated, it aims to strengthen image perception. Vary-toy reports strong results on several benchmarks, including DocVQA, ChartQA, and RefCOCO. Its small size makes it a practical baseline for researchers with limited resources. The researchers plan to release the code publicly to encourage further research and adoption.