Cola

Large language models are visual reasoning coordinators.

CommonProductProductivityLanguage ModelVisual Reasoning
Cola is a method that uses a language model (LM) to aggregate the outputs of 2 or more vision-language models (VLMs). Our model assembly method is called Cola (COordinative LAnguage model or visual reasoning). Cola performs best when the LM is fine-tuned (called Cola-FT). Cola is also effective in zero-shot or few-shot context learning (called Cola-Zero). In addition to performance improvements, Cola is also more robust to VLM errors. We demonstrate that Cola can be applied to various VLMs (including large multimodal models like InstructBLIP) and 7 datasets (VQA v2, OK-VQA, A-OKVQA, e-SNLI-VE, VSR, CLEVR, GQA), and it consistently improves performance.
Visit

Cola Visit Over Time

Monthly Visits

494758773

Bounce Rate

37.69%

Page per Visit

5.7

Visit Duration

00:06:29

Cola Visit Trend

Cola Visit Geography

Cola Traffic Sources

Cola Alternatives