Researchers from Nanjing University and Megvii Research Institute have jointly proposed SeVa, an unsupervised paradigm for preference alignment of visual-language models (VLMs). It tackles the alignment problem without any human or GPT-4 annotation, significantly reducing the cost of alignment.

The core of this technology is an automated pipeline for constructing preference data. Comparing the model's outputs before and after preference alignment reveals clear differences in behavior. The researchers found that even minor image augmentations can lead a VLM to produce different responses to the same question. They therefore treat responses to the original image as positive (chosen) samples and responses to the augmented image as negative (rejected) samples for preference training, as sketched below.
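
The snippet below is a rough sketch of that pair-construction idea: query the model on an original image and on a lightly augmented copy, and keep the two answers as a chosen/rejected pair whenever they differ. The names `vlm_answer`, `build_pairs`, and `PreferencePair`, as well as the specific augmentations, are hypothetical placeholders and not the authors' actual API; the real pipeline is in the repository linked at the end.

```python
from dataclasses import dataclass
from PIL import Image
import torchvision.transforms as T

# A light augmentation; the exact transforms used by SeVa may differ.
augment = T.Compose([
    T.RandomResizedCrop(336, scale=(0.8, 1.0)),
    T.ColorJitter(brightness=0.2, contrast=0.2),
])

@dataclass
class PreferencePair:
    image: Image.Image
    question: str
    chosen: str    # response to the original image
    rejected: str  # response to the augmented image

def vlm_answer(image: Image.Image, question: str) -> str:
    """Placeholder for querying the vision-language model."""
    raise NotImplementedError

def build_pairs(samples):
    """samples: iterable of (PIL image, question) tuples."""
    pairs = []
    for image, question in samples:
        chosen = vlm_answer(image, question)             # original image -> positive sample
        rejected = vlm_answer(augment(image), question)  # augmented image -> negative sample
        if chosen != rejected:  # keep only cases where the augmentation changed the answer
            pairs.append(PreferencePair(image, question, chosen, rejected))
    return pairs
```

The resulting chosen/rejected pairs can then be fed into standard preference-optimization training, with no human or GPT-4 labeling involved.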

The experimental results of SeVa are striking. With only 8k unsupervised samples, it significantly improves the VLM's instruction-following ability, reduces hallucinations, and yields noticeable gains on multimodal benchmarks. More importantly, the method is simple, cost-effective, and requires no human or GPT-4 annotation.

Tests on multiple benchmarks show that SeVa has a clear advantage in aligning visual models with human preferences, excelling in particular on the GPT-4-evaluated MM-Vet and LLaVA-Bench. In addition, SeVa produces longer, more detailed responses with higher consistency and stronger robustness to perturbations across different decoding temperatures.

This research not only offers an effective solution to the alignment problem of large visual models but also opens up new possibilities for the development of AI. With SeVa now open-sourced, we can expect more researchers and developers to build on this paradigm and further advance AI technology. In this era full of endless possibilities, let us look forward to more surprises brought by AI.

Project Address: https://github.com/Kevinz-code/SeVa