Google has released PaLI-3, a compact vision-language model that achieves state-of-the-art (SOTA) performance despite its small size. By contrastively pre-training its ViT (Vision Transformer) image encoder with SigLIP rather than with classification labels, PaLI-3 explores the potential of contrastively trained vision models and reaches SOTA levels on multilingual cross-modal retrieval. Integrating natural language understanding with image recognition, PaLI-3 marks a significant step in AI innovation, and its SigLIP-based contrastive pre-training opens a new chapter for multilingual cross-modal retrieval. Although the full model has not been open-sourced yet, Google has released the multilingual and English SigLIP models, giving researchers an opportunity to experiment.
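For context, SigLIP replaces the softmax-based contrastive objective used by CLIP with a simple pairwise sigmoid loss, so each image-text pair is scored as an independent binary classification problem. The following is a minimal PyTorch sketch of that loss, not code from the PaLI-3 release; the variable names and the learnable temperature/bias scalars are illustrative assumptions based on the SigLIP paper.

```python
import torch
import torch.nn.functional as F

def siglip_loss(image_emb: torch.Tensor,
                text_emb: torch.Tensor,
                log_temp: torch.Tensor,
                bias: torch.Tensor) -> torch.Tensor:
    """Pairwise sigmoid contrastive loss in the style of SigLIP.

    image_emb, text_emb: L2-normalized embeddings of shape (n, d),
    where row i of each tensor comes from the same image-text pair.
    log_temp, bias: learnable scalars (log-temperature and bias).
    """
    # Similarity of every image with every text in the batch.
    logits = image_emb @ text_emb.T * log_temp.exp() + bias
    n = logits.size(0)
    # +1 on the diagonal (matching pairs), -1 elsewhere (negatives).
    labels = 2.0 * torch.eye(n, device=logits.device) - 1.0
    # Each pair is an independent binary problem, so no batch-wide
    # softmax normalization is needed (unlike CLIP's InfoNCE loss).
    return -F.logsigmoid(labels * logits).sum() / n
```

Because the loss decomposes over pairs rather than requiring a global softmax over the batch, it behaves more stably at small batch sizes and scales to large ones cheaply, which is part of what makes the approach attractive for training compact models like PaLI-3's image encoder.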