In today's technology landscape, CLIP (Contrastive Language-Image Pre-training) is an important multimodal foundation model. It maps visual and textual signals into a shared embedding space by training with a contrastive loss on a large-scale dataset of image-text pairs. As a retriever, CLIP supports tasks such as zero-shot classification, detection, segmentation, and image-text retrieval; meanwhile, as a feature extractor, it performs well in nearly all...
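The zero-shot classification use mentioned above boils down to comparing one image embedding against several text-prompt embeddings in the shared space. The sketch below illustrates that mechanism only; the 512-dimensional random vectors are placeholder stand-ins for real CLIP features, and `zero_shot_classify` is a hypothetical helper, not part of any CLIP library.

```python
import numpy as np

def zero_shot_classify(image_emb, text_embs):
    """Score one image embedding against several class-prompt embeddings."""
    # L2-normalize so dot products become cosine similarities,
    # mirroring how CLIP compares image and text features.
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = txt @ img                    # one similarity score per prompt
    exp = np.exp(logits - logits.max())   # numerically stable softmax
    return exp / exp.sum()                # probabilities over class prompts

# Toy stand-ins for real CLIP embeddings (512-d), purely for illustration.
rng = np.random.default_rng(0)
image_emb = rng.normal(size=512)
text_embs = rng.normal(size=(3, 512))     # e.g. prompts "cat", "dog", "car"
probs = zero_shot_classify(image_emb, text_embs)
print(probs.shape)
```

In practice the embeddings would come from CLIP's image and text encoders, and the prompt with the highest probability is taken as the predicted class.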