One of the recent advances in computer vision is the Segment Anything Model (SAM). Meta released the model in April 2023; it can automatically segment all content in an image. SAM is a promptable segmentation model built on a Vision Transformer (ViT) architecture and was trained on over 1.1 billion masks from more than 11 million images. Researchers have since proposed an improvement that uses masked image pre-training together with SAM to obtain a high-quality pre-trained ViT encoder. This approach reduces SAM's complexity while maintaining strong performance, and it has outperformed other pre-trained models on multiple tasks.
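To make the "promptable segmentation" idea concrete, here is a minimal toy sketch of the interface: the model takes an image plus a point prompt and returns a binary mask for the object under that point. This is purely illustrative (a simple flood fill over similar intensities, not SAM's actual ViT encoder and mask decoder); the function name and tolerance parameter are invented for the example.

```python
import numpy as np

def segment_from_point(image, point, tol=0.1):
    """Toy promptable segmentation: flood-fill the 4-connected region of
    pixels whose intensity is within `tol` of the prompted pixel.
    Illustrates the prompt-in, mask-out interface only; SAM itself uses
    a learned ViT image encoder and a prompt-conditioned mask decoder."""
    h, w = image.shape
    seed_val = float(image[point])
    mask = np.zeros((h, w), dtype=bool)
    stack = [point]
    while stack:
        y, x = stack.pop()
        if mask[y, x]:
            continue
        if abs(float(image[y, x]) - seed_val) > tol:
            continue
        mask[y, x] = True
        for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w and not mask[ny, nx]:
                stack.append((ny, nx))
    return mask

# A 6x6 grayscale image with a bright 3x3 square in the top-left corner.
img = np.zeros((6, 6))
img[:3, :3] = 1.0
mask = segment_from_point(img, (1, 1))  # point prompt inside the square
```

In the real model, the same prompt-driven interface also accepts boxes and rough masks, and the heavy lifting happens once per image in the encoder, so many prompts can be answered cheaply against the cached embedding.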