Researchers introduce FACTORCL, a novel approach that addresses the limitations of contrastive learning in multimodal settings, where task-relevant information is not always shared across modalities. FACTORCL factorizes task-relevant information into shared and modality-unique components, then learns each component by maximizing lower bounds on mutual information (to capture task-relevant information) and minimizing upper bounds on mutual information (to remove task-irrelevant information). Because no explicit labels are available, it defines task relevance through multimodal augmentations in a self-supervised setting. Experiments show that FACTORCL achieves new state-of-the-art performance on multiple multimodal datasets.
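The lower-bound side of such objectives is commonly realized with an InfoNCE-style contrastive estimator, which lower-bounds the mutual information between paired representations. The sketch below is a minimal, generic illustration of that estimator, not the paper's implementation; the function name, shapes, and temperature value are assumptions for demonstration.

```python
import numpy as np

def infonce_lower_bound(z1, z2, temperature=0.1):
    """InfoNCE estimate: a lower bound on mutual information I(z1; z2).

    z1, z2: (N, d) arrays of paired embeddings from two modalities.
    (Hypothetical sketch; names and defaults are illustrative.)
    """
    # Normalize rows so the dot product is cosine similarity.
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    sim = z1 @ z2.T / temperature  # (N, N): row i vs. all candidates
    # Row-wise log-softmax; diagonal entries are the true positive pairs.
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    n = len(z1)
    # Mean diagonal log-prob plus log N yields the InfoNCE bound (in nats).
    return np.log(n) + np.mean(np.diag(log_prob))

rng = np.random.default_rng(0)
# Correlated pair: z2 is a noisy copy of z1, so the bound should be positive.
z1 = rng.standard_normal((128, 16))
z2 = z1 + 0.1 * rng.standard_normal((128, 16))
print(infonce_lower_bound(z1, z2))
```

Maximizing this bound pulls paired (augmented) views together, which captures shared task-relevant information; the complementary upper-bound terms, minimized during training, serve to discard information not supported by the augmentations.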