Researchers from the University of Hong Kong (HKU) and Tencent have proposed DiffMM, a new paradigm for multimodal recommendation systems aimed at improving the accuracy of short-video recommendations. The system builds a graph over users and videos and applies graph diffusion and contrastive learning to better capture the relationships between users and videos, yielding more accurate recommendations.

The DiffMM method consists of three main components: a multimodal graph diffusion model, multimodal graph aggregation, and cross-modal contrastive enhancement. The multimodal graph diffusion model uses modal-aware denoising diffusion probabilistic models to unify user-item collaborative signals with multimodal information, addressing the negative impact of noisy signals in multimodal recommendation. It generates and refines modal-aware user-item graphs through a probabilistic graph diffusion paradigm together with modal-aware graph diffusion optimization.
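As background, the equations below sketch the standard denoising diffusion formulation that a modal-aware graph diffusion model of this kind builds on: a user's interactions are progressively corrupted with Gaussian noise, and a learned denoiser reconstructs them conditioned on modality features. The notation here ($z_0$ as a user's row of the user-item interaction matrix, $\beta_t$ as the noise schedule, $m$ as item modality features) is an illustrative choice, not reproduced from the paper.

$$
q(z_t \mid z_{t-1}) = \mathcal{N}\!\big(z_t;\ \sqrt{1-\beta_t}\,z_{t-1},\ \beta_t\mathbf{I}\big),
\qquad
q(z_t \mid z_0) = \mathcal{N}\!\big(z_t;\ \sqrt{\bar{\alpha}_t}\,z_0,\ (1-\bar{\alpha}_t)\mathbf{I}\big)
$$

$$
p_\theta(z_{t-1} \mid z_t) = \mathcal{N}\!\big(z_{t-1};\ \mu_\theta(z_t, t, m),\ \Sigma_\theta(z_t, t)\big),
\qquad
\bar{\alpha}_t = \prod_{s=1}^{t} (1-\beta_s)
$$

The forward process $q$ gradually noises the interaction vector over $t$ steps, while the reverse process $p_\theta$ denoises it; conditioning the denoiser on modality features $m$ is what makes the generated user-item graph modal-aware.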


For cross-modal contrastive enhancement, DiffMM constructs modal-aware contrastive views and applies contrastive augmentation to capture the consistency of user interaction patterns across different item modalities, further improving recommendation performance, as sketched below.
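As a concrete illustration of this idea, the snippet below gives a minimal InfoNCE-style contrastive objective of the kind typically used to align two modal-aware user views (for example, a visual-modality view and a collaborative view). The function name, tensor shapes, and temperature value are illustrative assumptions rather than the paper's code.

```python
import torch
import torch.nn.functional as F

def modal_contrastive_loss(view_a: torch.Tensor,
                           view_b: torch.Tensor,
                           temperature: float = 0.2) -> torch.Tensor:
    """InfoNCE-style loss between two modality-specific user views.

    view_a, view_b: [num_users, dim] user embeddings from two modal-aware
    views. The same user across the two views forms a positive pair; all
    other users in the batch serve as in-batch negatives.
    """
    a = F.normalize(view_a, dim=-1)
    b = F.normalize(view_b, dim=-1)
    logits = a @ b.t() / temperature                    # pairwise cosine similarities
    labels = torch.arange(a.size(0), device=a.device)   # positives lie on the diagonal
    return F.cross_entropy(logits, labels)

# Toy usage: 8 users, 64-dimensional embeddings from two views.
loss = modal_contrastive_loss(torch.randn(8, 64), torch.randn(8, 64))
print(loss.item())
```

Pulling the two views of the same user together (and pushing different users apart) is what encourages consistent user preference signals across modalities.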

Paper: https://arxiv.org/abs/2406.1178

Key Points:

⭐ HKU and Tencent introduce the new paradigm DiffMM to enhance the performance of multimodal recommendation systems.

⭐ DiffMM utilizes graph diffusion and contrastive learning techniques to better understand the relationship between users and videos.

⭐ Cross-modal contrastive enhancement methods improve the accuracy and performance of the recommendation system.