JOURNAL ARTICLE

Interaction-Assisted Multi-Modal Representation Learning for Recommendation

Abstract

Personalized recommender systems have attracted significant attention from both industry and academia. Recent studies have shed light on incorporating multi-modal side information into recommender systems to further boost performance. Meanwhile, Transformer-based multi-modal representation learning has brought substantial gains on downstream visual and textual tasks. However, these self-supervised pre-training methods are not tailored for recommendation and may lead to suboptimal representations. To this end, we propose Interaction-Assisted Multi-Modal Representation Learning for Recommendation (IRL) to inject user interaction information into item multi-modal representation learning. Specifically, we extract item graph embeddings from user-item interactions and then use them to formulate a novel triplet IRL training objective, which serves as a behavior-aware pre-training task for the representation learning model. Extensive experiments on several real-world datasets demonstrate the effectiveness of IRL.
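The abstract does not spell out the exact form of the triplet objective, so the following is only a minimal sketch of what such a behavior-aware triplet loss might look like, written in PyTorch. The function name irl_triplet_loss, the use of cosine distance, and the margin value are illustrative assumptions rather than the paper's formulation: the anchor is an item's multi-modal representation, the positive is that item's interaction-derived graph embedding, and the negative is another item's graph embedding.

    import torch
    import torch.nn.functional as F

    def irl_triplet_loss(mm_repr, pos_graph_emb, neg_graph_emb, margin=0.2):
        # Anchor: multi-modal item representation from the pre-trained encoder.
        # Positive: the same item's graph embedding learned from user-item interactions.
        # Negative: a sampled (e.g. in-batch) other item's graph embedding.
        anchor = F.normalize(mm_repr, dim=-1)
        pos = F.normalize(pos_graph_emb, dim=-1)
        neg = F.normalize(neg_graph_emb, dim=-1)
        pos_dist = 1.0 - (anchor * pos).sum(dim=-1)   # cosine distance to positive
        neg_dist = 1.0 - (anchor * neg).sum(dim=-1)   # cosine distance to negative
        # Hinge: the anchor should be closer to its own graph embedding than to
        # the negative item's by at least `margin`.
        return F.relu(pos_dist - neg_dist + margin).mean()

In a full pipeline, mm_repr would come from the Transformer encoder over an item's image and text features, pos_graph_emb from a graph embedding method run on the user-item interaction graph, and the loss would be added as a pre-training task alongside the usual self-supervised objectives.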

Keywords:
Recommender systems; Multi-modal representation learning; Feature learning; Graph embedding; Transformer; Machine learning; Information retrieval


Topics

Recommender Systems and Techniques
Multimodal Machine Learning Applications
Advanced Image and Video Retrieval Techniques