Liangyi Kang, Jie Liu, Dan Ye, Zhiyang Zhou
Multimodal sarcasm is often used to express strong emotions online through the discrepancy between the literal and figurative scenes across modalities. Current research retrofits transformer-based pretrained language models to integrate text and images for sarcasm detection. However, these methods struggle to distinguish subtle semantic and emotional differences between the image and text within the same instance. To address this issue, this paper proposes a new context-aware dual attention network that collaboratively performs textual and visual attention using a shared memory module. This approach enables us to reason about the interconnected portions involving sarcasm in both text and image. Additionally, we use implicit context derived from a multimodal commonsense graph to establish a holistic perspective that encompasses semantics and emotions across modalities. Finally, a multi-view cross-modal matching technique is employed to effectively identify contradictions. We evaluate our method on the widely used HFM dataset and achieve a 1.01% improvement in F1-score. Extensive experiments demonstrate the effectiveness of the proposed method.
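The core coupling idea in the abstract — textual and visual attention performed over a shared memory — can be sketched minimally as follows. Everything here (function names, dimensions, the scaled dot-product form) is an illustrative assumption, not the paper's implementation; it only shows how a shared key/value memory ties the two attentions together.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dual_attention(text_feats, image_feats, memory, d=64):
    """Sketch (assumed, not the paper's code): both modalities attend
    to the SAME memory slots, so their attention outputs are coupled.
    text_feats: (Lt, d), image_feats: (Li, d), memory: (M, d)."""
    t_att = softmax(text_feats  @ memory.T / np.sqrt(d)) @ memory
    v_att = softmax(image_feats @ memory.T / np.sqrt(d)) @ memory
    return t_att, v_att

rng = np.random.default_rng(0)
t, v = dual_attention(rng.normal(size=(5, 64)),
                      rng.normal(size=(7, 64)),
                      rng.normal(size=(8, 64)))
print(t.shape, v.shape)  # (5, 64) (7, 64)
```

Because both branches read from identical keys and values, gradients through the memory mix textual and visual evidence, which is the intuition behind reasoning jointly over the sarcasm-relevant portions of each modality.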
Xinkai Lu, Ying Qian, Yan Yang, Wenrao Pang
Yangyang Li, Yuelin Li, Shihuai Zhang, Guangyuan Liu, Yanqiao Chen, Ronghua Shang, Licheng Jiao
Wangqun Chen, Fuqiang Lin, Guowei Li, Xuan Zhang, Bo Liu
Liujing Song, Zefang Zhao, Yuxiang Ma, Yuyang Liu, Jun Li
Yujun Wu, Chen Wang, Meixuan Chen, Tongguan Wang, Ying Sha