Multi-modal sarcasm detection (MSD) aims to identify sarcastic sentiment conveyed through textual and visual modalities. The key challenge lies in capturing the underlying incongruity across modalities. However, many existing studies rely on shallow feature fusion strategies, resulting in limited interaction between textual and visual features. Moreover, they often overlook the localized inconsistencies characteristic of sarcasm, leading to insufficient representation of fine-grained sarcastic cues. To address these challenges, we propose a hierarchical incongruity-aware fusion network with semantic adaptive refinement (HIAF). Specifically, we first introduce a hierarchical fusion module that progressively captures multi-level incongruity through iterative transformer layers, guided by a cross-modal locality-constrained attention mechanism. Second, we design a semantic adaptive refinement module that dynamically integrates unimodal and cross-modal features based on their contextual contributions. Experiments demonstrate that HIAF consistently outperforms strong baselines, validating its capability to capture multi-modal incongruity.
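To make the two mechanisms named in the abstract concrete, the following is a minimal sketch of iterative cross-modal fusion with a gated refinement step: text tokens attend to image-region features, and a sigmoid gate then mixes the unimodal and cross-modal representations. Everything here is an illustrative assumption, not the paper's implementation; the locality constraint on attention, the full transformer sublayers, and per-layer parameters are omitted for brevity (the gate weights `Wg` are shared across layers in this sketch).

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_fusion(text, image, Wg, n_layers=3):
    """Sketch of hierarchical cross-modal fusion with adaptive refinement.

    text  : (T, d) text-token features
    image : (R, d) image-region features
    Wg    : (2d, d) gate projection (illustrative, shared across layers)
    """
    d = text.shape[-1]
    for _ in range(n_layers):
        # Cross-modal attention: each text token queries the image regions.
        scores = text @ image.T / np.sqrt(d)          # (T, R) logits
        cross = softmax(scores) @ image               # (T, d) cross-modal feature
        # Adaptive refinement: a sigmoid gate weighs the unimodal text
        # feature against the cross-modal one, per token and dimension.
        g = 1.0 / (1.0 + np.exp(-(np.concatenate([text, cross], axis=-1) @ Wg)))
        text = g * text + (1.0 - g) * cross
    return text

rng = np.random.default_rng(0)
text = rng.normal(size=(5, 8))    # 5 text tokens, dim 8
image = rng.normal(size=(12, 8))  # 12 image regions, dim 8
Wg = rng.normal(size=(16, 8))
fused = cross_modal_fusion(text, image, Wg)
print(fused.shape)  # (5, 8): one refined feature per text token
```

Stacking the layers iteratively is what lets later layers re-attend over already-fused features, which is the intuition behind capturing incongruity at multiple levels rather than in a single shallow fusion pass.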
Jiecheng Zhang, C. L. Philip Chen, Shuzhen Li, Tong Zhang
Yujun Wu, Chen Wang, Meixuan Chen, Tongguan Wang, Ying Sha
Haochen Zhao, Yongxiu Xu, Xinkui Lin, Jiarui Lu, Hongbo Xu, Yubin Wang
Yang Qiao, Liqiang Jing, Xuemeng Song, Xiaolin Chen, Lei Zhu, Liqiang Nie
Yuzhen Cai, Huiyu Cai, Xiaojun Wan