Liang Zhang, Yueqiu Jiang, Wei Yang, B. Liu
Infrared-visible image fusion (IVIF) is an important part of multimodal image fusion (MMF). The goal is to combine complementary information from infrared and visible sources into robust, richly detailed fused images that support better scene understanding. However, most existing fusion methods based on convolutional neural networks extract cross-modal local features without fully exploiting long-range contextual information, which degrades performance, especially in complex scenes. To address this issue, we propose TCTFusion, a three-branch cross-modal transformer for visible-infrared image fusion. The model consists of a shallow feature module (SFM), a frequency decomposition module (FDM), and an information aggregation module (IAM). The three branches receive the infrared, visible, and concatenated images, respectively. The SFM extracts cross-modal shallow features using residual connections with shared weights. The FDM then captures low-frequency global information across modalities and high-frequency local information within each modality. The IAM aggregates complementary cross-modal features, enabling full interaction between the modalities, and the decoder generates the fused image. Additionally, we introduce a pixel loss and a structural loss that significantly improve the model's overall performance. Extensive experiments on mainstream datasets demonstrate that TCTFusion outperforms other state-of-the-art methods in both qualitative and quantitative evaluations.
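To make the three-branch data flow described in the abstract concrete, the following is a minimal, hypothetical PyTorch sketch of the pipeline (SFM with shared weights, an FDM with low- and high-frequency paths, an IAM, and a decoder). All layer choices, channel sizes, and the use of plain convolutions in place of the transformer blocks are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class TCTFusionSketch(nn.Module):
    """Illustrative skeleton of the three-branch layout; not the published model."""
    def __init__(self, ch=16):
        super().__init__()
        # SFM: shared-weight shallow extractor with a residual connection,
        # applied to both the infrared and visible branches.
        self.sfm_in = nn.Conv2d(1, ch, 3, padding=1)
        self.sfm_res = nn.Conv2d(ch, ch, 3, padding=1)
        # Third branch: receives the channel-concatenated image pair.
        self.cat_in = nn.Conv2d(2, ch, 3, padding=1)
        # FDM placeholders: a low-frequency (cross-modal, global) path and a
        # high-frequency (per-modality, local) path; the real FDM uses transformers.
        self.fdm_low = nn.Conv2d(ch, ch, 3, padding=1)
        self.fdm_high = nn.Conv2d(ch, ch, 3, padding=1)
        # IAM aggregates the complementary features; the decoder produces the fused image.
        self.iam = nn.Conv2d(3 * ch, ch, 1)
        self.decoder = nn.Conv2d(ch, 1, 3, padding=1)

    def forward(self, ir, vis):
        # Shallow features with shared weights and residual connections.
        f_ir = self.sfm_in(ir)
        f_ir = f_ir + self.sfm_res(f_ir)
        f_vis = self.sfm_in(vis)
        f_vis = f_vis + self.sfm_res(f_vis)
        # Concatenated-image branch feeds the low-frequency global path.
        f_cat = self.cat_in(torch.cat([ir, vis], dim=1))
        low = self.fdm_low(f_cat)
        # High-frequency local information within each modality.
        high_ir = self.fdm_high(f_ir)
        high_vis = self.fdm_high(f_vis)
        # Aggregate cross-modal features and decode to a single-channel fused image.
        fused = self.iam(torch.cat([low, high_ir, high_vis], dim=1))
        return torch.sigmoid(self.decoder(fused))

# Usage: single-channel IR and visible inputs of matching spatial size.
ir = torch.rand(1, 1, 128, 128)
vis = torch.rand(1, 1, 128, 128)
print(TCTFusionSketch()(ir, vis).shape)  # torch.Size([1, 1, 128, 128])
```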