Multimodal image pairs (e.g. visible and thermal images) provide mutually complementary pixel information and can enhance the robustness and reliability of object detection in applications such as autonomous driving and video surveillance. To exploit the effective information of both modalities, this paper proposes a multimodal feature fusion network based on YOLOv5. A multimodal feature fusion adaptive weighting module is designed to perform feature extraction and fusion at three scales in the network, making the best use of multimodal features. Experiments show that our multimodal object detection network (MFF-YOLOv5) outperforms current state-of-the-art (SOTA) methods on two public datasets.
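The abstract does not specify how the adaptive weighting is computed, so the following is only a minimal NumPy sketch of one plausible scheme: per-modality scores from global average pooling, a softmax over the two scores, and a weighted sum of the feature maps, applied independently at three scales as in YOLOv5's three detection heads. The function name, weighting rule, and scale sizes are illustrative assumptions, not the paper's actual module.

```python
import numpy as np

def adaptive_weighted_fusion(feat_vis, feat_thermal):
    """Hypothetical sketch: fuse visible and thermal feature maps with
    adaptive per-modality weights (global average pooling + softmax is
    one plausible weighting scheme; the paper's exact rule is unknown)."""
    # Global average pooling -> one scalar score per modality.
    scores = np.array([feat_vis.mean(), feat_thermal.mean()])
    # Softmax over the two scores yields fusion weights that sum to 1.
    e = np.exp(scores - scores.max())
    w_vis, w_thermal = e / e.sum()
    # Weighted sum keeps the fused map at the same spatial size.
    return w_vis * feat_vis + w_thermal * feat_thermal

# Applied at three scales, mirroring YOLOv5's three detection scales
# (spatial sizes here are illustrative for a 640x640 input).
rng = np.random.default_rng(0)
fused = [adaptive_weighted_fusion(rng.random(s), rng.random(s))
         for s in [(80, 80), (40, 40), (20, 20)]]
```

Because the weights sum to 1, the fused map stays in the convex hull of the two inputs; a learned variant would replace the pooled scores with a small network trained end to end.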
Qiang Zhang, Tonglin Xiao, Nianchang Huang, Dingwen Zhang, Jungong Han