JOURNAL ARTICLE

Bidirectional Feature Fusion and Enhanced Alignment Based Multimodal Semantic Segmentation for Remote Sensing Images

Qianqian Liu, Xili Wang

Year: 2024   Journal: Remote Sensing   Vol: 16 (13)   Pages: 2289-2289   Publisher: Multidisciplinary Digital Publishing Institute

Abstract

Image–text multimodal deep semantic segmentation leverages the fusion and alignment of image and text information, providing more prior knowledge for segmentation tasks, and is worth exploring for remote sensing images. In this paper, we propose a bidirectional feature fusion and enhanced alignment-based multimodal semantic segmentation model (BEMSeg) for remote sensing images. First, BEMSeg extracts image and text features with image and text encoders, respectively; the features are then fused and aligned to obtain a complementary multimodal feature representation. Second, a bidirectional feature fusion module is proposed that employs self-attention and cross-attention to adaptively fuse the features of the two modalities, reducing the differences between them. For multimodal feature alignment, the similarity between image pixel features and text features is computed to obtain a pixel–text score map. Third, we propose category-based pixel-level contrastive learning on the score map, which reduces the differences among pixels of the same category and increases the differences among pixels of different categories, thereby enhancing the alignment. In addition, a positive- and negative-sample selection strategy based on different images is explored during contrastive learning: averaging the pixel features of each category across different training images to form positive and negative samples compares global pixel information while limiting the number of samples and reducing the computational cost. Finally, the fused image features and the aligned pixel–text score map are concatenated and fed into the decoder to predict the segmentation results.
Experimental results on the ISPRS Potsdam, Vaihingen, and LoveDA datasets demonstrate that BEMSeg outperforms the comparison methods on the Potsdam and Vaihingen datasets, with mIoU improvements ranging from 0.57% to 5.59% and from 0.48% to 6.15%, respectively; compared with Transformer-based methods, BEMSeg also performs competitively on the LoveDA dataset, with mIoU improvements ranging from 0.37% to 7.14%.
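The two core ideas in the abstract — a pixel–text score map computed as the similarity between pixel and text features, and per-category prototypes averaged across training images to serve as contrastive positives and negatives — can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation; cosine similarity is assumed for the score map, and the function names are hypothetical.

```python
import numpy as np

def pixel_text_score_map(pixel_feats, text_feats):
    """Cosine similarity between every pixel feature and every class text feature.

    pixel_feats: (H, W, D) image pixel embeddings
    text_feats:  (C, D) per-class text embeddings
    returns:     (H, W, C) pixel-text score map
    """
    p = pixel_feats / np.linalg.norm(pixel_feats, axis=-1, keepdims=True)
    t = text_feats / np.linalg.norm(text_feats, axis=-1, keepdims=True)
    return p @ t.T  # (H, W, D) @ (D, C) -> (H, W, C)

def category_prototypes(pixel_feats, labels, num_classes):
    """Average pixel features per category over pixels pooled from a batch of images.

    A class's prototype acts as the positive sample for pixels of that class and
    as a negative sample for all other classes, so only C vectors are kept
    instead of all individual pixels.

    pixel_feats: (N, D) pixel embeddings flattened across images
    labels:      (N,) integer class label per pixel
    returns:     (C, D) one prototype vector per class
    """
    protos = np.zeros((num_classes, pixel_feats.shape[-1]))
    for c in range(num_classes):
        mask = labels == c
        if mask.any():
            protos[c] = pixel_feats[mask].mean(axis=0)
    return protos
```

Because the prototypes are averages over pixels from different images, each contrastive comparison reflects global, cross-image statistics of a category while the number of samples stays fixed at the number of classes.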

Keywords:
Computer science; Artificial intelligence; Segmentation; Computer vision; Feature (linguistics); Fusion; Remote sensing; Pattern recognition (psychology); Geology

Metrics

Cited By: 6
FWCI (Field Weighted Citation Impact): 3.69
Refs: 44
Citation Normalized Percentile: 0.89


Topics

Remote-Sensing Image Classification
Physical Sciences →  Engineering →  Media Technology
Remote Sensing and Land Use
Physical Sciences →  Earth and Planetary Sciences →  Atmospheric Science
Advanced Image Fusion Techniques
Physical Sciences →  Engineering →  Media Technology

Related Documents

JOURNAL ARTICLE

Remote sensing semantic segmentation based on multimodal feature alignment and fusion

Boshen Chang, Timo Balz

Journal: The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences   Year: 2025   Vol: XLVIII-G-2025   Pages: 1785-1790