JOURNAL ARTICLE

CNN–Transformer Hybrid Architecture for Underwater Sonar Image Segmentation

Juan Lei, Huigang Wang, Zhiyu Lei, Jiayuan Li, Shaowei Rong

Year: 2025   Journal: Remote Sensing   Vol: 17 (4)   Pages: 707-707   Publisher: Multidisciplinary Digital Publishing Institute

Abstract

The salient object detection (SOD) of forward-looking sonar images plays a crucial role in underwater detection and rescue tasks. However, existing SOD algorithms struggle to extract salient features and spatial structure information from images with scarce semantic information, uneven intensity distribution, and high noise. Convolutional neural networks (CNNs) have strong local feature extraction capabilities, but they are constrained by their receptive field and cannot readily model long-range dependencies. Transformers, with their self-attention mechanism, can model the global features of a target, but they tend to lose a significant amount of local detail. Mamba models long-range dependencies in long sequence inputs through a selection mechanism, offering a novel approach to capturing long-range correlations between pixels. However, because the saliency of image pixels does not exhibit sequential dependencies, Mamba’s ability to capture global contextual information during the forward pass is somewhat limited. Inspired by multimodal feature fusion learning, we propose a hybrid CNN–Transformer–Mamba architecture, termed FLSSNet. FLSSNet is built upon a CNN and Transformer backbone network and integrates four core submodules, each addressing a distinct technical challenge: (1) The asymmetric dual encoder–decoder (ADED) simultaneously extracts features from different modalities and systematically models both local contextual information and global spatial structure. (2) The Transformer feature converter (TFC) module optimizes the multimodal feature fusion process through feature transformation and channel compression. (3) The long-range correlation attention (LRCA) module enhances the CNN’s ability to model long-range dependencies through the collaborative use of convolutional kernels, selective sequential scanning, and attention mechanisms, while effectively suppressing noise interference. (4) The recursive contour refinement (RCR) module refines edge contour information through a layer-by-layer recursive mechanism, achieving greater precision in boundary details. The experimental results show that FLSSNet is highly competitive with 25 state-of-the-art SOD methods, achieving MAE and Eξ values of 0.04 and 0.973, respectively.
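For readers unfamiliar with the MAE figure quoted above, the following is a minimal sketch of how mean absolute error is commonly computed for a saliency map in the SOD literature: the per-pixel absolute difference between the predicted map and the binary ground-truth mask, averaged over the image. The function name `mean_absolute_error`, the use of NumPy, and the toy arrays are assumptions made for illustration; the abstract does not describe the authors' evaluation code, and the values below are not taken from the paper.

```python
import numpy as np


def mean_absolute_error(pred, gt):
    """Average per-pixel |pred - gt| for two same-shaped maps scaled to [0, 1]."""
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    if pred.shape != gt.shape:
        raise ValueError("prediction and ground truth must have the same shape")
    return float(np.mean(np.abs(pred - gt)))


# Toy 2x2 example with made-up values (hypothetical, not sonar data).
pred = np.array([[0.9, 0.1],
                 [0.2, 0.8]])
gt = np.array([[1.0, 0.0],
               [0.0, 1.0]])
score = mean_absolute_error(pred, gt)
```

Lower values are better under this definition, so a dataset-level MAE would typically be the mean of this score over all test images.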

Keywords:
Underwater, Computer science, Sonar, Transformer, Artificial intelligence, Architecture, Computer vision, Marine engineering, Geology, Oceanography, Geography, Electrical engineering, Engineering

Metrics

Cited By: 8
FWCI (Field Weighted Citation Impact): 38.19
Refs: 82
Citation Normalized Percentile: 0.99
Is in top 1%
Is in top 10%

Topics

Image Enhancement Techniques
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Advanced Neural Network Applications
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Underwater Acoustics Research
Physical Sciences →  Earth and Planetary Sciences →  Oceanography