JOURNAL ARTICLE

An Efficient Multi-Scale Transformer Network with Fusion-Attention for Point Cloud–Semantic Segmentation in Urban Environments

Bo Guo, Naftaly Wambugu, Rui‐Sheng Wang, Zhihai Huang, Xiaolong Deng, Alex Hay‐Man Ng, Shengjun Tang, Wenchao Guo

Year: 2025
Journal: Photogrammetric Engineering & Remote Sensing
Publisher: American Society for Photogrammetry and Remote Sensing

Abstract

This article investigates point-cloud segmentation, which is crucial but challenging for scene interpretation, especially for three-dimensional (3D) urban scenes at city scale. Compared with previous approaches, the proposed method gains a competitive advantage by leveraging an efficient multi-scale transformer, which complements convolution in a hierarchical network and improves representation ability with global contextual information. More specifically, to address the quadratic complexity that hinders large-scale point-cloud processing, a lightweight attention module with linear complexity is introduced: channel attention and spatial attention are applied sequentially in place of quadratic dot-product attention. Building on this lightweight attention module, a transformer-based encoder aggregates the feature sequence within a scale into a learnable token. To integrate information from multiple scales efficiently and without inductive bias, fusion attention is proposed; it uses only the learned tokens to calculate the query, so the complexity of the attention map is bounded to be linear. The fusion-attention module is embedded in the multi-scale transformer to further expand the receptive field. The proposed method extends previous hierarchical networks for point-cloud processing by combining the detailed information extracted via convolution with the global contextual information extracted by the multi-scale transformer, which greatly improves the representational ability of features for accurate point-cloud segmentation. Two benchmark datasets (Dayton Annotated LiDAR Earth Scan [DALES] and Toronto-3D) were used to assess the proposed method. The method achieved an improvement of approximately 1.5% in mean intersection over union for semantic segmentation on the DALES dataset compared with state-of-the-art methods. An ablation study showed that the consistent improvements were mainly attributable to the wide applicability of the efficient attention mechanism for enlarging the receptive field.
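
The linear bound on the attention map can be illustrated with a minimal sketch. It assumes a generic scaled dot-product attention in which the queries come from a small, fixed set of k learned tokens while the keys and values come from the n point features, so the attention map has shape (k, n) rather than (n, n). This is not the authors' implementation: the function name `token_query_attention`, the projection matrices, and all sizes are hypothetical and chosen only for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def token_query_attention(tokens, feats, w_q, w_k, w_v):
    """Attention whose queries come only from a few tokens (hypothetical helper).

    tokens: (k, d) token embeddings, stand-ins for learned tokens
    feats:  (n, d) per-point features
    Returns the updated tokens (k, d) and the attention map (k, n).
    """
    q = tokens @ w_q                      # (k, d) queries from tokens only
    key = feats @ w_k                     # (n, d) keys from point features
    val = feats @ w_v                     # (n, d) values from point features
    scores = q @ key.T / np.sqrt(q.shape[-1])
    attn = softmax(scores, axis=-1)       # (k, n): linear in n for fixed k
    return attn @ val, attn


# Illustrative sizes (assumptions, not values from the article).
rng = np.random.default_rng(0)
n, k, d = 4096, 4, 32
feats = rng.standard_normal((n, d))
tokens = rng.standard_normal((k, d))
w_q, w_k, w_v = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))

out, attn = token_query_attention(tokens, feats, w_q, w_k, w_v)
print(out.shape, attn.shape)
```

With k fixed, the attention map grows in proportion to n, whereas full self-attention over the same features would need an (n, n) map.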
