An Efficient Multi-Scale Transformer Network with Fusion-Attention for Point Cloud–Semantic Segmentation in Urban Environments

Bo Guo; Naftaly Wambugu; Rui‐Sheng Wang; Zhihai Huang; Xiaolong Deng; Alex Hay‐Man Ng; Shengjun Tang; Wenchao Guo

doi:10.14358/pers.25-00016r3

JOURNAL ARTICLE

Year: 2025
Journal: Photogrammetric Engineering & Remote Sensing
Publisher: American Society for Photogrammetry and Remote Sensing

Abstract

This article investigates point-cloud segmentation, which is crucial but challenging for scene interpretation, especially for three-dimensional (3D) urban scenes at city scale. Compared with previous approaches, the proposed method gains a competitive advantage from an efficient multi-scale transformer, which complements convolution in a hierarchical network and improves representational ability by supplying global contextual information. Specifically, to address the quadratic complexity that hinders large-scale point-cloud processing, a lightweight attention module with linear complexity is introduced; it applies channel attention and spatial attention sequentially in place of quadratic dot-product attention. Building on this module, a transformer-based encoder aggregates the feature sequence within a scale into a learnable token. To integrate information across multiple scales efficiently and without inductive bias, fusion attention is proposed: only the learned tokens are used to compute the query, so the complexity of the attention map is bounded to be linear. The fusion-attention module is embedded in the multi-scale transformer to further expand the receptive field. The proposed method thus extends previous hierarchical networks for point-cloud processing by combining the detailed information extracted via convolution with the global contextual information extracted by the multi-scale transformer, which greatly improves the representational ability of the features for accurate point-cloud segmentation. The method was assessed on two benchmark datasets, Dayton Annotated LiDAR Earth Scan (DALES) and Toronto-3D. On DALES, it improved mean intersection over union for semantic segmentation by approximately 1.5% over state-of-the-art methods. An ablation study showed that the consistent improvements were mainly attributable to the wide applicability of the efficient attention mechanism for enlarging the receptive field.
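The abstract's statement that computing the query only from learned tokens keeps the attention map linear can be illustrated with a generic cross-attention sketch. This is not the authors' implementation: the function name `token_query_attention`, the sizes `k`, `n`, and `d`, and the random inputs are all hypothetical, and the learned projections, multi-head structure, and channel/spatial attention module are omitted. It only shows that with `k` token queries attending over `n` point features, the attention map has `k × n` entries rather than `n × n`, so its size grows linearly in `n` when `k` is fixed.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def token_query_attention(tokens, points):
    """Cross-attention with queries from tokens and keys/values from points.

    tokens: k vectors of length d (hypothetical learned scale tokens)
    points: n vectors of length d (hypothetical per-point features)
    Returns the k x n attention map and the k updated token vectors.
    Learned projections are omitted (identity) to keep the sketch short.
    """
    d = len(tokens[0])
    scale = 1.0 / math.sqrt(d)
    attn = []
    for q in tokens:
        # One row per token: n scaled dot-product scores against all points.
        scores = [scale * sum(a * b for a, b in zip(q, p)) for p in points]
        attn.append(softmax(scores))
    # Each updated token is the attention-weighted average of point features.
    updated = [
        [sum(w * p[j] for w, p in zip(row, points)) for j in range(d)]
        for row in attn
    ]
    return attn, updated


# Assumed toy sizes: 4 tokens, 1000 points, 8 feature channels.
random.seed(0)
k, n, d = 4, 1000, 8
tokens = [[random.gauss(0, 1) for _ in range(d)] for _ in range(k)]
points = [[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
attn, updated = token_query_attention(tokens, points)
print(len(attn), len(attn[0]))  # rows = k tokens, columns = n points
```

Doubling `n` in this sketch doubles the number of attention entries, whereas full self-attention over the points would quadruple it; that contrast is the general reason token-based queries are attractive for city-scale point clouds.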
