Transformer-based Semantic Segmentation for Large-Scale Building Footprint Extraction from Very-High Resolution Satellite Images

Mohamed Barakat A. Gibril; Rami Al‐Ruzouq; Abdallah Shanableh; Ratiranjan Jena; Jan Bolcek; Helmi Zulhaidi Mohd Shafri; Omid Ghorbanzadeh

doi:10.60692/jfef8-7hs07

JOURNAL ARTICLE

Year: 2024; Journal: Greater South Information System


Abstract

Extracting building footprints from extensive very-high spatial resolution (VHSR) remote sensing data is crucial for diverse applications, including surveying, urban studies, population estimation, identification of informal settlements, and disaster management. Convolutional neural networks (CNNs) are commonly used for this purpose, but because convolution operations are local, they are limited in capturing long-range relationships and contextual detail. This study introduces the masked-attention mask transformer (Mask2Former) for building footprint extraction from large-scale satellite imagery. A hierarchical vision transformer with shifted windows (Swin Transformer) serves as the backbone network to capture large-scale semantic information and extract multiscale features. An extensive analysis compares the efficiency and generalizability of Mask2Former with four CNN models (PSPNet, DeepLabV3+, UPerNet-ConvNeXt, and SegNeXt) and two transformer-based models (UPerNet-Swin and SegFormer) of varying complexity. The transformer-based models outperform their CNN-based counterparts and generalize well across diverse testing areas with varying building structures, heights, and sizes. Specifically, Mask2Former with the Swin Transformer backbone achieves a mean intersection over union (mIoU) between 88% and 93% and a mean F-score (mF-score) between 91% and 96.35% across various urban landscapes.
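The two scores reported in the abstract can be illustrated with a minimal sketch. It assumes a two-class labeling (0 for background, 1 for building) and assumes that "mean" refers to averaging the per-class scores; the abstract does not state the exact evaluation protocol, so the function names and the toy labels below are hypothetical and serve only to show the arithmetic.

```python
from typing import Sequence, Tuple


def confusion_counts(pred: Sequence[int], truth: Sequence[int], cls: int) -> Tuple[int, int, int]:
    """Count true positives, false positives, and false negatives for one class."""
    tp = fp = fn = 0
    for p, t in zip(pred, truth):
        if p == cls and t == cls:
            tp += 1
        elif p == cls:
            fp += 1
        elif t == cls:
            fn += 1
    return tp, fp, fn


def mean_iou_and_fscore(pred: Sequence[int], truth: Sequence[int],
                        classes: Sequence[int] = (0, 1)) -> Tuple[float, float]:
    """Average per-class IoU and F-score (assumed averaging scheme, not the paper's code)."""
    ious, fscores = [], []
    for cls in classes:
        tp, fp, fn = confusion_counts(pred, truth, cls)
        if tp + fp + fn == 0:
            continue  # class absent from both prediction and ground truth
        ious.append(tp / (tp + fp + fn))              # IoU = TP / (TP + FP + FN)
        fscores.append(2 * tp / (2 * tp + fp + fn))   # F1  = 2TP / (2TP + FP + FN)
    if not ious:
        raise ValueError("none of the requested classes occur in the inputs")
    return sum(ious) / len(ious), sum(fscores) / len(fscores)


# Toy flattened masks (hypothetical data): 0 = background, 1 = building.
pred = [0, 0, 1, 1, 1, 0, 1, 0]
truth = [0, 0, 1, 1, 0, 0, 1, 1]
miou, mf = mean_iou_and_fscore(pred, truth)
print(f"mIoU = {miou:.2f}, mF-score = {mf:.2f}")
```

For a single class the two scores are related by F = 2·IoU / (1 + IoU), so the F-score is never lower than the IoU, which is consistent with the reported mF-score range sitting above the mIoU range.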

Keywords:

Convolutional neural network; Segmentation; Footprint; Feature extraction; Toolbox; Population; Pattern recognition (psychology); Generalizability theory; Transformer; Ranging

Metrics

Cited By

0.00

FWCI (Field Weighted Citation Impact)

Refs

0.43

Citation Normalized Percentile

Topics

Automated Road and Building Extraction

Physical Sciences → Engineering → Ocean Engineering

Remote-Sensing Image Classification

Physical Sciences → Engineering → Media Technology

Advanced Neural Network Applications

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Related Documents

Transformer-based Semantic Segmentation for Large-Scale Building Footprint Extraction from Very-High Resolution Satellite Images

Transformer-based semantic segmentation for large-scale building footprint extraction from very-high resolution satellite images

Transformer-Based Semantic Segmentation for Extraction of Building Footprints from Very-High-Resolution Images

Semantic Segmentation-Based Building Footprint Extraction Using Very High-Resolution Satellite Images and Multi-Source GIS Data

Building footprint extraction from very high-resolution satellite images using deep learning