JOURNAL ARTICLE

Multi-Scale Transformer Network for Saliency Prediction on 360-Degree Images

Abstract

The latest methods for saliency prediction on 360° images show that better results can be obtained using equirectangular (ERP) images as input. However, because of their limited receptive field, existing convolution-based networks cannot capture long-range information in complex 360° images. Although the transformer has an innate ability to capture long-range correlations through self-attention, its large training-data requirements limit its application to saliency prediction on 360° images. In this paper, we present a novel Multi-scale Transformer framework for Saliency prediction on 360° images (MTSal360). A Multi-scale Transformer Module (MTM) is designed to aggregate contextual long-range information; it includes a Convolutional Positional Encoder (CPE) that enables the model to be trained and tested on the cubic and ERP formats separately, addressing the shortage of training data. Experiments on two public datasets show that MTSal360 outperforms state-of-the-art methods.
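The paper's MTM is not reproduced on this page, so the following is only a minimal, self-contained sketch of the mechanism the abstract appeals to: in scaled dot-product self-attention, every output position is a weighted sum over all input positions, which is why a transformer is not bound by a convolution's receptive field. The function name and the use of raw tokens as queries, keys, and values (no learned projections) are illustrative assumptions, not the authors' architecture.

```python
import math

def self_attention(tokens):
    """Single-head scaled dot-product self-attention over a token list.

    tokens: list of embedding vectors (lists of floats), e.g. flattened
    image patches. Queries, keys, and values are the tokens themselves
    (no learned projections), which is enough to show the mechanism.
    """
    d = len(tokens[0])
    # score every token pair: each position "sees" every other position
    scores = [[sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
               for k in tokens] for q in tokens]
    # row-wise softmax turns scores into attention weights
    weights = []
    for row in scores:
        m = max(row)
        exps = [math.exp(s - m) for s in row]
        z = sum(exps)
        weights.append([e / z for e in exps])
    # each output is a weighted sum over ALL tokens: long-range by design
    out = [[sum(w * tok[j] for w, tok in zip(wrow, tokens))
            for j in range(d)] for wrow in weights]
    return out, weights
```

Because every attention weight is strictly positive, each output token mixes information from every input token in a single layer, whereas a stack of convolutions needs depth proportional to the image span to do the same.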

Keywords:
Computer science; Transformer; Encoder; Artificial intelligence; Pattern recognition (psychology); Convolutional neural network; Computer vision; Data mining; Engineering

Metrics

Cited By: 1
FWCI (Field Weighted Citation Impact): 0.18
References: 21
Citation Normalized Percentile: 0.42

Topics

Visual Attention and Saliency Detection (Physical Sciences → Computer Science → Computer Vision and Pattern Recognition)
Image and Video Quality Assessment (Physical Sciences → Computer Science → Computer Vision and Pattern Recognition)
Advanced Image Fusion Techniques (Physical Sciences → Engineering → Media Technology)