Xu Lin, Chunmei Qing, Junpeng Tan, Xiangmin Xu
Recent methods for saliency prediction on 360° images show that better results can be obtained by using equirectangular projection (ERP) images as input. Because of their limited receptive fields, existing convolution-based networks cannot capture long-range information in complex 360° images. Although the transformer has an innate ability to capture long-range correlations through self-attention, its need for large training datasets limits its application to saliency prediction on 360° images. In this paper, we present a novel Multi-scale Transformer framework for Saliency prediction on 360° images (MTSal360). The Multi-scale Transformer Module (MTM) is designed to aggregate long-range contextual information. It includes a Convolutional Positional Encoder (CPE) that allows the model to be trained and tested on the cubic and ERP formats separately, which addresses the shortage of training data. Experiments on two public datasets show that MTSal360 outperforms state-of-the-art methods.
Marc Assens, Xavier Giró-i-Nieto, Kevin McGuinness, Noel E. O’Connor
Chuong Hoang Vo, Jui‐Chiu Chiang, Duy H. Le, Thu Nguyen, Tuan Van Pham
Xiaofei Zhou, Songhe Wu, Ran Shi, Bolun Zheng, Shuai Wang, Haibing Yin, Jiyong Zhang, Chenggang Yan