JOURNAL ARTICLE

Transformer-Based Multi-Scale Feature Integration Network for Video Saliency Prediction

Xiaofei Zhou, Songhe Wu, Ran Shi, Bolun Zheng, Shuai Wang, Haibing Yin, Jiyong Zhang, Chenggang Yan

Year: 2023 | Journal: IEEE Transactions on Circuits and Systems for Video Technology | Vol: 33 (12) | Pages: 7696-7707 | Publisher: Institute of Electrical and Electronics Engineers

Abstract

Most cutting-edge video saliency prediction models rely on spatiotemporal features extracted by 3D convolutions, owing to their ability to acquire local contextual cues. However, 3D convolutions cannot effectively capture long-term spatiotemporal dependencies in videos. To address this limitation, we propose a novel Transformer-based Multi-scale Feature Integration Network (TMFI-Net) for video saliency prediction, which consists of a semantic-guided encoder and a hierarchical decoder. Firstly, starting from Transformer-based multi-level spatiotemporal features, the semantic-guided encoder enhances the features by injecting the high-level feature into each level's feature via a top-down pathway and a longitudinal connection, which endows the multi-level spatiotemporal features with rich contextual information. In this way, the features are steered to focus more on salient regions. Secondly, the hierarchical decoder employs a multi-dimensional attention (MA) module to enhance features jointly along the channel, temporal, and spatial dimensions. It then deploys a progressive decoding block to produce an initial saliency prediction, which provides a coarse localization of salient regions. Lastly, exploiting the complementarity of the different saliency predictions, we integrate all initial saliency prediction results into the final saliency map. Comprehensive experimental results on four video saliency datasets demonstrate that our model outperforms state-of-the-art video saliency models. The code is available at https://github.com/wusonghe/TMFI-Net.
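The two decoder ideas in the abstract — gating features jointly along channel, temporal, and spatial dimensions, then fusing several initial saliency predictions into one final map — can be sketched in a few lines. This is a simplified illustration under assumed tensor shapes, not the TMFI-Net implementation; the function names and the average-pool-plus-sigmoid gating are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def multi_dim_attention(feat):
    """Illustrative multi-dimensional attention over a (C, T, H, W) feature map.

    Gates the features along the channel, temporal, and spatial dimensions
    jointly, each gate derived from a global-average-pooled descriptor passed
    through a sigmoid. A minimal sketch, not the paper's MA module.
    """
    C, T, H, W = feat.shape
    ch = sigmoid(feat.mean(axis=(1, 2, 3))).reshape(C, 1, 1, 1)  # channel gate
    tm = sigmoid(feat.mean(axis=(0, 2, 3))).reshape(1, T, 1, 1)  # temporal gate
    sp = sigmoid(feat.mean(axis=(0, 1))).reshape(1, 1, H, W)     # spatial gate
    # Joint re-weighting along all three dimensions
    return feat * ch * tm * sp

def fuse_saliency(initial_maps):
    """Fuse complementary initial saliency predictions into the final map.

    Here fusion is plain averaging followed by min-max normalization;
    the paper's integration scheme may differ.
    """
    fused = np.mean(np.stack(initial_maps, axis=0), axis=0)
    return (fused - fused.min()) / (fused.max() - fused.min() + 1e-8)
```

The attention sketch preserves the input shape, so it can be dropped between decoder stages; the fusion step accepts any number of same-sized initial predictions.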

Keywords:
Computer science, Artificial intelligence, Feature extraction, Pattern recognition, Scale, Computer vision

Metrics

Cited By: 52
FWCI (Field-Weighted Citation Impact): 9.46
Refs: 94
Citation Normalized Percentile: 0.98
Is in top 1%
Is in top 10%

Topics

Visual Attention and Saliency Detection
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Image and Video Quality Assessment
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Advanced Image Fusion Techniques
Physical Sciences →  Engineering →  Media Technology

Related Documents

JOURNAL ARTICLE

Transformer-based multi-level attention integration network for video saliency prediction

Rui Tan, Minghui Sun, Yanhua Liang

Journal: Multimedia Tools and Applications | Year: 2024 | Vol: 84 (13) | Pages: 11833-11854
JOURNAL ARTICLE

Multi-Scale Spatiotemporal Feature Fusion Network for Video Saliency Prediction

Yunzuo Zhang, Tian Zhang, Cunyu Wu, Ran Tao

Journal: IEEE Transactions on Multimedia | Year: 2023 | Vol: 26 | Pages: 4183-4193
JOURNAL ARTICLE

TM2SP: A Transformer-Based Multi-Level Spatiotemporal Feature Pyramid Network for Video Saliency Prediction

C.L. Li, Shiguang Liu

Journal: IEEE Transactions on Circuits and Systems for Video Technology | Year: 2025 | Vol: 35 (6) | Pages: 5236-5250