JOURNAL ARTICLE

Transformer-Based Multi-Scale Feature Integration Network for Video Saliency Prediction

Xiaofei Zhou, Songhe Wu, Ran Shi, Bolun Zheng, Shuai Wang, Haibing Yin, Jiyong Zhang, Chenggang Yan

Year: 2023 | Journal: IEEE Transactions on Circuits and Systems for Video Technology | Vol: 33 (12) | Pages: 7696-7707 | Publisher: Institute of Electrical and Electronics Engineers

Abstract

Most cutting-edge video saliency prediction models rely on spatiotemporal features extracted by 3D convolutions, owing to their ability to acquire local contextual cues. However, 3D convolutions cannot effectively capture long-term spatiotemporal dependencies in videos. To address this limitation, we propose a novel Transformer-based Multi-scale Feature Integration Network (TMFI-Net) for video saliency prediction, which consists of a semantic-guided encoder and a hierarchical decoder. Firstly, starting from Transformer-based multi-level spatiotemporal features, the semantic-guided encoder enhances the features by injecting the high-level feature into each level's feature via a top-down pathway and a longitudinal connection, which endows the multi-level spatiotemporal features with rich contextual information. In this way, the features are steered to focus more on salient regions. Secondly, the hierarchical decoder employs a multi-dimensional attention (MA) module to enhance features jointly along the channel, temporal, and spatial dimensions. It then deploys a progressive decoding block to produce an initial saliency prediction, which provides a coarse localization of salient regions. Lastly, exploiting the complementarity of the different saliency predictions, we integrate all initial saliency prediction results into the final saliency map. Comprehensive experimental results on four video saliency datasets demonstrate that our model outperforms state-of-the-art video saliency models. The code is available at https://github.com/wusonghe/TMFI-Net.
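The two decoder ideas in the abstract — gating features jointly along channel, temporal, and spatial dimensions, then fusing several initial saliency predictions into one final map — can be sketched in a few lines. This is a simplified illustration under assumed tensor shapes, not the TMFI-Net implementation; the function names and the average-pool-plus-sigmoid gating are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def multi_dim_attention(feat):
    """Illustrative multi-dimensional attention over a (C, T, H, W) feature map.

    Gates the features along the channel, temporal, and spatial dimensions
    jointly, each gate derived from a global-average-pooled descriptor passed
    through a sigmoid. A minimal sketch, not the paper's MA module.
    """
    C, T, H, W = feat.shape
    ch = sigmoid(feat.mean(axis=(1, 2, 3))).reshape(C, 1, 1, 1)  # channel gate
    tm = sigmoid(feat.mean(axis=(0, 2, 3))).reshape(1, T, 1, 1)  # temporal gate
    sp = sigmoid(feat.mean(axis=(0, 1))).reshape(1, 1, H, W)     # spatial gate
    # Joint re-weighting along all three dimensions
    return feat * ch * tm * sp

def fuse_saliency(initial_maps):
    """Fuse complementary initial saliency predictions into the final map.

    Here fusion is plain averaging followed by min-max normalization;
    the paper's integration scheme may differ.
    """
    fused = np.mean(np.stack(initial_maps, axis=0), axis=0)
    return (fused - fused.min()) / (fused.max() - fused.min() + 1e-8)
```

The attention sketch preserves the input shape, so it can be dropped between decoder stages; the fusion step accepts any number of same-sized initial predictions.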

Keywords:
Computer science, Artificial intelligence, Feature extraction, Pattern recognition, Scale, Computer vision

Metrics

Cited By: 52
FWCI (Field-Weighted Citation Impact): 9.46
Refs: 94
Citation Normalized Percentile: 0.98
Is in top 1%
Is in top 10%

Topics

Visual Attention and Saliency Detection
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Image and Video Quality Assessment
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Advanced Image Fusion Techniques
Physical Sciences →  Engineering →  Media Technology

Related Documents

JOURNAL ARTICLE

Transformer-based multi-level attention integration network for video saliency prediction

Rui Tan, Minghui Sun, Yanhua Liang

Journal: Multimedia Tools and Applications | Year: 2024 | Vol: 84 (13) | Pages: 11833-11854
JOURNAL ARTICLE

Multi-Scale Spatiotemporal Feature Fusion Network for Video Saliency Prediction

Yunzuo Zhang, Tian Zhang, Cunyu Wu, Ran Tao

Journal: IEEE Transactions on Multimedia | Year: 2023 | Vol: 26 | Pages: 4183-4193
JOURNAL ARTICLE

TM2SP: A Transformer-Based Multi-Level Spatiotemporal Feature Pyramid Network for Video Saliency Prediction

C.L. Li, Shiguang Liu

Journal: IEEE Transactions on Circuits and Systems for Video Technology | Year: 2025 | Vol: 35 (6) | Pages: 5236-5250