Effectively modeling both spatial information and temporal dynamics is crucial to Video Salient Object Detection (VSOD). Several recent works use the self-attention mechanism to capture spatiotemporal information because it can model long-range dependencies among patch tokens. However, these models assign similar receptive fields across the spatiotemporal feature maps, which limits their ability to handle frames that contain multiple salient objects of different scales. To address this issue, we propose a Multi-Scale Self-Attention (MSSA) operation that better models the spatiotemporal features of salient objects at different scales. Experimental results demonstrate that, by using the MSSA operation, our method achieves better performance on challenging datasets.
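To make the idea of attending at several scales concrete, the following is a minimal sketch and not the MSSA operation itself, whose details are not given here. It assumes a 1D sequence of patch tokens, identity query/key/value projections, average pooling to form coarser keys and values, and a plain mean to fuse the scales; the names `avg_pool`, `attention`, `multi_scale_self_attention`, and the `scales` parameter are all hypothetical.

```python
import math

def avg_pool(tokens, scale):
    """Average consecutive groups of `scale` tokens (the last group may be shorter)."""
    pooled = []
    for start in range(0, len(tokens), scale):
        group = tokens[start:start + scale]
        dim = len(group[0])
        pooled.append([sum(t[d] for t in group) / len(group) for d in range(dim)])
    return pooled

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention without learned projections (an assumption)."""
    dim = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, values))
                    for d in range(len(values[0]))])
    return out

def multi_scale_self_attention(tokens, scales=(1, 2, 4)):
    """Attend from full-resolution queries to keys/values pooled at each scale,
    then average the per-scale outputs (the fusion rule is an assumption)."""
    outputs = []
    for s in scales:
        kv = avg_pool(tokens, s)  # coarser tokens cover a larger receptive field
        outputs.append(attention(tokens, kv, kv))
    n, dim = len(tokens), len(tokens[0])
    return [[sum(o[i][d] for o in outputs) / len(outputs) for d in range(dim)]
            for i in range(n)]

if __name__ == "__main__":
    toy_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
    for row in multi_scale_self_attention(toy_tokens):
        print(row)
```

In this sketch each query keeps its full resolution while the keys and values are summarized over windows of different sizes, so a single token can relate to both fine and coarse context. A real VSOD model would operate on 2D or 3D feature maps with learned projections.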
Yuzhu Ji, Haijun Zhang, Q. M. Jonathan Wu
Shanmei Lu, Qiang Guo, Ren Wang, Caiming Zhang
Jing Zhang, Yuchao Dai, Bo Li, Mingyi He
Chenchu Xu, Zhifan Gao, Heye Zhang, Shuo Li, Victor Hugo C. de Albuquerque
Yuchao Gu, Lijuan Wang, Ziqin Wang, Yun Liu, Ming-Ming Cheng, Shao-Ping Lu