Dingyao Min, Chao Zhang, Yukang Lu, Keren Fu, Qijun Zhao
Video salient object detection (VSOD) aims at locating the most visually attractive objects in video sequences by exploiting spatial and temporal cues. Previous methods mainly utilize convolutional neural networks (CNNs) to fuse or complement RGB and optical flow cues via simple strategies. To take full advantage of CNNs and recently emerged Transformers, this letter proposes a novel mutual-guidance Transformer-embedding network, called MGT-Net, where a mutual-guidance multi-head attention mechanism (MGMA) explores more sophisticated long-range cross-modal interactions. This mechanism is built into a new mutual-guidance Transformer (MGTrans) module that propagates long-range contextual dependencies within one modality, guided by information from the other. To the best of our knowledge, MGT-Net is the first VSOD model that embeds Transformers as modules into CNNs for improved performance. Prior to MGTrans, we also propose and deploy a feature purification module (FPM) to purify noisy backbone features. Experimental results on five benchmark datasets demonstrate the state-of-the-art performance of MGT-Net.
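The mutual-guidance idea can be illustrated with a minimal single-head cross-attention sketch in NumPy: queries come from one modality while keys and values come from the other, so each modality's long-range aggregation is steered by the other's content. All names, shapes, and the single-head simplification are illustrative assumptions; the paper's MGMA is a learned multi-head mechanism inside the MGTrans module.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values, d_k):
    # Scaled dot-product attention where Q comes from one modality
    # and K/V from the other: the output for each query token is a
    # mixture of the OTHER modality's tokens (mutual guidance).
    scores = queries @ keys_values.T / np.sqrt(d_k)   # (n, n) affinities
    return softmax(scores) @ keys_values              # (n, d) guided features

rng = np.random.default_rng(0)
n_tokens, d = 16, 32                      # hypothetical token count / channels
rgb_tokens = rng.standard_normal((n_tokens, d))    # appearance (RGB) features
flow_tokens = rng.standard_normal((n_tokens, d))   # motion (optical flow) features

# Mutual guidance: each modality attends over the other one.
rgb_guided = cross_attention(flow_tokens, rgb_tokens, d)   # flow queries RGB
flow_guided = cross_attention(rgb_tokens, flow_tokens, d)  # RGB queries flow
```

In a multi-head version, the tokens would first pass through learned query/key/value projections per head; the sketch above keeps only the core cross-modal routing.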