Hongbo Bi, Di Lu, Ning Li, Lina Yang, Hua-Ping Guan
This paper proposes a fast detection model for video salient objects based on a recurrent network architecture. Firstly, a multi-level attention (MLA) module is designed that integrates multi-level feature maps in a cascaded manner, effectively extracting both semantic and detailed information within each frame. These spatial features are then fed into a deeper bidirectional ConvLSTM to learn temporal dependencies. Secondly, the output of the forward pass is used as the input to the backward pass, from which deeper temporal dependencies are extracted. Finally, we present a spatial-temporal fused bidirectional ConvLSTM framework, which reduces the accumulated memory in the bidirectional ConvLSTM through an element-level fusion strategy. Experimental results show that the proposed method achieves the best detection precision on two challenging benchmarks, the ViSal and FBMS datasets, while running in real time at 23 fps.
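The recurrent part of the pipeline described above can be sketched in plain Python. A forward ConvLSTM runs over per-frame spatial features, a backward ConvLSTM takes the forward outputs as its input, and the result is fused element by element. This is a toy sketch under stated assumptions, not the authors' implementation. It assumes single-channel 1-D feature maps instead of multi-channel 2-D maps, random untrained weights, and a standard ConvLSTM cell without peephole terms. It also assumes that element-wise addition of each frame's spatial features and its backward-pass output stands in for the paper's element-level fusion. The names `ConvLSTMCell1D`, `run_direction`, and `deeper_bidirectional` are hypothetical, and the MLA module is not modeled.

```python
import math
import random


def conv1d_same(x, k):
    """1-D cross-correlation with zero padding so the output keeps len(x)."""
    r = len(k) // 2
    out = []
    for i in range(len(x)):
        s = 0.0
        for j, w in enumerate(k):
            idx = i + j - r
            if 0 <= idx < len(x):
                s += w * x[idx]
        out.append(s)
    return out


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


class ConvLSTMCell1D:
    """Hypothetical toy single-channel ConvLSTM cell on 1-D maps (no peepholes)."""

    GATES = ("i", "f", "o", "g")

    def __init__(self, kernel_size=3, seed=0):
        rng = random.Random(seed)

        def init():
            return [rng.uniform(-0.5, 0.5) for _ in range(kernel_size)]

        self.wx = {name: init() for name in self.GATES}  # input-to-gate kernels
        self.wh = {name: init() for name in self.GATES}  # hidden-to-gate kernels
        self.b = {name: 0.0 for name in self.GATES}      # gate biases

    def step(self, x, h, c):
        # Gate pre-activations: conv(x) + conv(h) + bias, element by element.
        pre = {}
        for name in self.GATES:
            zx = conv1d_same(x, self.wx[name])
            zh = conv1d_same(h, self.wh[name])
            pre[name] = [a + b + self.b[name] for a, b in zip(zx, zh)]
        i = [sigmoid(v) for v in pre["i"]]      # input gate
        f = [sigmoid(v) for v in pre["f"]]      # forget gate
        o = [sigmoid(v) for v in pre["o"]]      # output gate
        g = [math.tanh(v) for v in pre["g"]]    # candidate cell state
        c_new = [fv * cv + iv * gv for fv, cv, iv, gv in zip(f, c, i, g)]
        h_new = [ov * math.tanh(cv) for ov, cv in zip(o, c_new)]
        return h_new, c_new


def run_direction(cell, seq, reverse=False):
    """Run one ConvLSTM direction over a sequence; outputs stay in frame order."""
    n = len(seq[0])
    h, c = [0.0] * n, [0.0] * n
    steps = reversed(range(len(seq))) if reverse else range(len(seq))
    out = [None] * len(seq)
    for t in steps:
        h, c = cell.step(seq[t], h, c)
        out[t] = h
    return out


def deeper_bidirectional(spatial, fwd_cell, bwd_cell):
    """Forward pass over spatial features; the backward pass consumes forward outputs."""
    forward = run_direction(fwd_cell, spatial)
    backward = run_direction(bwd_cell, forward, reverse=True)
    # Assumed element-level spatial-temporal fusion: element-wise sum of each
    # frame's spatial feature map and its backward-pass output.
    return [[s + b for s, b in zip(s_t, b_t)] for s_t, b_t in zip(spatial, backward)]


frames = [[math.sin(0.5 * t + 0.3 * k) for k in range(6)] for t in range(4)]
fused = deeper_bidirectional(frames, ConvLSTMCell1D(seed=1), ConvLSTMCell1D(seed=2))
print(len(fused), len(fused[0]))  # 4 6
```

Chaining the backward pass onto the forward outputs, rather than running both directions on the raw features, means the fused map for frame 0 already reflects every later frame. A forward-only output at frame 0 sees only frame 0.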
Zhao Liu, Zhenyang Wang, Xinhui Song, Chun Chen
Weisheng Li, Siqin Feng, Hua-Ping Guan, Ziwei Zhan, Gong Cheng
Rahma Kalboussi, Mehrez Abdellaoui, Ali Douik
Alok Thakur, Niraj Tiwari
Qian Li, Shifeng Chen, Beiwei Zhang