Recently, deep neural networks have become crucial techniques for image saliency detection. However, two difficulties hinder the development of deep learning in video saliency detection. The first is that traditional static networks cannot perform robust motion estimation in videos. The other is that data-driven deep learning lacks sufficient manually annotated pixel-wise ground truths for training video saliency networks. In this paper, we propose a multi-scale spatiotemporal convolutional LSTM network (MSST-ConvLSTM) that incorporates spatial and temporal cues for video salient object detection. Furthermore, as manual pixel-wise labeling is very time-consuming, we annotate a large number of coarse labels, which are mixed with fine labels to train a robust saliency prediction model. Experiments on widely used, challenging benchmark datasets (e.g., FBMS and DAVIS) demonstrate that the proposed approach achieves competitive video saliency detection performance compared with state-of-the-art saliency models.
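The core building block named in the abstract, a convolutional LSTM, replaces the matrix products in an ordinary LSTM with convolutions, so the recurrent state retains the spatial layout of the frame and can carry temporal cues across a video. The following is a minimal single-channel sketch of such a cell in plain NumPy; it is an illustrative toy (naive convolution loop, random weights, no multi-scale branches or saliency head), not the authors' MSST-ConvLSTM, and all class and variable names are hypothetical.

```python
import numpy as np

def conv2d(x, k):
    """'Same'-padded 2D cross-correlation of a single-channel map (naive loops)."""
    ph, pw = k.shape[0] // 2, k.shape[1] // 2
    xp = np.pad(x, ((ph, ph), (pw, pw)))
    out = np.zeros_like(x)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(xp[i:i + k.shape[0], j:j + k.shape[1]] * k)
    return out

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class ConvLSTMCell:
    """Toy single-channel ConvLSTM cell: the four gates (input, forget,
    output, candidate) are computed by convolving the current frame and
    the previous hidden state, so h and c stay H-by-W maps."""
    def __init__(self, ksize=3, seed=0):
        rng = np.random.default_rng(seed)
        # one input kernel and one hidden-state kernel per gate
        self.Wx = rng.standard_normal((4, ksize, ksize)) * 0.1
        self.Wh = rng.standard_normal((4, ksize, ksize)) * 0.1
        self.b = np.zeros(4)

    def step(self, x, h, c):
        pre = [conv2d(x, self.Wx[k]) + conv2d(h, self.Wh[k]) + self.b[k]
               for k in range(4)]
        i, f, o = sigmoid(pre[0]), sigmoid(pre[1]), sigmoid(pre[2])
        g = np.tanh(pre[3])        # candidate cell update
        c_new = f * c + i * g      # gated memory of past frames
        h_new = o * np.tanh(c_new) # spatial hidden map passed to next frame
        return h_new, c_new

# Run a short sequence of random "frames" through the cell.
H = W = 8
cell = ConvLSTMCell()
h = c = np.zeros((H, W))
for t in range(3):
    x = np.random.default_rng(t).standard_normal((H, W))
    h, c = cell.step(x, h, c)
print(h.shape)
```

Because the hidden state is itself a feature map, stacking such cells at several resolutions (the "multi-scale" part of the abstract) lets a saliency decoder combine fine spatial detail with motion context accumulated over time.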