In this paper, we propose a novel method for key-shot-based video summarization that uses 3D ConvNets with multi-attention. The method first encodes the video into time-variant 3D frame representations and then applies two steps of visual attention. The first step learns attention weights for the features inside each frame; the second learns attention weights across all frames and thereby assigns each frame an importance score between 0 and 1 for the target summary. The current state-of-the-art method uses 2D ConvNets with self-attention, which loses the dependency of each frame on the next, so that self-attention focuses on fewer features and the relation between keyframes and time is not maintained. Experiments on two standard video summarization datasets, (i) SumMe and (ii) TVSum, show significant improvements, and we report a new state of the art for video summarization on both.
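The two attention steps can be illustrated with a minimal NumPy sketch. This is not the implementation evaluated in the paper. It assumes the 3D ConvNet has already produced a feature tensor of shape (T, N, D), that the intra-frame weights come from a softmax over a linear projection, and that a sigmoid maps each pooled frame vector to a score between 0 and 1. The names `two_step_attention`, `w_spatial`, `w_frame`, and `b_frame` are hypothetical, and the random inputs stand in for real features and learned parameters.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def two_step_attention(features, w_spatial, w_frame, b_frame=0.0):
    """Hypothetical two-step attention over precomputed features.

    features: array of shape (T, N, D) with T frames, N feature
              locations per frame, and D channels per location.
    Returns the intra-frame weights (T, N) and frame scores (T,).
    """
    # Step 1: attention over the features inside each frame.
    spatial_logits = features @ w_spatial                # (T, N)
    spatial_weights = softmax(spatial_logits, axis=1)    # each row sums to 1
    # Weighted sum of locations gives one vector per frame.
    frame_vectors = (spatial_weights[..., None] * features).sum(axis=1)  # (T, D)

    # Step 2: one importance score per frame, squashed into (0, 1).
    # The sigmoid is an assumption; the abstract only states the score range.
    frame_logits = frame_vectors @ w_frame + b_frame     # (T,)
    scores = 1.0 / (1.0 + np.exp(-frame_logits))
    return spatial_weights, scores


rng = np.random.default_rng(0)
T, N, D = 8, 16, 32
features = rng.normal(size=(T, N, D))    # stand-in for 3D ConvNet output
w_spatial = rng.normal(size=D) * 0.1     # stand-in for learned parameters
w_frame = rng.normal(size=D) * 0.1
spatial_weights, scores = two_step_attention(features, w_spatial, w_frame)
```

In a key-shot setting, per-frame scores like `scores` would then be aggregated over shots to select the summary; that selection step is outside this sketch.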
Shikha Sharma, Vijay Prakash Sharma
Yi-Lin Sung, Cheng-Yao Hong, Yen-Chi Hsu, Tyng-Luh Liu
Jingyun Yan, Xiuyun Cao, Jun Jiang, Xingxing Fang
Jingxu Lin, Sheng-hua Zhong, Ahmed Fares