JOURNAL ARTICLE

Video Summarization Using Deep 3D ConvNets with Multi-Attention

Abstract

In this paper, we propose a novel method for key-shot-based video summarization that introduces 3D ConvNets with multi-attention. The process starts by encoding the video data into time-variant 3D frame representations, followed by two steps of visual attention. The first step learns attention weights for the features inside each frame; the second step learns attention weights across all frames, thereby assigning each frame an importance score between 0 and 1 for the target summary. The current state-of-the-art method uses 2D ConvNets with self-attention and therefore loses the dependency of each frame on the next, which causes self-attention to focus on fewer features; as a result, the relation between keyframes and time is not maintained. Experimental studies evaluating the proposed approach on two standard video summarization datasets, (i) SumMe and (ii) TVSum, produced significant improvements. We report a new state of the art for video summarization on these datasets.
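
The two attention steps described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' architecture: the 3D ConvNet encoder is omitted and replaced by random per-frame feature vectors, and the function name `two_step_attention`, the weight matrices `w_feat` and `w_frame`, the softmax over features, and the sigmoid used to squash frame scores into the 0 to 1 range are all assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def two_step_attention(features, w_feat, w_frame):
    """Hypothetical two-step attention over per-frame features.

    features: (T, D) array, one D-dimensional descriptor per frame.
    w_feat:   (D, D) projection used to score features within a frame (assumed).
    w_frame:  (D,) projection used to score whole frames (assumed).
    """
    # Step 1: attention weights over the features inside each frame.
    feat_weights = softmax(features @ w_feat, axis=1)   # (T, D), rows sum to 1
    attended = feat_weights * features                  # re-weighted features

    # Step 2: one importance score per frame, squashed into (0, 1).
    frame_scores = sigmoid(attended @ w_frame)          # (T,)
    return feat_weights, frame_scores

# Stand-in for encoder output: 8 frames, 16 features each (random, illustrative).
rng = np.random.default_rng(0)
T, D = 8, 16
features = rng.normal(size=(T, D))
w_feat = rng.normal(size=(D, D)) * 0.1
w_frame = rng.normal(size=D) * 0.1

feat_weights, frame_scores = two_step_attention(features, w_feat, w_frame)
```

In a key-shot pipeline, per-frame scores like `frame_scores` would typically be aggregated per shot before selecting the summary; that selection step is not described in the abstract and is left out here.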

Keywords:
Automatic summarization; Computer science; Frame (networking); Artificial intelligence; Task (project management); Key (lock); Process (computing); Dependency (UML); Encoding (memory); Relation (database); Computer vision; Pattern recognition (psychology); Data mining

Metrics

Cited By: 1
FWCI (Field Weighted Citation Impact): 0.18
Refs: 24
Citation Normalized Percentile: 0.43

Topics

Video Analysis and Summarization
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Music and Audio Processing
Physical Sciences →  Computer Science →  Signal Processing
Advanced Image and Video Retrieval Techniques
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition