In this paper, we propose a novel supervised method for summarizing long videos. Many recent approaches have presented promising results in video summarization. However, the videos in most benchmark datasets are short (< 10 minutes), and these methods often do not work well for very long videos (> 1 hour). Furthermore, most approaches use only visual features, even though audio provides useful information for the task. Based on these observations, we present a model that exploits both audio and visual features. To handle long videos, the hierarchical structure of our model captures both short-term and long-term temporal dependencies. Our model also refines the extracted features using adversarial networks. To demonstrate our model, we have collected a new dataset of 28 baseball videos (~3.5 hours each), each accompanied by an editorial summary video that is 5% of the length of the original. Evaluation on this dataset suggests that our method produces quality summaries for very long videos.
Lokesh Kumar Thandaga Nagaraju, Bandi Ranjitha, Jahanara Shaik
Pei Dong, Zhiyong Wang, Zhuo Li, Dagan Feng