Shuwen Xiao, Zhou Zhao, Zijian Zhang, Ziyu Guan, Deng Cai
This paper addresses the task of query-focused video summarization, which takes user queries and long videos as inputs and generates query-focused video summaries. Compared to generic video summarization, which mainly concentrates on finding the most diverse and representative visual content as a summary, query-focused video summarization also considers the user's intent and the semantic meaning of the generated summary. In this paper, we propose a method named query-biased self-attentive network (QSAN) to tackle this challenge. Our key idea is to utilize the semantic information from video descriptions to generate a generic summary and then to combine it with information from the query to generate a query-focused summary. Specifically, we first propose a hierarchical self-attentive network to model relationships at three levels: among different frames within a segment, among different segments of the same video, and between the textual information of a video description and its related visual content. We train the model on a video captioning dataset and employ a reinforced caption generator to produce a video description, which helps us locate important frames or shots. We then build a query-aware scoring module to compute a query-relevance score for each shot and generate the query-focused summary. Extensive experiments on the benchmark dataset demonstrate the competitive performance of our approach compared to existing methods.
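Since the abstract describes the architecture only at a high level, the following is a minimal PyTorch sketch of how a query-aware shot-scoring module of this kind could be wired up. It is not the authors' implementation: the module name QueryAwareScorer, all dimensions, the single self-attention layer standing in for the hierarchical network, and the cosine-similarity scoring are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QueryAwareScorer(nn.Module):
    """Sketch of a query-aware shot-scoring module (assumed design).

    Shot features are first contextualized with self-attention
    (a simplified stand-in for the paper's hierarchical
    self-attentive network), then scored against an encoded
    user query. All sizes and the scoring function are
    illustrative assumptions, not the QSAN implementation.
    """

    def __init__(self, feat_dim=512, query_dim=300, hidden=256, heads=4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(feat_dim, heads, batch_first=True)
        self.shot_proj = nn.Linear(feat_dim, hidden)
        self.query_proj = nn.Linear(query_dim, hidden)

    def forward(self, shots, query):
        # shots: (batch, num_shots, feat_dim) visual features per shot
        # query: (batch, query_dim) pooled embedding of the user query
        ctx, _ = self.self_attn(shots, shots, shots)  # shot-level self-attention
        s = self.shot_proj(ctx)                       # (batch, num_shots, hidden)
        q = self.query_proj(query).unsqueeze(1)       # (batch, 1, hidden)
        # cosine-style relevance between each shot and the query
        scores = F.cosine_similarity(s, q.expand_as(s), dim=-1)
        return scores                                 # (batch, num_shots)

if __name__ == "__main__":
    scorer = QueryAwareScorer()
    shots = torch.randn(2, 20, 512)    # 2 videos, 20 shots each
    query = torch.randn(2, 300)        # pooled query embeddings
    print(scorer(shots, query).shape)  # torch.Size([2, 20])
```

Given such per-shot relevance scores, a query-focused summary would typically be assembled by selecting the highest-scoring shots subject to a summary-length budget.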