Xindi Shang, Zehuan Yuan, Anran Wang, Changhu Wang
With the growing number of videos on video-sharing platforms, facilitating the search and browsing of user-generated videos has attracted intense attention from the multimedia community. To help people efficiently find and browse relevant videos, video summaries become important. Prior work on multimodal video summarization mainly treats visual and ASR tokens as two separate sources and struggles to fuse the multimodal information when generating summaries. Moreover, the time information inside videos is commonly ignored. In this paper, we find that it is important to leverage timestamps to accurately incorporate multimodal signals for this task. We propose a Time-Aware Multimodal Transformer (TAMT) with a novel short-term order-sensitive attention mechanism. This mechanism attends to the inputs differently based on their time differences, exploring the temporal information inherent in videos more thoroughly. As such, TAMT can better fuse the different modalities for summarizing videos. Experiments show that our proposed approach is effective and achieves state-of-the-art performance on both the YouCookII and open-domain How2 datasets.
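The abstract does not give the exact formulation of the short-term order-sensitive attention, so the following is only a minimal illustrative sketch of one plausible form: scaled dot-product attention with a short-term window mask and an order-dependent time-difference penalty. The function name, the `window` parameter, and the asymmetric `decay_past`/`decay_future` penalties are assumptions for illustration, not the paper's actual mechanism.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def time_aware_attention(Q, K, V, t_q, t_k,
                         window=5.0, decay_past=0.1, decay_future=0.2):
    """Hypothetical time-aware attention sketch (not the paper's formula).

    Q: (n_q, d) queries, K/V: (n_k, d) keys/values.
    t_q, t_k: token timestamps in seconds.
    Key-query pairs farther apart than `window` seconds are masked out
    (short-term); remaining scores are penalized in proportion to the
    time gap, with different rates for keys before vs. after the query
    (order-sensitive).
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    dt = t_q[:, None] - t_k[None, :]            # signed time differences
    rate = np.where(dt >= 0, decay_past, decay_future)
    scores = scores - rate * np.abs(dt)         # closer-in-time pairs favored
    scores = np.where(np.abs(dt) <= window, scores, -1e9)  # short-term mask
    return softmax(scores, axis=-1) @ V
```

With equal content similarity, this biases each query toward temporally nearby tokens, which is one simple way to let attention depend on time difference rather than only on token content.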