Xindi Shang, Zehuan Yuan, Anran Wang, Changhu Wang
With the growing number of videos on video-sharing platforms, facilitating the search and browsing of user-generated videos has attracted intense attention from the multimedia community. To help people efficiently find and browse relevant videos, video summaries become important. Prior work on multimodal video summarization mainly treats visual and ASR tokens as two separate sources and struggles to fuse the multimodal information when generating summaries. Moreover, the time information inside videos is commonly ignored. In this paper, we find that it is important to leverage timestamps to accurately incorporate multimodal signals for this task. We propose a Time-Aware Multimodal Transformer (TAMT) with a novel short-term order-sensitive attention mechanism. This mechanism attends to the inputs differently based on their time differences, exploring the temporal information inherent in videos more thoroughly. As such, TAMT can better fuse the different modalities for summarizing videos. Experiments show that our proposed approach is effective and achieves state-of-the-art performance on both the YouCookII and open-domain How2 datasets.
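The abstract does not give the exact formulation of the short-term order-sensitive attention, so the following is only a minimal illustrative sketch of one plausible form: scaled dot-product attention with a short-term window mask and an order-dependent time-difference penalty. The function name, the `window` parameter, and the asymmetric `decay_past`/`decay_future` penalties are assumptions for illustration, not the paper's actual mechanism.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def time_aware_attention(Q, K, V, t_q, t_k,
                         window=5.0, decay_past=0.1, decay_future=0.2):
    """Hypothetical time-aware attention sketch (not the paper's formula).

    Q: (n_q, d) queries, K/V: (n_k, d) keys/values.
    t_q, t_k: token timestamps in seconds.
    Key-query pairs farther apart than `window` seconds are masked out
    (short-term); remaining scores are penalized in proportion to the
    time gap, with different rates for keys before vs. after the query
    (order-sensitive).
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    dt = t_q[:, None] - t_k[None, :]            # signed time differences
    rate = np.where(dt >= 0, decay_past, decay_future)
    scores = scores - rate * np.abs(dt)         # closer-in-time pairs favored
    scores = np.where(np.abs(dt) <= window, scores, -1e9)  # short-term mask
    return softmax(scores, axis=-1) @ V
```

With equal content similarity, this biases each query toward temporally nearby tokens, which is one simple way to let attention depend on time difference rather than only on token content.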