JOURNAL ARTICLE

Multimodal Video Summarization via Time-Aware Transformers

Abstract

With the growing number of videos on video-sharing platforms, how to facilitate searching and browsing of user-generated video has attracted intense attention from the multimedia community. To help people efficiently search and browse relevant videos, video summaries have become important. Prior work in multimodal video summarization mainly treats visual tokens and automatic speech recognition (ASR) tokens as two separate sources and struggles to fuse the multimodal information when generating summaries. Moreover, the time information inside videos is commonly ignored. In this paper, we find that it is important to leverage timestamps to accurately incorporate multimodal signals for the task. We propose a Time-Aware Multimodal Transformer (TAMT) with a novel short-term order-sensitive attention mechanism. The attention mechanism attends to inputs differently based on their time difference, which explores the time information inherent in video more thoroughly. As such, TAMT can better fuse the different modalities for summarizing videos. Experiments show that our proposed approach is effective and achieves state-of-the-art performance on both the YouCookII and open-domain How2 datasets.
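
The abstract does not spell out how the short-term order-sensitive attention is computed, so the following is only a minimal sketch of one way attention could depend on time difference, not the TAMT formulation. It assumes a scaled dot-product score plus a bias that depends on the signed timestamp gap: keys outside a short window are effectively masked ("short-term"), and past and future keys are penalized with different scales ("order-sensitive"). The names `time_bias`, `time_aware_attention`, `window`, `past_scale`, and `future_scale` are hypothetical and chosen for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def time_bias(dt, window=5.0, past_scale=1.0, future_scale=2.0):
    """Hypothetical additive bias for a signed gap dt = t_key - t_query (seconds).

    Assumption, not from the paper: keys farther than `window` seconds away
    get a very large negative bias, and earlier and later keys decay at
    different rates so that the bias depends on order as well as distance.
    """
    if abs(dt) > window:
        return -1e9  # effectively removes the key from the softmax
    scale = past_scale if dt <= 0 else future_scale
    return -scale * abs(dt) / window


def time_aware_attention(queries, keys, values, q_times, k_times, **bias_kwargs):
    """Scaled dot-product attention with a time-difference bias on each score."""
    d = len(queries[0])
    outputs = []
    for q, tq in zip(queries, q_times):
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
            + time_bias(tk - tq, **bias_kwargs)
            for k, tk in zip(keys, k_times)
        ]
        weights = softmax(scores)
        # Weighted sum of the value vectors, one output vector per query.
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        )
    return outputs


if __name__ == "__main__":
    # One visual query at t=10s attending over three ASR keys with identical content.
    # One-hot values make the output equal to the attention weights.
    out = time_aware_attention(
        queries=[[1.0, 0.0]],
        keys=[[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]],
        values=[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]],
        q_times=[10.0],
        k_times=[9.0, 11.0, 100.0],
    )
    print(out[0])
```

In this sketch the three keys carry the same content, so any difference in their weights comes from the time bias alone: the key one second in the past receives more weight than the key one second in the future, and the key 90 seconds away receives essentially none. In a trained model the bias would more plausibly be learned than fixed by hand.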

Keywords:
Automatic summarization; Computer science; Timestamp; Leverage (statistics); Fuse (electrical); Modalities; Transformer; Multimedia; Human–computer interaction; Artificial intelligence; Real-time computing

Metrics

Cited By: 28
FWCI (Field Weighted Citation Impact): 2.45
Refs: 57
Citation Normalized Percentile: 0.90

Topics

Video Analysis and Summarization
Physical Sciences → Computer Science → Computer Vision and Pattern Recognition
Advanced Image and Video Retrieval Techniques
Physical Sciences → Computer Science → Computer Vision and Pattern Recognition
Image Retrieval and Classification Techniques
Physical Sciences → Computer Science → Computer Vision and Pattern Recognition