JOURNAL ARTICLE

Event-Centric Hierarchical Representation for Dense Video Captioning

Teng Wang, Huicheng Zheng, Mingjing Yu, Qian Tian, Haifeng Hu

Year: 2020
Journal: IEEE Transactions on Circuits and Systems for Video Technology
Vol: 31 (5)
Pages: 1890-1900
Publisher: Institute of Electrical and Electronics Engineers

Abstract

Dense video captioning aims to localize and describe multiple events in untrimmed videos, a challenging task that has recently drawn attention in computer vision. Although existing methods have achieved impressive performance, most of them focus only on local information within event segments or on very simple event-level context, overlooking the complexity of event-event relationships and the holistic scene. As a result, the coherence of captions within the same video can suffer. In this article, we propose a novel event-centric hierarchical representation to alleviate this problem. We enhance the event-level representation by capturing rich relationships between events in terms of both temporal structure and semantic meaning. Then, a caption generator with late fusion is developed to generate surrounding-event-aware and topic-aware sentences, conditioned on the hierarchical representation of visual cues from the scene level, the event level, and the frame level. Furthermore, we propose a duplicate removal method, namely temporal-linguistic non-maximum suppression (TL-NMS), to distinguish redundancy in both the localization and captioning stages. Quantitative and qualitative evaluations on the ActivityNet Captions and YouCook2 datasets demonstrate that our method improves the quality of generated captions and achieves state-of-the-art performance on most metrics.
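
The abstract names TL-NMS but does not specify its rule, so the following is only a minimal illustrative sketch of the general idea: greedy non-maximum suppression that treats a candidate event as a duplicate when it both overlaps a kept event in time and carries a similar caption. Everything here is an assumption rather than the authors' method: the `Event` type, the word-level Jaccard `caption_similarity`, the function name `tl_nms_sketch`, the rule that both conditions must hold, and both thresholds are hypothetical.

```python
from typing import List, NamedTuple


class Event(NamedTuple):
    """Hypothetical container for one captioned event proposal."""
    start: float    # start time in seconds
    end: float      # end time in seconds
    score: float    # proposal confidence
    caption: str    # generated sentence


def temporal_iou(a: Event, b: Event) -> float:
    """Intersection over union of two time intervals."""
    inter = max(0.0, min(a.end, b.end) - max(a.start, b.start))
    union = (a.end - a.start) + (b.end - b.start) - inter
    return inter / union if union > 0 else 0.0


def caption_similarity(a: str, b: str) -> float:
    """Word-level Jaccard overlap; a stand-in for a real linguistic measure."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)


def tl_nms_sketch(events: List[Event],
                  tiou_thresh: float = 0.5,
                  sim_thresh: float = 0.5) -> List[Event]:
    """Greedy suppression, highest score first.

    Assumed rule: drop a candidate only if some kept event overlaps it
    in time (tIoU >= tiou_thresh) AND has a similar caption
    (similarity >= sim_thresh).
    """
    kept: List[Event] = []
    for cand in sorted(events, key=lambda e: e.score, reverse=True):
        redundant = any(
            temporal_iou(cand, k) >= tiou_thresh
            and caption_similarity(cand.caption, k.caption) >= sim_thresh
            for k in kept
        )
        if not redundant:
            kept.append(cand)
    return kept


if __name__ == "__main__":
    proposals = [
        Event(0.0, 10.0, 0.9, "a man is cooking"),
        Event(1.0, 10.0, 0.8, "a man is cooking food"),
        Event(2.0, 10.0, 0.7, "the dog runs outside"),
        Event(20.0, 30.0, 0.6, "a man is cooking"),
    ]
    for ev in tl_nms_sketch(proposals):
        print(ev)
```

In this toy input the second proposal is dropped because it overlaps the first in time and repeats most of its caption. The third survives despite the temporal overlap because its caption differs, and the fourth survives because it does not overlap in time.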

Keywords:
Closed captioning, Computer science, Event (particle physics), Redundancy (engineering), Representation (politics), Artificial intelligence, Context (archaeology), Natural language processing, Focus (optics), Image (mathematics)

Metrics

Cited By: 88
FWCI (Field Weighted Citation Impact): 4.72
Refs: 63
Citation Normalized Percentile: 0.96
Is in top 1%
Is in top 10%

Topics

Multimodal Machine Learning Applications
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Human Pose and Action Recognition
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Video Analysis and Summarization
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition