JOURNAL ARTICLE

Constructing Holistic Spatio-Temporal Scene Graph for Video Semantic Role Labeling

Abstract

As one of the core video semantic understanding tasks, Video Semantic Role Labeling (VidSRL) aims to detect the salient events in given videos by recognizing the predicate-argument event structures and the interrelationships between events. While recent work has put forth methods for VidSRL, these methods are mostly subject to two key drawbacks: the lack of fine-grained spatial scene perception and the insufficient modeling of video temporality. To this end, this work explores a novel holistic spatio-temporal scene graph (namely HostSG) representation, based on existing dynamic scene graph structures, which models both the fine-grained spatial semantics and the temporal dynamics of videos for VidSRL. Built upon the HostSG, we present a niche-targeting VidSRL framework. A scene-event mapping mechanism is first designed to bridge the gap between the underlying scene structure and the high-level event semantic structure, resulting in an overall hierarchical scene-event (termed ICE) graph structure. We further perform iterative structure refinement to optimize the ICE graph, e.g., filtering noisy branches and building new informative connections, such that the overall structure representation best matches the demands of the end task. Finally, the three subtask predictions of VidSRL are jointly decoded, where the end-to-end paradigm effectively avoids error propagation. On the benchmark dataset, our framework improves significantly over the current best-performing model. Further analyses are provided for a better understanding of the advances of our methods. Our HostSG representation also shows great potential to facilitate a broader range of other video understanding tasks.
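
The graph construction and refinement ideas described in the abstract can be illustrated with a small toy sketch. This is not the authors' implementation: the names `SceneGraph`, `build_graph`, and `refine`, the score thresholds, and the use of plain (subject, relation, object, score) tuples are all assumptions made for illustration. The sketch links per-frame relation triples with temporal edges between the same object in consecutive frames, then runs one refinement pass that prunes low-scoring edges and adds high-scoring candidate edges.

```python
from dataclasses import dataclass, field


@dataclass
class SceneGraph:
    """Toy spatio-temporal scene graph; nodes are (frame_index, object_name) pairs."""
    edges: dict = field(default_factory=dict)  # (src, dst, label) -> score

    def add_edge(self, src, dst, label, score=1.0):
        self.edges[(src, dst, label)] = score


def build_graph(frames):
    """Build a graph from per-frame lists of (subject, relation, object, score).

    Hypothetical input format, chosen only for this illustration.
    """
    g = SceneGraph()
    seen_prev = set()
    for t, triples in enumerate(frames):
        seen_now = set()
        for subj, rel, obj, score in triples:
            # Spatial edge between two objects inside frame t.
            g.add_edge((t, subj), (t, obj), rel, score)
            seen_now.update((subj, obj))
        # Temporal edge: the same object name appears in consecutive frames.
        for name in sorted(seen_now & seen_prev):
            g.add_edge((t - 1, name), (t, name), "next", 1.0)
        seen_prev = seen_now
    return g


def refine(g, score_fn, candidates=(), drop_below=0.3, add_above=0.8):
    """One refinement pass: drop low-scoring edges, add high-scoring candidates.

    score_fn is a stand-in for a learned edge scorer (an assumption here).
    """
    kept = {e: s for e, s in g.edges.items() if s >= drop_below}
    for src, dst, label in candidates:
        s = score_fn(src, dst, label)
        if s >= add_above and (src, dst, label) not in kept:
            kept[(src, dst, label)] = s
    return SceneGraph(edges=kept)
```

Calling `refine` repeatedly, with a scorer that is updated between passes, would give an iterative variant; in the paper the refinement is learned end to end, which this toy version does not attempt to reproduce.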

Keywords:
Computer science; Scene graph; Graph; Event (particle physics); Semantics (computer science); Artificial intelligence; Salient; Representation (politics); Event structure; Theoretical computer science; Rendering (computer graphics)

Metrics

Cited By: 29
FWCI (Field Weighted Citation Impact): 5.28
Refs: 46
Citation Normalized Percentile: 0.95

Topics

Multimodal Machine Learning Applications
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Human Pose and Action Recognition
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Cancer-related molecular mechanisms research
Life Sciences →  Biochemistry, Genetics and Molecular Biology →  Cancer Research

Related Documents

BOOK-CHAPTER

Meta Spatio-Temporal Debiasing for Video Scene Graph Generation

Xu Li, Haoxuan Qu, Jason Kuen, Jiuxiang Gu, Jun Liu

Lecture notes in computer science Year: 2022 Pages: 374-390

JOURNAL ARTICLE

VR+HD: Video Semantic Reconstruction From Spatio-Temporal Scene Graphs

Chenxing Li, Yiping Duan, Qiyuan Du, Shiqi Sun, Xin Deng, Xiaoming Tao

Journal: IEEE Journal of Selected Topics in Signal Processing Year: 2023 Vol: 17 (5) Pages: 935-948

JOURNAL ARTICLE

Spatio-Temporal Graph-based Semantic Compositional Network for Video Captioning

Shun Li, Zefan Zhang, Yi Ji, Ying Li, Chunping Liu

Journal: 2022 International Joint Conference on Neural Networks (IJCNN) Year: 2022

JOURNAL ARTICLE

Cross-Attentional Spatio-Temporal Semantic Graph Networks for Video Question Answering

Yun Liu, Xiaoming Zhang, Feiran Huang, Bo Zhang, Zhoujun Li

Journal: IEEE Transactions on Image Processing Year: 2022 Vol: 31 Pages: 1684-1696