JOURNAL ARTICLE

Spatio-Temporal Interaction Graph Parsing Networks for Human-Object Interaction Recognition

Abstract

For a video-based Human-Object Interaction scene, modeling the spatio-temporal relationship between humans and objects is an important cue for understanding the contextual information presented in the video. With efficient spatio-temporal relationship modeling, it is possible not only to uncover contextual information in each frame, but also to capture inter-frame dependencies directly. Capturing the position changes of humans and objects over the spatio-temporal dimension becomes more critical when appearance features change little over time. When appearance features are used, spatial location and semantic information are also key to improving video-based Human-Object Interaction recognition performance. In this paper, Spatio-Temporal Interaction Graph Parsing Networks (STIGPN) are constructed, which encode a video as a graph composed of human and object nodes. These nodes are connected by two types of relations: (i) intra-frame relations, which model the interactions between the human and the interacted objects within each frame; and (ii) inter-frame relations, which capture the long-range dependencies between the human and the interacted objects across frames. With this graph, STIGPN learn spatio-temporal features directly from whole video-based Human-Object Interaction scenes. Multi-modal features and a multi-stream fusion strategy are used to enhance the reasoning capability of STIGPN. Two Human-Object Interaction video datasets, CAD-120 and Something-Else, are used to evaluate the proposed architectures, and the state-of-the-art performance demonstrates the superiority of STIGPN. Code for STIGPN is available at https://github.com/GuangmingZhu/STIGPN.
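
The graph described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation (see the linked repository for that); it only enumerates the two edge types under stated assumptions. The function names `node_id` and `build_interaction_graph` are hypothetical, each frame is assumed to contain one human and a fixed number of objects, and inter-frame edges are assumed to link the human in one frame to the objects in every other frame.

```python
from itertools import product


def node_id(frame, entity, nodes_per_frame):
    """Flatten a (frame, entity) pair into a single node index.

    Assumption for this sketch: entity 0 is the human and
    entities 1..num_objects are the objects in that frame.
    """
    return frame * nodes_per_frame + entity


def build_interaction_graph(num_frames, num_objects):
    """Return (intra_edges, inter_edges) as lists of (human, object) node pairs."""
    nodes_per_frame = 1 + num_objects
    intra_edges = []
    inter_edges = []

    # Intra-frame relations: the human and each object in the same frame.
    for t in range(num_frames):
        human = node_id(t, 0, nodes_per_frame)
        for k in range(1, nodes_per_frame):
            intra_edges.append((human, node_id(t, k, nodes_per_frame)))

    # Inter-frame relations (assumed form): the human in frame t and
    # each object in every other frame s, so distant frames are linked.
    for t, s in product(range(num_frames), repeat=2):
        if t == s:
            continue
        human = node_id(t, 0, nodes_per_frame)
        for k in range(1, nodes_per_frame):
            inter_edges.append((human, node_id(s, k, nodes_per_frame)))

    return intra_edges, inter_edges


if __name__ == "__main__":
    intra, inter = build_interaction_graph(num_frames=3, num_objects=2)
    print(len(intra), "intra-frame edges;", len(inter), "inter-frame edges")
```

Keeping the two edge lists separate mirrors the two relation types named in the abstract; a real model would attach features to the nodes and pass each edge set to its own graph layer.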

Keywords:
Computer science; ENCODE; Parsing; Artificial intelligence; Object (grammar); Graph; Frame (networking); Computer vision; Spatial relation; Pattern recognition (psychology); Theoretical computer science

Metrics

Cited By: 27
FWCI (Field Weighted Citation Impact): 2.25
Refs: 48
Citation Normalized Percentile: 0.89

Topics

Human Pose and Action Recognition
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Multimodal Machine Learning Applications
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition
Video Surveillance and Tracking Methods
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition

Related Documents

JOURNAL ARTICLE

Exploring Spatio–Temporal Graph Convolution for Video-Based Human–Object Interaction Recognition

Ning Wang, Guangming Zhu, Hongsheng Li, Mingtao Feng, Xia Zhao, Lan Ni, Peiyi Shen, Lin Mei, Liang Zhang

Journal: IEEE Transactions on Circuits and Systems for Video Technology Year: 2023 Vol: 33 (10) Pages: 5814-5827
JOURNAL ARTICLE

STIT: Spatio-Temporal Interaction Transformers for Human-Object Interaction Recognition in Videos

Muna Almushyti, Frederick W. B. Li

Journal: 2022 26th International Conference on Pattern Recognition (ICPR) Year: 2022 Pages: 3287-3294
JOURNAL ARTICLE

Language-guided graph parsing attention network for human-object interaction recognition

Qiyue Li, Xuemei Xie, Jin Zhang, Guangming Shi

Journal: Journal of Visual Communication and Image Representation Year: 2022 Vol: 89 Pages: 103640-103640
JOURNAL ARTICLE

Cascaded Parsing of Human-Object Interaction Recognition

Tianfei Zhou, Siyuan Qi, Wenguan Wang, Jianbing Shen, Song‐Chun Zhu

Journal: IEEE Transactions on Pattern Analysis and Machine Intelligence Year: 2021 Vol: 44 (6) Pages: 2827-2840