Spatio-Temporal Interaction Graph Parsing Networks for Human-Object Interaction Recognition

Ning Wang; Guangming Zhu; Liang Zhang; Peiyi Shen; Hongsheng Li; Cong Hua

doi:10.1145/3474085.3475636

JOURNAL ARTICLE

Year: 2021 Pages: 4985-4993

Abstract

In a video-based Human-Object Interaction scene, modeling the spatio-temporal relationship between humans and objects is an important cue for understanding the contextual information presented in the video. With efficient spatio-temporal relationship modeling, it is possible not only to uncover contextual information in each frame, but also to directly capture inter-frame dependencies. Capturing the position changes of humans and objects over the spatio-temporal dimension becomes more critical when appearance features may not change significantly over time. When appearance features are used, spatial location and semantic information are also key to improving video-based Human-Object Interaction recognition performance. In this paper, Spatio-Temporal Interaction Graph Parsing Networks (STIGPN) are constructed, which encode videos with a graph composed of human and object nodes. These nodes are connected by two types of relations: (i) intra-frame relations, which model the interactions between the human and the interacted objects within each frame; and (ii) inter-frame relations, which capture the long-range dependencies between the human and the interacted objects across frames. With this graph, STIGPN learns spatio-temporal features directly from whole video-based Human-Object Interaction scenes. Multi-modal features and a multi-stream fusion strategy are used to enhance the reasoning capability of STIGPN. Two Human-Object Interaction video datasets, CAD-120 and Something-Else, are used to evaluate the proposed architectures, and STIGPN achieves state-of-the-art performance on them. Code for STIGPN is available at https://github.com/GuangmingZhu/STIGPN.
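The two relation types described in the abstract can be pictured as two edge sets over per-frame human and object nodes. The following is a minimal sketch, not the authors' implementation: the function name `build_interaction_edges`, the `(frame, index)` node keys, the single human node per frame, and the choice to link the human in one frame to the objects in every other frame are all assumptions made for illustration.

```python
from itertools import product


def build_interaction_edges(num_frames, num_objects):
    """Return (intra, inter) edge sets over nodes keyed as (frame, index).

    Index 0 is the human node; indices 1..num_objects are object nodes.
    Hypothetical layout for illustration only, not the STIGPN code.
    """
    intra, inter = set(), set()
    frames = range(num_frames)
    objects = range(1, num_objects + 1)

    for t, o in product(frames, objects):
        # Intra-frame relation: human and object within the same frame.
        intra.add(((t, 0), (t, o)))

    for t, s, o in product(frames, frames, objects):
        if t != s:
            # Inter-frame relation (assumed form): the human in frame t
            # is linked to object o in a different frame s.
            inter.add(((t, 0), (s, o)))

    return intra, inter


intra_edges, inter_edges = build_interaction_edges(num_frames=3, num_objects=2)
print(len(intra_edges), len(inter_edges))
```

Keeping the two edge sets separate mirrors the distinction the abstract draws between per-frame context and long-range dependencies; how the published model actually weights or parses these edges is not specified here.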

Keywords:

Computer science; ENCODE; Parsing; Artificial intelligence; Object (grammar); Graph; Frame (networking); Computer vision; Spatial relation; Pattern recognition (psychology); Theoretical computer science

Metrics

FWCI (Field Weighted Citation Impact): 2.25

Citation Normalized Percentile: 0.89

Topics

Human Pose and Action Recognition

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Multimodal Machine Learning Applications

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Video Surveillance and Tracking Methods

Physical Sciences → Computer Science → Computer Vision and Pattern Recognition

Related Documents

Exploring Spatio–Temporal Graph Convolution for Video-Based Human–Object Interaction Recognition

STIT: Spatio-Temporal Interaction Transformers for Human-Object Interaction Recognition in Videos

Language-guided graph parsing attention network for human-object interaction recognition

Cascaded Parsing of Human-Object Interaction Recognition

iCGPN: Interaction-centric graph parsing network for human-object interaction detection