CONFERENCE PAPER

STIT: Spatio-Temporal Interaction Transformers for Human-Object Interaction Recognition in Videos

Muna Almushyti, Frederick W. B. Li

Year: 2022   Venue: 2022 26th International Conference on Pattern Recognition (ICPR)   Pages: 3287-3294

Abstract

Recognizing human-object interactions is challenging due to their spatio-temporal changes. We propose the Spatio-Temporal Interaction Transformer-based (STIT) network to reason about such changes. Specifically, spatial transformers learn human and object context within individual frames. A temporal transformer then learns higher-level relations between the spatial context representations at different time steps, capturing long-term dependencies across frames. We further investigate multiple hierarchy designs for learning human interactions. We achieved superior performance on the Charades, Something-Something v1 and CAD-120 datasets compared to baseline models that do not learn human-object relations, as well as to prior graph-based networks. We also achieved state-of-the-art accuracy of 95.93% on the CAD-120 dataset [1] using RGB data only.
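The factorized order described in the abstract (spatial attention among human/object tokens within each frame, then temporal attention across per-frame context vectors) can be illustrated with a minimal, framework-free sketch. This is not the authors' implementation: the `attention` helper below omits learned projections, multi-head structure and feed-forward blocks, and all names and dimensions are illustrative.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attention(tokens):
    """Plain scaled dot-product self-attention over a list of feature
    vectors. A toy stand-in for one transformer layer (no learned
    query/key/value projections)."""
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in tokens]
        w = softmax(scores)
        out.append([sum(wj * v[i] for wj, v in zip(w, tokens))
                    for i in range(d)])
    return out

# Toy video: T frames, each with N entity tokens (human + objects) of dim d.
T, N, d = 4, 3, 8
video = [[[((t * N + n + i) % 5) / 5.0 for i in range(d)]
          for n in range(N)] for t in range(T)]

# Stage 1: spatial transformer -- relate human/object tokens within each
# frame, then mean-pool them into one per-frame context vector.
frame_ctx = []
for frame in video:
    attended = attention(frame)
    frame_ctx.append([sum(tok[i] for tok in attended) / N for i in range(d)])

# Stage 2: temporal transformer -- relate the per-frame context vectors
# across time, capturing longer-range dependencies over the clip.
clip = attention(frame_ctx)  # T context vectors of dimension d
```

The two `attention` calls mirror the paper's hierarchy: intra-frame relations are resolved first, and only their pooled summaries are exchanged across time.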

Keywords:
Computer science, Transformer, Artificial intelligence, RGB color model, Spatial context, Graph, Context model, Pattern recognition, Machine learning, Computer vision, Engineering

Metrics

Cited By: 0
FWCI (Field-Weighted Citation Impact): 0.00
Refs: 84
Citation Normalized Percentile: 0.16

Topics (Physical Sciences → Computer Science → Computer Vision and Pattern Recognition)

Human Pose and Action Recognition
Multimodal Machine Learning Applications
Video Surveillance and Tracking Methods