Muna Almushyti, Frederick W. B. Li
Recognizing human-object interactions is challenging due to their spatio-temporal changes. We propose the SpatioTemporal Interaction Transformer-based (STIT) network to reason about such changes. Specifically, spatial transformers learn human and object context at a specific frame time. A temporal transformer then learns higher-level relations between the spatial context representations at different time steps, capturing long-term dependencies across frames. We further investigate multiple hierarchy designs for learning human interactions. We achieve superior performance on the Charades, Something-Something v1, and CAD-120 datasets, compared to baseline models that do not learn human-object relations, as well as to prior graph-based networks. We also achieve state-of-the-art accuracy of 95.93% on the CAD-120 dataset [1] using RGB data only.
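The two-stage reasoning the abstract describes — per-frame spatial attention over human/object entities, followed by temporal attention over frame-level contexts — can be sketched as below. This is a minimal illustrative sketch built on standard PyTorch transformer modules, not the authors' STIT implementation; the module name, feature dimensions, and mean-pooling step are assumptions for illustration.

```python
import torch
import torch.nn as nn

class SpatioTemporalSketch(nn.Module):
    """Illustrative sketch: spatial transformer per frame, then a
    temporal transformer across frame-level context vectors."""

    def __init__(self, d_model=64, nhead=4):
        super().__init__()
        spatial_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        temporal_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.spatial = nn.TransformerEncoder(spatial_layer, num_layers=1)
        self.temporal = nn.TransformerEncoder(temporal_layer, num_layers=1)

    def forward(self, x):
        # x: (batch, frames, entities, d_model) — per-frame human/object features
        b, t, n, d = x.shape
        # Spatial attention among human/object entities within each frame
        s = self.spatial(x.reshape(b * t, n, d))
        # Pool entities into a single context vector per frame (an assumed choice)
        frame_ctx = s.mean(dim=1).reshape(b, t, d)
        # Temporal attention across frame contexts captures long-term dependencies
        return self.temporal(frame_ctx)

model = SpatioTemporalSketch()
out = model(torch.randn(2, 8, 5, 64))  # 2 clips, 8 frames, 5 entities per frame
print(out.shape)
```

The key design point mirrored here is the hierarchy: entity-level relations are resolved first, so the temporal transformer operates on compact per-frame summaries rather than on every entity at every time step.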