Action recognition is a representative perception task for robotic applications, but it remains challenging because of complex temporal dynamics. Although the temporal shift module (TSM) is considered one of the best 2D CNN-based architectures for temporal modeling, its structural simplicity limits its performance and leaves room for improvement. To address this limitation while following TSM's philosophy, this paper presents a variant of TSM, termed Discriminative TSM (D-TSM), which focuses on capturing discriminative features of motion patterns. Going beyond the naive shift operation in TSM, D-TSM explicitly transforms the shifted features by applying element-wise subtraction. This simple approach effectively creates discriminative features between adjacent frames at a small extra computational cost and with no additional parameters. Experiments on the Something-Something and Jester datasets demonstrate that D-TSM outperforms TSM and achieves competitive performance against other methods at low FLOPs.
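The contrast between a plain temporal shift and a subtraction-based shift can be illustrated with a minimal NumPy sketch. The abstract does not give the exact formulation, so everything below is an assumption made for illustration: the function names `temporal_shift` and `discriminative_shift`, the `(T, C, H, W)` feature layout, the `fold_div=8` channel split, the zero padding at clip boundaries, and the choice of "neighbor minus current" as the difference. It should not be read as the authors' implementation.

```python
import numpy as np


def temporal_shift(x, fold_div=8):
    """Plain TSM-style shift on features of shape (T, C, H, W).

    Assumed layout: the first C // fold_div channels are taken from the
    next frame, the following C // fold_div from the previous frame, and
    the rest are left untouched. Clip boundaries are zero-padded.
    """
    fold = x.shape[1] // fold_div
    out = np.zeros_like(x)
    out[:-1, :fold] = x[1:, :fold]                   # from the next frame
    out[1:, fold:2 * fold] = x[:-1, fold:2 * fold]   # from the previous frame
    out[:, 2 * fold:] = x[:, 2 * fold:]              # unshifted channels
    return out


def discriminative_shift(x, fold_div=8):
    """Hypothetical D-TSM-style variant: shifted channels hold the
    element-wise difference (neighbor frame minus current frame)
    instead of the raw neighbor features. No learnable parameters.
    """
    fold = x.shape[1] // fold_div
    out = x.copy()
    out[:, :2 * fold] = 0  # boundary frames keep zeros in shifted channels
    out[:-1, :fold] = x[1:, :fold] - x[:-1, :fold]
    out[1:, fold:2 * fold] = x[:-1, fold:2 * fold] - x[1:, fold:2 * fold]
    return out


# Toy clip: 3 frames, 8 channels, 1x1 spatial size.
clip = np.arange(24, dtype=np.float32).reshape(3, 8, 1, 1)
shifted = temporal_shift(clip)
diffed = discriminative_shift(clip)
```

Under these assumptions, a clip whose frames are identical yields all zeros in the shifted channels, so those channels respond only to change between adjacent frames. The only added cost over the plain shift is one subtraction per shifted element.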
Anh-Kiet Duong, Petra Gomez‐Krämer
Zhaoqilin Yang, Gaoyun An, Ruichen Zhang
Kunpeng Zhang, M Lyu, Xinxin Guo, Liye Zhang, Cong Liu
Mohan Singh Aditya, Sowmya Rasipuram, Anutosh Maitra, Baran Pouyan Maziyar