Most recent work on action recognition in video employs action parts, attributes, and similar mid- and high-level features to represent an action. However, these parts and attributes suffer from weak discriminative power and are difficult to obtain. In this paper, we present an approach that uses mid-level discriminative spatial-temporal volumes to recognize human actions. Each spatial-temporal volume is represented by a feature graph built on the collection of local feature points (e.g., cuboids, STIPs) located within it. We first densely sample spatial-temporal volumes from training videos and construct a feature graph for each volume. All feature graphs are then clustered with a spectral clustering method, where the distance between two feature graphs is computed by an efficient spectral method. Treating the resulting feature-graph clusters as video words, we characterize each video within the bag-of-features framework, which we call the bag-of-feature-graphs framework. Final recognition is performed with a linear SVM classifier. We evaluate our algorithm on a publicly available human action dataset, and the experimental results demonstrate the effectiveness of our method.
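The pipeline above can be illustrated with a minimal sketch. This is not the authors' implementation: the Gaussian affinity, the sorted-Laplacian-eigenvalue graph signature (a common stand-in for spectral graph comparison), the use of k-means over those signatures in place of full spectral clustering, and all names and parameters (`graph_signature`, `video_histogram`, the synthetic data) are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import LinearSVC

def graph_signature(points, k=4):
    """Spectral signature of a feature graph: the k smallest eigenvalues of
    the graph Laplacian built from pairwise distances of the volume's
    feature points (an assumed, simplified spectral representation)."""
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    w = np.exp(-d ** 2)              # Gaussian affinity (assumed kernel)
    np.fill_diagonal(w, 0.0)
    lap = np.diag(w.sum(axis=1)) - w  # unnormalized graph Laplacian
    eig = np.sort(np.linalg.eigvalsh(lap))
    return eig[:k]

def video_histogram(volumes, codebook):
    """Quantize each volume's feature graph to a video word and return the
    normalized bag-of-feature-graphs histogram for the video."""
    sigs = np.array([graph_signature(v) for v in volumes])
    words = codebook.predict(sigs)
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)

# Synthetic stand-in data: each "video" is a list of densely sampled volumes,
# each volume a set of feature points in (x, y, t); the two classes differ
# only in the spatial spread of their feature points.
rng = np.random.default_rng(0)
def make_video(scale):
    return [rng.normal(scale=scale, size=(6, 3)) for _ in range(10)]

videos = [make_video(1.0) for _ in range(5)] + [make_video(3.0) for _ in range(5)]
labels = np.array([0] * 5 + [1] * 5)

# Cluster all feature-graph signatures into video words (k-means here,
# substituting for the paper's spectral clustering).
all_sigs = np.array([graph_signature(v) for vid in videos for v in vid])
codebook = KMeans(n_clusters=8, n_init=10, random_state=0).fit(all_sigs)

# Bag-of-feature-graphs histogram per video, then a linear SVM.
X = np.array([video_histogram(vid, codebook) for vid in videos])
clf = LinearSVC().fit(X, labels)
```

Representing each graph by its Laplacian spectrum keeps the comparison permutation-invariant across volumes with differently ordered feature points, which is one reason spectral methods are a natural fit for matching feature graphs.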