In this paper we propose an alternative to the widely used Bag-of-Features (BoF) model for representing and automatically recognizing behaviors or actions in video sequences from sets of local spatio-temporal features. Instead of building histograms of visual words, the proposed framework represents the set of local spatio-temporal features extracted from each video as a low-dimensional linear subspace; these subspaces are then orthogonalized across classes to enhance their discriminability. Similarity between videos is measured with Grassmann kernels defined on the subspaces of spatio-temporal features. Experimental results on a publicly available video dataset for classifying rodent behavior demonstrate the effectiveness of the proposed framework.
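To make the subspace representation concrete, the following is a minimal sketch rather than the implementation used in the paper. It assumes that each video's subspace is spanned by the leading left singular vectors of its feature matrix, and that the Grassmann kernel is the projection kernel $k(Y_1, Y_2) = \lVert Y_1^\top Y_2 \rVert_F^2$; the abstract does not state which Grassmann kernel is used, so this choice is an assumption. The function names, the subspace dimension, and the random stand-in descriptors are hypothetical, and the cross-class orthogonalization step is not shown.

```python
import numpy as np


def subspace_basis(features, dim):
    """Return an orthonormal basis (d x dim) for the dominant subspace.

    features: array of shape (n_features, d), one local descriptor per row.
    Assumption: the subspace is taken from a truncated SVD of the features.
    """
    # Left singular vectors of the d x n matrix, ordered by singular value.
    u, _, _ = np.linalg.svd(features.T, full_matrices=False)
    return u[:, :dim]


def projection_kernel(basis_a, basis_b):
    """Projection kernel ||A^T B||_F^2 between two orthonormal bases."""
    return np.linalg.norm(basis_a.T @ basis_b, ord="fro") ** 2


# Hypothetical stand-in data: random 30-dimensional descriptors for two videos.
rng = np.random.default_rng(0)
video_a = rng.normal(size=(200, 30))
video_b = rng.normal(size=(150, 30))

basis_a = subspace_basis(video_a, dim=5)
basis_b = subspace_basis(video_b, dim=5)

k_ab = projection_kernel(basis_a, basis_b)  # similarity between the two videos
k_aa = projection_kernel(basis_a, basis_a)  # self-similarity equals dim
```

A kernel matrix built this way over all training videos could be passed to any kernel classifier that accepts precomputed kernels. Because the kernel depends only on the span of each basis, videos with different numbers of local features are compared on equal footing.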