Qiong Hu, Lei Qin, Qingming Huang, Shuqiang Jiang, Qi Tian
Spatial-temporal local features and the bag-of-words representation have been widely used in action recognition. However, this framework usually neglects the internal spatial-temporal relations between video-words, resulting in ambiguity in the action recognition task, especially for videos "in the wild". In this paper, we address this problem by utilizing the volumetric context around a video-word. A local histogram of the video-word distribution is calculated, which is referred to as the "context" and further clustered into contextual words. To effectively use the contextual information, spatial-temporal descriptive video-phrases (ST-DVPs) and descriptive video-cliques (ST-DVCs) are proposed. A general framework for ST-DVP and ST-DVC generation is described, and action recognition can then be performed based on all these representations and their combinations. The proposed method is evaluated on two challenging human action datasets: the KTH dataset and the YouTube dataset. Experimental results confirm the validity of our approach.
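The contextual-word construction described above can be sketched in a few lines: for each interest point, accumulate a local histogram of video-word labels within a volumetric (x, y, t) neighborhood, then cluster those histograms into contextual words. This is a minimal illustration, not the paper's implementation; the function name, the spherical neighborhood, and the plain k-means step are assumptions for clarity.

```python
import numpy as np

def contextual_words(points, labels, vocab_size, radius, n_context,
                     n_iter=20, seed=0):
    """Sketch of contextual-word generation (illustrative, not the
    authors' code): build a local histogram of video-word labels
    around each interest point, then k-means-cluster the histograms."""
    points = np.asarray(points, dtype=float)   # (N, 3): x, y, t coordinates
    labels = np.asarray(labels)                # (N,) video-word index per point
    # "Context": local histogram of video-words within a volumetric radius
    hists = np.zeros((len(points), vocab_size))
    for i, p in enumerate(points):
        near = np.linalg.norm(points - p, axis=1) <= radius
        hists[i] = np.bincount(labels[near], minlength=vocab_size)
    hists /= np.maximum(hists.sum(axis=1, keepdims=True), 1)  # normalize
    # Cluster the context histograms into contextual words (plain k-means)
    rng = np.random.default_rng(seed)
    centers = hists[rng.choice(len(hists), n_context, replace=False)]
    for _ in range(n_iter):
        dists = ((hists[:, None, :] - centers[None]) ** 2).sum(-1)
        assign = dists.argmin(axis=1)
        for k in range(n_context):
            if (assign == k).any():
                centers[k] = hists[assign == k].mean(axis=0)
    return assign, centers
```

Each interest point thus receives a contextual-word label alongside its original video-word label; pairings of the two are what representations such as ST-DVPs build on.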