Guozhang Li, Jie Li, Nannan Wang, Xinpeng Ding, Zhifeng Li, Xinbo Gao
Weakly Supervised Temporal Action Localization (WTAL) aims to localize action segments in untrimmed videos using only video-level category labels during training. In WTAL, an action generally consists of a series of sub-actions, and different action categories may share common sub-actions. However, to distinguish action categories with only video-level class labels, current WTAL models tend to focus on the discriminative sub-actions of an action and ignore the common sub-actions it shares with other categories. Neglecting these common sub-actions leaves the localized action segments incomplete, i.e., containing only the discriminative sub-actions. Unlike current approaches that design complex network architectures to capture more complete actions, in this paper we introduce a novel supervision method named multi-hierarchical category supervision (MHCS) to find more sub-actions rather than only the discriminative ones. Specifically, action categories that share similar sub-actions are grouped into super-classes through hierarchical clustering. Training with the newly generated super-classes encourages the model to pay more attention to the common sub-actions, which are ignored when training with the original classes alone. Furthermore, the proposed MHCS is model-agnostic and non-intrusive, so it can be applied directly to existing methods without changing their structures. Through extensive experiments, we verify that our supervision method improves the performance of four state-of-the-art WTAL methods on three public datasets: THUMOS14, ActivityNet1.2, and ActivityNet1.3.
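The super-class construction described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say which features or linkage rule are used, so the per-class prototype vectors, the cosine similarity, the average-linkage merging, and the names `build_super_classes` and `to_super_labels` are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def build_super_classes(prototypes, num_super):
    """Group classes into `num_super` super-classes (hypothetical helper).

    Uses average-linkage agglomerative clustering over per-class prototype
    vectors; the choice of linkage and similarity is an assumption.
    """
    clusters = [[name] for name in sorted(prototypes)]
    while len(clusters) > num_super:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sims = [cosine(prototypes[a], prototypes[b])
                        for a in clusters[i] for b in clusters[j]]
                score = sum(sims) / len(sims)
                if best is None or score > best[0]:
                    best = (score, i, j)
        _, i, j = best
        # Merge the two most similar clusters.
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return {name: idx for idx, members in enumerate(clusters) for name in members}


def to_super_labels(video_labels, class_to_super):
    """Map a video's original class labels to super-class indices."""
    return sorted({class_to_super[c] for c in video_labels})


# Made-up prototypes: jumping classes share two dimensions, diving classes one.
prototypes = {
    "HighJump": [1.0, 0.9, 0.1],
    "LongJump": [0.9, 1.0, 0.2],
    "Diving": [0.1, 0.2, 1.0],
    "CliffDiving": [0.2, 0.1, 0.9],
}
class_to_super = build_super_classes(prototypes, num_super=2)
super_labels = to_super_labels(["HighJump", "LongJump"], class_to_super)
```

Calling `build_super_classes` with several values of `num_super` would give one label set per hierarchy level, each of which could supervise an additional video-level classification loss alongside the original class labels.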
Chen Ju, Peisen Zhao, Siheng Chen, Ya Zhang, Xiaoyun Zhang, Yanfeng Wang, Qi Tian
Mamshad Nayeem Rizve, Gaurav Mittal, Ye Yu, Matt Hall, Sandra Sajeev, Mubarak Shah, Mei Chen
Xin Hu, Kai Li, Deep Patel, Erik Kruus, Martin Renqiang Min, Zhengming Ding
Mengxue Liu, Wen J. Li, Fangzhen Ge, Xiangjun Gao