Most prominent temporal action localization (TAL) methods are fully supervised and rely heavily on frame-level labels, which can be prohibitively expensive to annotate. Thanks to recent developments in Weakly-supervised Temporal Action Localization (W-TAL), an alternative paradigm that requires only video-level labels in training, this annotation effort can be alleviated. Specifically, we present the Action Coherence Network (ACN) for W-TAL, which features a new coherence loss that better supervises action boundary learning and facilitates proposal regression. In addition, a purpose-built fusion module is proposed for localization inference based on features extracted by a two-stream convolutional neural network. Overall, the proposed ACN achieves state-of-the-art W-TAL performance on two challenging datasets, THUMOS14 and ActivityNet1.2. In particular, ACN attains an mAP of 24.2% on THUMOS14 at an IoU threshold of 0.5, which approaches the performance of some recent fully-supervised TAL methods.
Yuanhao Zhai, Le Wang, Wei Tang, Qilin Zhang, Nanning Zheng, Gang Hua
Yuanbing Zou, Qingjie Zhao, Prodip Kumar Sarker, Le Yang, Binglu Wang
Linjiang Huang, Liang Wang, Hongsheng Li
Haoyi Shen, Jianhua Lyu, Baili Zhang
Wang Luo, Tianzhu Zhang, Wenfei Yang, Jingen Liu, Tao Mei, Feng Wu, Yongdong Zhang