The attention mechanism has been widely applied to the temporal task of video captioning and has shown promising improvements. However, in the decoding stage, some words are visual words with corresponding canonical visual signals, while others, such as "a" and "the", are non-visual words that require textual rather than visual information. Simply imposing attention on non-visual words may therefore mislead the decoder and degrade the overall performance of video captioning. To tackle this issue, we propose a Hierarchical Multi-Attention Model (HMAM), which uses two independent attention mechanisms to perform soft selection over frame features and video attributes respectively, and then integrates a further attention model that automatically decides when to rely on visual signals and when to rely on textual information. Experiments on the benchmark MSVD dataset demonstrate that our method, using only a single feature, achieves comparable or even better results than state-of-the-art video captioning baselines.
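As a rough illustration of the hierarchy described above, the following PyTorch-style sketch wires two soft-attention heads (over frame features and video attributes) into a third gating attention that trades off the fused visual context against the decoder's textual hidden state. All module names, layer shapes, and the additive fusion are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HierarchicalMultiAttention(nn.Module):
    """Minimal sketch of a hierarchical multi-attention step.

    Two independent soft-attention heads attend over frame features and
    video attributes; a sigmoid gate then decides how much to rely on the
    fused visual context versus the decoder's textual hidden state.
    Names and dimensions are hypothetical.
    """

    def __init__(self, frame_dim, attr_dim, hidden_dim):
        super().__init__()
        self.frame_score = nn.Linear(frame_dim + hidden_dim, 1)
        self.attr_score = nn.Linear(attr_dim + hidden_dim, 1)
        self.frame_proj = nn.Linear(frame_dim, hidden_dim)
        self.attr_proj = nn.Linear(attr_dim, hidden_dim)
        self.gate = nn.Linear(hidden_dim * 2, 1)  # visual-vs-text gate

    def soft_attend(self, feats, h, score_layer):
        # feats: (B, N, D) candidate vectors; h: (B, H) decoder state.
        h_exp = h.unsqueeze(1).expand(-1, feats.size(1), -1)
        scores = score_layer(torch.cat([feats, h_exp], dim=-1)).squeeze(-1)
        alpha = F.softmax(scores, dim=-1)  # soft selection over N items
        return torch.bmm(alpha.unsqueeze(1), feats).squeeze(1)  # (B, D)

    def forward(self, frames, attrs, h):
        # Independent soft selection over frame features and attributes.
        v_frame = self.frame_proj(self.soft_attend(frames, h, self.frame_score))
        v_attr = self.attr_proj(self.soft_attend(attrs, h, self.attr_score))
        visual = v_frame + v_attr  # fused visual context (assumed additive)
        # beta -> 1: rely on visual signals (visual words);
        # beta -> 0: rely on the textual hidden state (non-visual words).
        beta = torch.sigmoid(self.gate(torch.cat([visual, h], dim=-1)))
        return beta * visual + (1 - beta) * h
```

At each decoding step the returned context would feed the word predictor, so that function words like "a" and "the" can be generated from the textual state while content words draw on the attended visual evidence.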