Multimodal machine translation (MMT), which mainly focuses on enhancing text-only translation with visual features, has attracted considerable attention from both the computer vision and natural language processing communities. Most current MMT models resort to attention mechanisms, global context modeling, or multimodal joint representation learning to utilize visual features. However, the attention mechanism lacks sufficient semantic interaction between modalities, while the other two provide a fixed visual context, which is unsuitable for modeling the variability observed when generating a translation. To address these issues, we propose a novel Dynamic Context-guided Capsule Network (DCCN) for MMT. Specifically, at each decoding timestep, we first employ conventional source-target attention to produce a timestep-specific source-side context vector. Next, DCCN takes this vector as input and uses it to guide the iterative extraction of related visual features via a context-guided dynamic routing mechanism. In particular, because we represent the input image with both global and regional visual features, we introduce two parallel DCCNs to model multimodal context vectors with visual features at different granularities. Finally, we obtain two multimodal context vectors, which are fused and incorporated into the decoder to predict the target word. Experimental results on the Multi30K dataset for English-to-German and English-to-French translation demonstrate the superiority of DCCN. Our code is available at https://github.com/DeepLearnXMU/MM-DCCN.
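The routing step described above can be illustrated with a minimal sketch. This is not the released DCCN implementation: it is a simplified, single-output variant of routing-by-agreement in which the coupling coefficients are normalised over image regions and the source-side context vector is added to the agreement term. The function names (`context_guided_routing`, `squash`), the plain dot-product guidance, and the default of three iterations are all assumptions made for illustration.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def squash(v):
    """Capsule non-linearity: keeps the direction, maps the norm into [0, 1)."""
    sq = dot(v, v)
    if sq == 0.0:
        return [0.0] * len(v)
    scale = sq / (1.0 + sq) / math.sqrt(sq)
    return [scale * x for x in v]


def context_guided_routing(visual_feats, context, num_iters=3):
    """Hypothetical sketch of routing guided by a source-side context vector.

    visual_feats: list of d-dimensional vectors (one per image region).
    context:      d-dimensional context vector for the current timestep.
    Returns the pooled output vector and the final coupling coefficients.
    """
    logits = [0.0] * len(visual_feats)  # start from uniform routing
    for _ in range(num_iters):
        coupling = softmax(logits)
        # Weighted sum of the visual features under the current coupling.
        pooled = [
            sum(c * u[k] for c, u in zip(coupling, visual_feats))
            for k in range(len(context))
        ]
        output = squash(pooled)
        # Assumption: agreement with the output plus similarity to the context.
        logits = [
            b + dot(u, output) + dot(u, context)
            for b, u in zip(logits, visual_feats)
        ]
    return output, coupling


if __name__ == "__main__":
    regions = [[1.0, 0.0], [0.0, 1.0]]  # two toy "regional" features
    ctx = [1.0, 0.0]                    # toy timestep-specific context
    vec, weights = context_guided_routing(regions, ctx)
    print(vec, weights)
```

In this toy setup the region aligned with the context vector receives the larger coupling weight after a few iterations. The abstract describes two such modules running in parallel, one over global and one over regional features, with their outputs fused before the decoder; the sketch covers only a single module and omits the fusion.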
Xiayang Shi, Jiaqi Yuan, Yuanyuan Huang, Zhenqiang Yu, Pei Cheng, Xinyi Liu
Yuting Zhao, Mamoru Komachi, Tomoyuki Kajiwara, Chenhui Chu