Ziheng Li, Shaohan Huang, Zihan Zhang, Zhihong Deng, Qiang Lou, Haizhen Huang, Jian Jiao, Furu Wei, Weiwei Deng, Qi Zhang
Recent studies have shown that dual encoder models trained with the sentence-level translation ranking task are effective for cross-lingual sentence embedding. However, our research indicates that token-level alignment is also crucial in multilingual scenarios, and it has not been fully explored previously. Based on our findings, we propose a dual-alignment pre-training (DAP) framework for cross-lingual sentence embedding that incorporates both sentence-level and token-level alignment. To achieve this, we introduce a novel representation translation learning (RTL) task, in which the model learns to use one side's contextualized token representations to reconstruct their translation counterparts. This reconstruction objective encourages the model to embed translation information into the token representations. Compared with other token-level alignment methods such as translation language modeling, RTL is more suitable for dual encoder architectures and is computationally efficient. Extensive experiments on three sentence-level cross-lingual benchmarks demonstrate that our approach significantly improves sentence embedding.
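The two alignment objectives described in the abstract can be illustrated with a minimal PyTorch sketch. Everything below is an assumption-based illustration of the general technique, not the authors' implementation: the temperature value, the shallow transformer reconstructor, the MSE reconstruction loss, and the naive length truncation are all illustrative choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def translation_ranking_loss(src_emb, tgt_emb, temperature=0.05):
    """Sentence-level alignment: in-batch translation ranking loss.

    src_emb, tgt_emb: (batch, hidden) pooled sentence embeddings of
    parallel sentence pairs. Each source sentence must rank its own
    translation above all other in-batch targets. The temperature
    value is an assumption for illustration.
    """
    sims = F.normalize(src_emb, dim=-1) @ F.normalize(tgt_emb, dim=-1).T
    labels = torch.arange(src_emb.size(0), device=src_emb.device)
    return F.cross_entropy(sims / temperature, labels)


class RTLHead(nn.Module):
    """Token-level alignment: representation translation learning (RTL).

    Reconstructs target-side contextualized token representations from
    source-side ones, pushing translation information into the tokens.
    The 2-layer transformer reconstructor and MSE loss are illustrative
    choices; the paper's exact design may differ.
    """

    def __init__(self, hidden_size=768, num_layers=2, num_heads=8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=hidden_size, nhead=num_heads, batch_first=True
        )
        self.reconstructor = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, src_tokens, tgt_tokens):
        # src_tokens: (batch, src_len, hidden); tgt_tokens: (batch, tgt_len, hidden)
        pred = self.reconstructor(src_tokens)
        # Naive length alignment by truncation, for illustration only.
        n = min(pred.size(1), tgt_tokens.size(1))
        # Detaching the target treats it as a fixed reconstruction label,
        # so the head learns to translate representations rather than
        # letting both sides collapse toward each other (an assumption).
        return F.mse_loss(pred[:, :n], tgt_tokens[:, :n].detach())
```

In a DAP-style setup, the sentence-level ranking loss and the token-level RTL loss would be computed from the same dual encoder forward pass over a parallel batch and summed into a single pre-training objective.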