JOURNAL ARTICLE

Dual-Alignment Pre-training for Cross-lingual Sentence Embedding

Abstract

Recent studies have shown that dual encoder models trained with the sentence-level translation ranking task are effective for cross-lingual sentence embedding. However, our research indicates that token-level alignment is also crucial in multilingual scenarios, which has not been fully explored previously. Based on these findings, we propose a dual-alignment pre-training (DAP) framework for cross-lingual sentence embedding that incorporates both sentence-level and token-level alignment. To achieve this, we introduce a novel representation translation learning (RTL) task, in which the model learns to use one-side contextualized token representations to reconstruct their translation counterparts. This reconstruction objective encourages the model to embed translation information into the token representations. Compared with other token-level alignment methods such as translation language modeling, RTL is more suitable for dual encoder architectures and is computationally efficient. Extensive experiments on three sentence-level cross-lingual benchmarks demonstrate that our approach can significantly improve sentence embedding.
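
To make the two objectives concrete, below is a minimal PyTorch sketch of how the sentence-level translation ranking loss and the RTL reconstruction described in the abstract could be combined. It assumes a HuggingFace-style shared multilingual encoder exposing .last_hidden_state; all names (DAPSketch, rtl_head), the MSE reconstruction target, and the naive length truncation used to pair token positions are illustrative simplifications, not the paper's actual implementation.

```python
# Hedged sketch of the two DAP training signals described in the abstract.
# Assumptions (not from the paper's code): HuggingFace-style encoder,
# [CLS] pooling for sentence vectors, MSE as the reconstruction loss,
# and crude truncation to pair source/target token positions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DAPSketch(nn.Module):
    def __init__(self, encoder, hidden=768, rtl_layers=2, nhead=12):
        super().__init__()
        self.encoder = encoder  # shared multilingual encoder used for both sides
        layer = nn.TransformerEncoderLayer(hidden, nhead=nhead, batch_first=True)
        self.rtl_head = nn.TransformerEncoder(layer, num_layers=rtl_layers)

    def embed(self, ids, mask):
        h = self.encoder(input_ids=ids, attention_mask=mask).last_hidden_state
        return h, h[:, 0]  # contextualized token reps, [CLS] sentence embedding

    def forward(self, src_ids, src_mask, tgt_ids, tgt_mask, tau=0.05):
        src_tok, src_sent = self.embed(src_ids, src_mask)
        tgt_tok, tgt_sent = self.embed(tgt_ids, tgt_mask)

        # (1) Sentence-level alignment: in-batch translation ranking.
        # Each sentence must score its own translation above the other
        # in-batch candidates, in both directions.
        sim = F.normalize(src_sent, dim=-1) @ F.normalize(tgt_sent, dim=-1).T / tau
        labels = torch.arange(sim.size(0), device=sim.device)
        loss_rank = 0.5 * (F.cross_entropy(sim, labels) + F.cross_entropy(sim.T, labels))

        # (2) Token-level alignment (RTL, simplified): reconstruct target token
        # representations from source token representations via a small head.
        rec = self.rtl_head(src_tok, src_key_padding_mask=~src_mask.bool())
        T = min(rec.size(1), tgt_tok.size(1))  # crude positional pairing
        valid = (src_mask[:, :T] * tgt_mask[:, :T]).unsqueeze(-1).float()
        diff = (rec[:, :T] - tgt_tok[:, :T].detach()) ** 2 * valid
        loss_rtl = diff.sum() / (valid.sum() * rec.size(-1)).clamp(min=1.0)

        return loss_rank + loss_rtl
```

The in-batch softmax is the standard dual-encoder ranking setup, treating the other translations in the batch as negatives; the detach on the target token representations keeps the reconstruction head from pulling both encoders toward a degenerate shared representation.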

Keywords:
Sentence embedding; Translation; Encoder; Representation; Token; Machine translation; Task

Metrics

Cited by: 0
FWCI (Field-Weighted Citation Impact): 0.00
References: 0
Citation Normalized Percentile: 0.31

Topics

Topic Modeling; Advanced Graph Neural Networks; Sentiment Analysis and Opinion Mining
(all classified under Physical Sciences → Computer Science → Artificial Intelligence)