DISSERTATION

Data-Efficient Bilingual Lexicon Induction with Pretrained Language Models

Li, Yaoyiran

Year: 2024
Repository: Apollo (University of Cambridge)
Publisher: University of Cambridge

Abstract

Bilingual dictionaries are essential language resources that play a crucial role in the development of modern multilingual and cross-lingual natural language processing (NLP) systems, particularly for resource-lean languages. Although more than 7,000 languages are spoken worldwide, existing bilingual dictionaries are limited in both quality and quantity. This thesis focuses on the task of Bilingual Lexicon Induction (BLI) and proposes a series of data-efficient BLI approaches aimed at automatically inducing high-quality bilingual dictionaries in low-data scenarios, thereby bridging the lexical gap between languages. While previous BLI methods rely on mapping static word embeddings, inspired by the paradigm shift towards pretrained language models (PLMs), we investigate leveraging PLMs for BLI. First, we propose a two-stage contrastive learning framework that combines cross-lingual word embeddings (CLWEs) mapped from static embeddings with those extracted from PLMs, both learned with contrastive learning (Chapter 3). Second, we put forth a retrieve-and-rerank approach in which we first use any precalculated CLWEs to retrieve a small set of candidate translations and then leverage PLMs as cross-encoder rerankers (Chapter 4). Third, we investigate whether it is possible to prompt autoregressive large language models (LLMs) for BLI, which departs entirely from traditional mapping-based approaches; we further employ retrieval-augmented in-context learning (ICL) to boost performance, and propose inducing a self-augmented high-confidence dictionary to be used in an ICL fashion for the unsupervised BLI task (Chapter 5). These three studies demonstrate the effectiveness of utilising PLMs for BLI, progressively pushing its boundaries and establishing robust new state-of-the-art performance.
Finally, recognising the usefulness of BLI in neural machine translation (NMT), as indicated by related work, we propose an NMT-enhanced, parameter-efficient cross-lingual transfer learning framework for multilingual text-to-image generation (Chapter 6). In this application-oriented study, we demonstrate that translation-based approaches, again focused on low-data setups, can yield strong cross-lingual transfer in multilingual text-to-image generation, a task previously limited to English-only monolingual settings.
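The retrieve-and-rerank approach described above (Chapter 4) can be sketched as two generic steps: nearest-neighbour retrieval over precalculated CLWEs, followed by reranking with a pairwise scorer. This is a minimal illustration, not the thesis's implementation: the embeddings and words are toy data, and `score_pair` is a hypothetical stand-in for the PLM cross-encoder that the thesis uses to score source-target word pairs.

```python
import numpy as np

def retrieve_candidates(src_vec, tgt_embs, tgt_words, k=5):
    """Step 1: retrieve the k nearest target words by cosine similarity
    over precalculated cross-lingual word embeddings (CLWEs)."""
    src = src_vec / np.linalg.norm(src_vec)
    tgt = tgt_embs / np.linalg.norm(tgt_embs, axis=1, keepdims=True)
    sims = tgt @ src                      # cosine similarity per target word
    top = np.argsort(-sims)[:k]           # indices of the k highest scores
    return [(tgt_words[i], float(sims[i])) for i in top]

def rerank(src_word, candidates, score_pair):
    """Step 2: rerank the retrieved shortlist with a pairwise scorer;
    in the thesis this role is played by a PLM cross-encoder."""
    return sorted(candidates,
                  key=lambda cand: score_pair(src_word, cand[0]),
                  reverse=True)

# Toy usage: three orthogonal target embeddings, a source vector closest
# to the first, and a dummy scorer that happens to prefer "casa".
tgt_embs = np.eye(3)
tgt_words = ["house", "casa", "maison"]
shortlist = retrieve_candidates(np.array([1.0, 0.6, 0.0]), tgt_embs, tgt_words, k=2)
best = rerank("house_src", shortlist,
              lambda s, t: 1.0 if t == "casa" else 0.0)[0][0]
```

The design point this illustrates is that retrieval keeps the expensive scorer's workload small: the cross-encoder only scores the shortlist, not the full target vocabulary.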

Keywords:
Bilingual lexicon induction; cross-lingual word embeddings; pretrained language models; contrastive learning; in-context learning; machine translation; text-to-image generation


Topics

Natural Language Processing Techniques; Topic Modeling; Text and Document Classification Technologies
(Physical Sciences → Computer Science → Artificial Intelligence)

