DISSERTATION

Data-Efficient Bilingual Lexicon Induction with Pretrained Language Models

Li, Yaoyiran

Year: 2024
Repository: Apollo (University of Cambridge)
Publisher: University of Cambridge

Abstract

Bilingual dictionaries are essential language resources that play a crucial role in the development of modern multilingual and cross-lingual natural language processing (NLP) systems, particularly for resource-lean languages. Although more than 7,000 languages are spoken worldwide, existing bilingual dictionaries are limited in both quality and quantity. This thesis focuses on the task of Bilingual Lexicon Induction (BLI) and proposes a series of data-efficient BLI approaches aimed at automatically inducing high-quality bilingual dictionaries in low-data scenarios, thereby bridging the lexical gap between languages. While previous BLI methods rely on mapping static word embeddings, inspired by the paradigm shift towards pretrained language models (PLMs), we investigate leveraging PLMs for BLI. First, we propose a two-stage contrastive learning framework that combines cross-lingual word embeddings (CLWEs) mapped from static embeddings with those extracted from PLMs, both learned with contrastive learning (Chapter 3). Second, we put forth a retrieve-and-rerank approach in which we first use any precalculated CLWEs to retrieve a small set of candidate translations and then leverage PLMs as cross-encoder rerankers (Chapter 4). Third, we investigate whether it is possible to prompt autoregressive large language models (LLMs) for BLI, which departs entirely from traditional mapping-based approaches; we further employ retrieval-augmented in-context learning (ICL) to boost performance, and propose inducing a self-augmented high-confidence dictionary to be used in an ICL fashion for the unsupervised BLI task (Chapter 5). These three studies demonstrate the effectiveness of utilising PLMs for BLI, progressively pushing its boundaries and establishing robust new state-of-the-art performance.
Finally, recognising the usefulness of BLI in neural machine translation (NMT), as indicated by related work, we propose an NMT-enhanced, parameter-efficient cross-lingual transfer learning framework for multilingual text-to-image generation (Chapter 6). In this application-oriented study, we demonstrate that translation-based approaches, again focused on low-data setups, can yield strong cross-lingual transfer in multilingual text-to-image generation, a task previously limited to English-only monolingual settings.
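The retrieve-and-rerank approach described above (Chapter 4) can be sketched as two generic steps: nearest-neighbour retrieval over precalculated CLWEs, followed by reranking with a pairwise scorer. This is a minimal illustration, not the thesis's implementation: the embeddings and words are toy data, and `score_pair` is a hypothetical stand-in for the PLM cross-encoder that the thesis uses to score source-target word pairs.

```python
import numpy as np

def retrieve_candidates(src_vec, tgt_embs, tgt_words, k=5):
    """Step 1: retrieve the k nearest target words by cosine similarity
    over precalculated cross-lingual word embeddings (CLWEs)."""
    src = src_vec / np.linalg.norm(src_vec)
    tgt = tgt_embs / np.linalg.norm(tgt_embs, axis=1, keepdims=True)
    sims = tgt @ src                      # cosine similarity per target word
    top = np.argsort(-sims)[:k]           # indices of the k highest scores
    return [(tgt_words[i], float(sims[i])) for i in top]

def rerank(src_word, candidates, score_pair):
    """Step 2: rerank the retrieved shortlist with a pairwise scorer;
    in the thesis this role is played by a PLM cross-encoder."""
    return sorted(candidates,
                  key=lambda cand: score_pair(src_word, cand[0]),
                  reverse=True)

# Toy usage: three orthogonal target embeddings, a source vector closest
# to the first, and a dummy scorer that happens to prefer "casa".
tgt_embs = np.eye(3)
tgt_words = ["house", "casa", "maison"]
shortlist = retrieve_candidates(np.array([1.0, 0.6, 0.0]), tgt_embs, tgt_words, k=2)
best = rerank("house_src", shortlist,
              lambda s, t: 1.0 if t == "casa" else 0.0)[0][0]
```

The design point this illustrates is that retrieval keeps the expensive scorer's workload small: the cross-encoder only scores the shortlist, not the full target vocabulary.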

Keywords:
Bilingual lexicon induction; cross-lingual word embeddings; pretrained language models; contrastive learning; in-context learning; machine translation; text-to-image generation


Topics

Natural Language Processing Techniques; Topic Modeling; Text and Document Classification Technologies
(Physical Sciences → Computer Science → Artificial Intelligence)

