Statistical Extraction and Comparison of Pivot Words for Bilingual Lexicon Extension

Daniel Andrade; Takuya Matsuzaki; Jun’ichi Tsujii

doi:10.1145/2184436.2184439

JOURNAL ARTICLE

Year: 2012; Journal: ACM Transactions on Asian Language Information Processing; Vol: 11 (2); Pages: 1-31; Publisher: Association for Computing Machinery

Abstract

Bilingual dictionaries can be automatically extended with new translations using comparable corpora. The general idea rests on the assumption that similar words have similar contexts across languages. However, previous studies have mainly focused on Indo-European languages, or have used only a bag-of-words model to describe the context. Furthermore, we argue that it is helpful to extract only the statistically significant context, instead of using all of the context. The present approach addresses these issues in the following manner. First, based on the context of a word with an unknown translation (the query word), we extract salient pivot words. Pivot words are words for which a translation is already available in a bilingual dictionary. To extract salient pivot words, we use a Bayesian estimation of the point-wise mutual information to measure statistical significance. In the second step, we match these pivot words across languages to identify translation candidates for the query word. To this end, we calculate a similarity score between the query word and a translation candidate using the probability that the same pivots will be extracted for both the query word and the translation candidate. The proposed method uses several context positions, namely, a bag-of-words of one sentence, and the successors, predecessors, and siblings with respect to the dependency parse tree of the sentence. To make these context positions comparable across Japanese and English, which are unrelated languages, we use several heuristics to adjust the dependency trees appropriately. We demonstrate that the proposed method significantly increases the accuracy of word translations compared to previous methods.
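
The two steps described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the additive smoothing below is a simple stand-in for the Bayesian estimation of point-wise mutual information, the Jaccard overlap is a stand-in for the probability-based similarity score, and only a sentence-level bag-of-words context is modeled. All function names, thresholds, words, and counts are hypothetical toy values chosen for illustration.

```python
import math

def smoothed_pmi(n_xy, n_x, n_y, n, alpha=1.0):
    """PMI from sentence counts, adding alpha to each cell of the 2x2 table.

    Assumption: additive smoothing stands in for the paper's Bayesian estimate.
    """
    total = n + 4 * alpha
    p_xy = (n_xy + alpha) / total
    p_x = (n_x + 2 * alpha) / total
    p_y = (n_y + 2 * alpha) / total
    return math.log(p_xy / (p_x * p_y))

def salient_pivots(cooc, n_word, pivot_counts, n, threshold=1.0):
    """Step 1: keep pivots whose smoothed PMI with the word exceeds threshold."""
    return {p for p, n_xy in cooc.items()
            if smoothed_pmi(n_xy, n_word, pivot_counts[p], n) > threshold}

def pivot_overlap(query_pivots, cand_pivots, dictionary):
    """Step 2: translate the query's pivots and compare with a candidate's.

    Assumption: Jaccard overlap replaces the paper's probabilistic score.
    """
    translated = {dictionary[p] for p in query_pivots if p in dictionary}
    union = translated | cand_pivots
    return len(translated & cand_pivots) / len(union) if union else 0.0

# Toy data (hypothetical): a Japanese query word seen in 50 of 1000 sentences.
N = 1000
ja_pivot_counts = {"unten": 100, "suru": 500}  # sentences containing each pivot
ja_cooc = {"unten": 30, "suru": 25}            # sentences shared with the query
query_pivots = salient_pivots(ja_cooc, 50, ja_pivot_counts, N)

dictionary = {"unten": "drive", "suru": "do"}  # existing bilingual dictionary
candidates = {"car": {"drive", "road"}, "apple": {"eat"}}
scores = {c: pivot_overlap(query_pivots, piv, dictionary)
          for c, piv in candidates.items()}
best = max(scores, key=scores.get)
print(query_pivots, scores, best)
```

In this toy setting the frequent pivot "suru" co-occurs with the query word about as often as chance predicts, so it is filtered out, which is the motivation for keeping only statistically significant context.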

Keywords:

Computer science; Natural language processing; Artificial intelligence; Word (group theory); Context (archaeology); Machine translation; Sentence; Dependency (UML); Bilingual dictionary; Parsing; Linguistics

Metrics

FWCI (Field Weighted Citation Impact): 1.14

Citation Normalized Percentile: 0.83

Topics

Natural Language Processing Techniques

Physical Sciences → Computer Science → Artificial Intelligence

Topic Modeling

Physical Sciences → Computer Science → Artificial Intelligence

Mathematics, Computing, and Information Processing

Physical Sciences → Computer Science → Computational Theory and Mathematics

Related Documents

Extended pivot-based approach for bilingual lexicon extraction

Evaluating a Pivot-Based Approach for Bilingual Lexicon Extraction

A statistical view on bilingual lexicon extraction

Bilingual Lexicon Extraction