JOURNAL ARTICLE

Semi-Automatic Parallel Corpora Extraction from Comparable News Corpora

Abstract

Parallel corpora are a necessary resource in many multilingual and cross-lingual natural language processing applications, including Machine Translation and Cross-Lingual Information Retrieval. Preparing a large-scale parallel corpus takes time and demands linguistic skill. In the present work, a technique has been developed that extracts a parallel corpus between Manipuri, a morphologically rich and resource-constrained Indian language, and English from comparable news corpora collected from the web. A medium-sized Manipuri-English bilingual lexicon and a list of Manipuri-English transliterated entities have been developed and used in the present work. Using morphological information for the agglutinative and inflective Manipuri language further improves the alignment quality based on a similarity measure. A high level of performance is desirable, since errors in sentence alignment cause further errors in systems that use the aligned text. The system has been evaluated, and an error analysis has also been carried out. The technique is shown to be effective for the Manipuri-English language pair and is extendable to other resource-constrained, agglutinative and inflective Indian languages.
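
The abstract names three ingredients for scoring candidate sentence pairs: a bilingual lexicon, a list of transliterated entities, and morphological information. The abstract does not give the actual similarity measure, so the following is only a minimal sketch under stated assumptions, not the authors' method. It assumes a Dice coefficient over word sets and uses naive suffix stripping as a stand-in for morphological analysis. All names (`LEXICON`, `TRANSLITERATIONS`, `SUFFIXES`, `similarity`, `best_match`) and all source-side tokens are hypothetical; the tokens are invented placeholders, not real Manipuri words.

```python
from typing import Dict, List, Optional, Sequence, Set, Tuple

# Toy data for illustration only. The source-side strings are invented
# placeholders, NOT real Manipuri vocabulary or suffixes.
LEXICON: Dict[str, str] = {"tama": "river", "niko": "flood", "vesu": "village"}
TRANSLITERATIONS: Dict[str, str] = {"imphal": "imphal"}
SUFFIXES: Tuple[str, ...] = ("lo", "se")  # hypothetical inflectional suffixes


def strip_suffix(token: str) -> str:
    """Crude stand-in for morphological analysis: drop one known suffix."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) > len(suffix) + 1:
            return token[: -len(suffix)]
    return token


def translate_tokens(src_tokens: Sequence[str], use_morphology: bool) -> Set[str]:
    """Map source tokens to English words via the lexicon or transliteration list."""
    translated: Set[str] = set()
    for raw in src_tokens:
        token = raw.lower()
        forms = [token, strip_suffix(token)] if use_morphology else [token]
        for form in forms:
            if form in LEXICON:
                translated.add(LEXICON[form])
                break
            if form in TRANSLITERATIONS:
                translated.add(TRANSLITERATIONS[form])
                break
    return translated


def similarity(src_tokens: Sequence[str], tgt_tokens: Sequence[str],
               use_morphology: bool = True) -> float:
    """Dice coefficient between translated source words and target words."""
    translated = translate_tokens(src_tokens, use_morphology)
    target = {t.lower() for t in tgt_tokens}
    if not translated or not target:
        return 0.0
    return 2 * len(translated & target) / (len(translated) + len(target))


def best_match(src_tokens: Sequence[str], candidates: List[Sequence[str]],
               threshold: float = 0.3) -> Optional[Tuple[int, float]]:
    """Return (index, score) of the best-scoring candidate, or None below threshold."""
    if not candidates:
        return None
    score, index = max((similarity(src_tokens, c), i) for i, c in enumerate(candidates))
    return (index, score) if score >= threshold else None


src = ["tamalo", "nikose", "imphal"]
english = [["market", "price"], ["river", "flood", "imphal"]]
print(similarity(src, english[1], use_morphology=False))
print(similarity(src, english[1], use_morphology=True))
print(best_match(src, english))
```

In this toy setup the inflected placeholder forms are found in the lexicon only after suffix stripping, so the score with morphology is higher than the score without it. That mirrors, in a very simplified way, the abstract's claim that morphological information improves alignment quality.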

Keywords:
Computer science, Natural language processing, Artificial intelligence, Extraction (chemistry), Parallel corpora, Chromatography, Chemistry

Metrics

Cited By: 5
FWCI (Field Weighted Citation Impact): 1.60
Refs: 19
Citation Normalized Percentile: 0.86

Topics

Natural Language Processing Techniques
Physical Sciences →  Computer Science →  Artificial Intelligence
Topic Modeling
Physical Sciences →  Computer Science →  Artificial Intelligence
Handwritten Text Recognition Techniques
Physical Sciences →  Computer Science →  Computer Vision and Pattern Recognition

Related Documents

BOOK-CHAPTER

Parallel Texts Extraction from Multimodal Comparable Corpora

Haithem Afli, Loïc Barrault, Holger Schwenk

Lecture Notes in Computer Science Year: 2012 Pages: 40-51

JOURNAL ARTICLE

A Systematic Literature Review on Extraction of Parallel Corpora from Comparable Corpora

Dilshad Kaur, Satwinder Singh

Journal: Journal of Computer Science Year: 2021 Vol: 17 (10) Pages: 924-952

BOOK-CHAPTER

Comparable parallel corpora

Lidun Hareide

Studies in Corpus Linguistics Year: 2019 Pages: 19-38

JOURNAL ARTICLE

Integrated Parallel Sentence and Fragment Extraction from Comparable Corpora

Chenhui Chu, Toshiaki Nakazawa, Sadao Kurohashi

Journal: ACM Transactions on Asian and Low-Resource Language Information Processing Year: 2015 Vol: 15 (2) Pages: 1-22

JOURNAL ARTICLE

Comparable or Parallel Corpora?

Wolfgang Teubert

Journal: International Journal of Lexicography Year: 1996 Vol: 9 (3) Pages: 238-264