KOSHIK - A Large-scale Distributed Computing Framework for NLP

Peter Exner; Pierre Nugues

doi:10.5220/0004707704630470

JOURNAL ARTICLE

Year: 2014 Pages: 463-470

Abstract

In this paper, we describe KOSHIK, an end-to-end framework to process the unstructured natural language content of multilingual documents. We used the Hadoop distributed computing infrastructure to build this framework as it enables KOSHIK to easily scale by adding inexpensive commodity hardware. We designed an annotation model that allows the processing algorithms to incrementally add layers of annotation without modifying the original document. We used the Avro binary format to serialize the documents. Avro is designed for Hadoop and allows other data warehousing tools to directly query the documents. This paper reports the implementation choices and details of the framework, the annotation model, the options for querying processed data, and the parsing results on the English and Swedish editions of Wikipedia.
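The layered annotation idea in the abstract can be illustrated with a minimal sketch. It is not KOSHIK's actual schema: the `Document` and `Annotation` classes, their field names, and the toy `tokenize` function are all hypothetical, and plain Python objects stand in for the Avro records the paper uses. It only shows how processing steps might add annotation layers that point into the text by character offset while the original text stays unchanged.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class Annotation:
    """A stand-off annotation: a character span plus a label (hypothetical schema)."""
    start: int
    end: int
    label: str


@dataclass
class Document:
    """Original text plus named annotation layers; the text itself is never rewritten."""
    text: str
    layers: Dict[str, List[Annotation]] = field(default_factory=dict)

    def add_layer(self, name: str, annotations: List[Annotation]) -> None:
        # Each processing step contributes a new layer and leaves earlier layers alone.
        if name in self.layers:
            raise ValueError(f"layer {name!r} already exists")
        self.layers[name] = list(annotations)

    def covered_text(self, ann: Annotation) -> str:
        # Annotations refer to the text by offset instead of copying or editing it.
        return self.text[ann.start:ann.end]


def tokenize(doc: Document) -> List[Annotation]:
    """Toy whitespace tokenizer standing in for a real NLP component."""
    spans, pos = [], 0
    for word in doc.text.split():
        start = doc.text.index(word, pos)
        pos = start + len(word)
        spans.append(Annotation(start, pos, "token"))
    return spans


doc = Document("KOSHIK parses Wikipedia")
doc.add_layer("tokens", tokenize(doc))
# A later step adds a second layer that refers to the same, unchanged text.
doc.add_layer(
    "capitalized",
    [a for a in doc.layers["tokens"] if doc.covered_text(a)[0].isupper()],
)
```

Keeping annotations as offsets in separate layers means a later component can be rerun or replaced without touching the text or the layers produced by earlier components.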

Keywords:

Computer science, Annotation, Parsing, Serialization, Process (computing), Scale (ratio), Information retrieval, Natural language processing, Artificial intelligence, Programming language

Metrics

FWCI (Field Weighted Citation Impact): 4.83

Citation Normalized Percentile: 0.94

Topics

Natural Language Processing Techniques

Physical Sciences → Computer Science → Artificial Intelligence

Topic Modeling

Physical Sciences → Computer Science → Artificial Intelligence

Semantic Web and Ontologies

Physical Sciences → Computer Science → Artificial Intelligence
