A Moroccan News Articles Dataset (MNAD) For Arabic Text Categorization

Mourad Jbene; Smail Tigani; Rachid Saadane; Abdellah Chehri

doi:10.1109/dasa53625.2021.9682402

Year: 2021; Journal: 2021 International Conference on Decision Aid Sciences and Application (DASA); Pages: 350-353

Abstract

In recent years, natural language processing (NLP) has become one of the most active areas of research, especially with the emergence of deep learning algorithms. Most of this attention has gone to Latin-script languages such as English, French, and Spanish, given the availability of high-quality datasets and compute resources. In this paper, we present a Moroccan news articles corpus collected from four major Moroccan news websites. The corpus contains more than 418k news articles spanning 19 categories, making it one of the largest Arabic news article corpora. We describe the collection and processing steps and present an exploratory analysis. To demonstrate the utility of the dataset, we evaluate it on text classification using four machine learning baselines: Random Forest (RF), Multinomial Naive Bayes (MNB), Support Vector Machine (SVC), and Gradient Boosting (GradBoost) classifiers. The experimental results are reported in terms of accuracy, F1-score, and confusion matrix.
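To make the evaluation setup concrete, the following is a minimal sketch, not the authors' pipeline. It implements only one of the four baselines (Multinomial Naive Bayes) from scratch and computes the three reported metrics. The toy English documents, the two labels, the whitespace tokenizer, the Laplace smoothing value, and the choice of macro-averaged F1 are all assumptions made for illustration; none of them come from MNAD or the paper.

```python
import math
from collections import Counter, defaultdict


def train_mnb(docs, labels, alpha=1.0):
    """Fit multinomial Naive Bayes on whitespace-tokenized documents."""
    class_docs = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in zip(docs, labels):
        tokens = text.split()
        word_counts[label].update(tokens)
        vocab.update(tokens)
    model = {}
    for label, n in class_docs.items():
        # Laplace (add-alpha) smoothing over the shared vocabulary.
        denom = sum(word_counts[label].values()) + alpha * len(vocab)
        model[label] = {
            "log_prior": math.log(n / len(docs)),
            "log_like": {w: math.log((word_counts[label][w] + alpha) / denom)
                         for w in vocab},
        }
    return model


def predict_mnb(model, text):
    """Return the label with the highest log-posterior score."""
    scores = {}
    for label, params in model.items():
        score = params["log_prior"]
        for tok in text.split():
            if tok in params["log_like"]:  # skip out-of-vocabulary tokens
                score += params["log_like"][tok]
        scores[label] = score
    return max(scores, key=scores.get)


def confusion_matrix(y_true, y_pred, classes):
    """Rows are true classes, columns are predicted classes."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix


def accuracy(matrix):
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / sum(sum(row) for row in matrix)


def macro_f1(matrix):
    """Unweighted mean of per-class F1 (averaging scheme is an assumption)."""
    scores = []
    for i in range(len(matrix)):
        tp = matrix[i][i]
        fp = sum(row[i] for row in matrix) - tp
        fn = sum(matrix[i]) - tp
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy corpus, NOT taken from MNAD.
train_docs = ["match goal team", "team win league",
              "bank market stocks", "market inflation bank"]
train_labels = ["sport", "sport", "economy", "economy"]
test_docs = ["team goal", "bank stocks", "league market"]
test_labels = ["sport", "economy", "sport"]

model = train_mnb(train_docs, train_labels)
predictions = [predict_mnb(model, d) for d in test_docs]
classes = sorted(set(train_labels))  # ["economy", "sport"]
matrix = confusion_matrix(test_labels, predictions, classes)
acc = accuracy(matrix)
f1 = macro_f1(matrix)
print(matrix, round(acc, 3), round(f1, 3))
```

On a real corpus the same three metrics would be computed on a held-out split for each baseline, and the confusion matrix would have one row and one column per category.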

Keywords:

Computer science; Artificial intelligence; Natural language processing; Support vector machine; Naive Bayes classifier; Confusion matrix; Categorization; Context (archaeology); Arabic; Random forest; Gradient boosting; Confusion; Machine learning; Linguistics; Geography

Metrics

FWCI (Field Weighted Citation Impact): 0.61

Citation Normalized Percentile: 0.72

Topics

Text and Document Classification Technologies

Physical Sciences → Computer Science → Artificial Intelligence

Spam and Phishing Detection

Physical Sciences → Computer Science → Information Systems

Advanced Text Analysis Techniques

Physical Sciences → Computer Science → Artificial Intelligence
