Ronald Tudu, Shaibal Saha, Prasun Nandy Pritam, Rajesh Palit
In this digital era, an enormous amount of data is generated every day, and most of it is unstructured text. An automated text classifier assigns texts to predefined categories. With the help of machine learning, we can learn the features of pre-categorized documents and predict a document's category. Bengali is one of the most widely spoken languages in the world, so automated text categorization for Bengali has become essential. Text categorization mostly uses data mining algorithms along with NLP tools, feature extraction and selection methods, and vector space modeling. In this paper, we measure the performance of Support Vector Machine (SVM), Multinomial Naive Bayes (MNB), Stochastic Gradient Descent (SGD), and Logistic Regression (LR) methods on an open-source Bengali newspaper article corpus containing 84,906 articles in 10 categories. We examine the impact of training dataset size on classification accuracy for the different algorithms. We also document the execution time needed to train each method and discuss issues and challenges in Bengali text categorization. This paper can serve as a reference for future researchers in Bengali text categorization.
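As a rough illustration of one of the four classifiers named above, the following is a minimal sketch of Multinomial Naive Bayes over bag-of-words counts with Laplace smoothing. It is not the authors' pipeline, which the abstract does not detail. The function names `train_mnb` and `predict_mnb`, the whitespace tokenizer, and the four-document English toy corpus are all assumptions made for illustration; a real experiment would use the Bengali corpus and a proper tokenizer.

```python
import math
from collections import Counter, defaultdict


def train_mnb(docs, labels, alpha=1.0):
    """Fit a multinomial Naive Bayes model (hypothetical helper, not from the paper)."""
    class_docs = Counter(labels)        # number of documents per class
    term_counts = defaultdict(Counter)  # class -> term -> raw count
    vocab = set()
    for text, label in zip(docs, labels):
        tokens = text.split()           # assumption: whitespace tokenization
        term_counts[label].update(tokens)
        vocab.update(tokens)
    return {
        "alpha": alpha,
        "vocab": vocab,
        "term_counts": term_counts,
        "log_prior": {c: math.log(n / len(docs)) for c, n in class_docs.items()},
        "class_totals": {c: sum(term_counts[c].values()) for c in class_docs},
    }


def predict_mnb(model, text):
    """Return the class with the highest log posterior for one document."""
    alpha, vocab_size = model["alpha"], len(model["vocab"])
    best_label, best_score = None, -math.inf
    for c, log_prior in model["log_prior"].items():
        score = log_prior
        denom = model["class_totals"][c] + alpha * vocab_size
        for tok in text.split():
            if tok not in model["vocab"]:
                continue                # ignore terms never seen in training
            score += math.log((model["term_counts"][c][tok] + alpha) / denom)
        if score > best_score:
            best_label, best_score = c, score
    return best_label


# Toy data standing in for the newspaper corpus (illustrative only).
train_docs = [
    "team wins cricket match",
    "cricket player scores century",
    "bank raises interest rate",
    "stock market interest falls",
]
train_labels = ["sports", "sports", "economy", "economy"]

model = train_mnb(train_docs, train_labels)
sports_guess = predict_mnb(model, "cricket match today")
economy_guess = predict_mnb(model, "interest rate rises")
```

Scores are accumulated in log space to avoid numeric underflow on long documents, and the smoothing constant `alpha` keeps a term that is unseen in one class from forcing that class's probability to zero.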