The use of data-driven models in diabetes detection has gained considerable attention as a way to improve global medical systems, owing to their cost-effective and less invasive nature. Most existing studies apply statistical feature selection methods such as the Pearson correlation coefficient (PCC) or principal component analysis (PCA), which assume linear relationships; this assumption limits their practicality on real-life diabetic data. In this paper, a SMOTEENN-based univariate feature selection method is proposed for machine learning-based diabetes classification models. It combines the advantages of SMOTEENN resampling (SMOTE oversampling followed by Edited Nearest Neighbours cleaning) and univariate feature selection to improve the classification rate with lower-dimensional input. The results of this research imply that the proposed method is effective in achieving high classification accuracy: the Logistic Regression, Random Forest and Support Vector Machine-based models constructed in this research achieve accuracies of over 90% after feature selection, while also reducing the computational cost and time required for the classification task. A more extensive dataset should be considered and compared in future work to further verify the method's effectiveness on this task.
S. Chidambaram, K. G. Srinivasagan