Abstract This paper applies a random forest optimized by the sparrow search algorithm to protein classification prediction. Pearson correlation analysis of the data set shows that the attribute most positively correlated with protein category is crystallization temperature (K), and the most negatively correlated is molecular weight. The paper then compares baseline models (decision tree, random forest, BP neural network, XGBoost, CatBoost and others) with the proposed sparrow-search-optimized random forest on protein category prediction. Among the baselines, random forest achieves the highest accuracy at 70.7%, and the BP neural network the lowest at only 35%. The proposed model improves accuracy, recall, precision and F1 score; its accuracy reaches 73%, which is 2.3 percentage points above the best baseline. Confusion matrices are reported for both the training set and the test set: prediction accuracy is 99.4% on the training set and 73% on the test set. Although test accuracy is roughly 26 percentage points lower than training accuracy, the model still shows some generalization ability, and future research will continue to optimize the algorithm to improve generalization further. Introducing the sparrow search algorithm to optimize the random forest improves the accuracy of protein classification prediction and offers a new method for similar bioinformatics problems, as well as a new perspective for future algorithm optimization and model improvement.
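The optimization idea summarized above can be illustrated with a minimal, simplified sketch of a sparrow search loop. This is not the paper's implementation: the abstract does not state which random forest hyperparameters are tuned or which fitness function is used, so the two parameters (a tree count and a maximum depth), their bounds, and the function `toy_validation_error` are hypothetical stand-ins. In a real setting the objective would presumably be a validation error of a random forest trained with the candidate hyperparameters.

```python
import math
import random


def sparrow_search(objective, bounds, n_sparrows=20, n_iter=50,
                   producer_frac=0.2, aware_frac=0.1, safety=0.8, seed=0):
    """Minimise `objective` inside box `bounds` with a simplified sparrow search."""
    rng = random.Random(seed)
    dim = len(bounds)

    def clamp(x):
        return [min(max(v, lo), hi) for v, (lo, hi) in zip(x, bounds)]

    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_sparrows)]
    fit = [objective(x) for x in pop]
    best_f = min(fit)
    best_x = list(pop[fit.index(best_f)])
    n_prod = max(1, int(producer_frac * n_sparrows))
    n_aware = max(1, int(aware_frac * n_sparrows))

    for _ in range(n_iter):
        # Rank the population so that index 0 holds the lowest objective value.
        order = sorted(range(n_sparrows), key=fit.__getitem__)
        pop = [pop[i] for i in order]
        fit = [fit[i] for i in order]
        cur_best, cur_worst = list(pop[0]), list(pop[-1])
        f_best, f_worst = fit[0], fit[-1]
        alarm = rng.random()

        for i in range(n_sparrows):
            x, rank = pop[i], i + 1
            if i < n_prod:
                if alarm < safety:
                    # Producers search locally while no danger is signalled.
                    alpha = 1.0 - rng.random()  # uniform in (0, 1]
                    scale = math.exp(-rank / (alpha * n_iter))
                    new = [v * scale for v in x]
                else:
                    # Alarm raised: producers take a random Gaussian step.
                    step = rng.gauss(0, 1)
                    new = [v + step for v in x]
            elif rank > n_sparrows / 2:
                # Low-ranked scroungers fly elsewhere to forage.
                q = rng.gauss(0, 1)
                new = [q * math.exp((w - v) / rank ** 2)
                       for v, w in zip(x, cur_worst)]
            else:
                # Remaining scroungers move close to the current best position.
                step = sum(abs(v - b) * rng.choice((-1, 1))
                           for v, b in zip(x, cur_best)) / dim
                new = [b + step for b in cur_best]
            pop[i] = clamp(new)

        # A random subset of sparrows becomes aware of danger and relocates.
        for i in rng.sample(range(n_sparrows), n_aware):
            x = pop[i]
            if fit[i] > f_best:
                beta = rng.gauss(0, 1)
                new = [b + beta * abs(v - b) for v, b in zip(x, cur_best)]
            else:
                k = rng.uniform(-1, 1)
                denom = abs(fit[i] - f_worst) + 1e-12
                new = [v + k * abs(v - w) / denom for v, w in zip(x, cur_worst)]
            pop[i] = clamp(new)

        fit = [objective(x) for x in pop]
        i_min = min(range(n_sparrows), key=fit.__getitem__)
        if fit[i_min] < best_f:  # keep the best solution seen so far
            best_f, best_x = fit[i_min], list(pop[i_min])

    return best_x, best_f


def toy_validation_error(params):
    """Hypothetical stand-in for a random forest's validation error."""
    n_trees, depth = params
    return ((n_trees - 200) / 500) ** 2 + ((depth - 12) / 30) ** 2


# Hypothetical search ranges: number of trees, maximum tree depth.
BOUNDS = [(10.0, 500.0), (1.0, 30.0)]
best_params, best_error = sparrow_search(toy_validation_error, BOUNDS)
```

To tune an actual model, `toy_validation_error` would be replaced by a function that rounds the candidate values to integers, trains a random forest with them, and returns its error on held-out data. Evaluating on held-out data rather than on the training set matters here, given the gap between training and test accuracy reported above.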
Ying Chen, Zheng‐Ying Liu, Chongxuan Xu, Xueliang Zhao, Lili Pang, Kang Li, Yanxin Shi
Meng Wang, Guoyan Zhao, Shaofeng Wang