JOURNAL ARTICLE

Document Classification of Protein Sequences

Betty Yee Man Cheng, Jaime G. Carbonell, Judith Klein‐Seetharaman

Year: 2003
Journal: OPAL (Open@LaTrobe) (La Trobe University)
Publisher: La Trobe University

Abstract

The need for accurate, automated protein classification methods continues to increase as advances in biotechnology uncover new proteins at a fast rate. G-protein coupled receptors (GPCRs) are a particularly difficult superfamily of proteins to classify because of the extreme diversity among its members; yet they are an important subject in pharmacological research, as they are the target of approximately 60% of current drugs (Muller, 2000). A comparison of BLAST, k-NN, HMM and SVM with alignment-based features by Karchin et al. (2002) suggested that classifiers as complex as SVMs are needed to attain high accuracy in GPCR subfamily classification. Here, by analogy with document classification, we applied Decision Tree and Naïve Bayes classifiers with chi-square feature selection on n-gram counts to the GPCR family and subfamily classification tasks. Using the dataset and evaluation protocol from the previous study, we found that the Naïve Bayes classifier surpassed the reported accuracy of SVM by 4.8% and 6.1% in level I and level II subfamily classification, reaching accuracies of 93.2% and 92.4%, respectively. The Decision Tree, while inferior to SVM, still outperformed HMM in both level I and level II subfamily classification. Moreover, the n-grams selected by chi-square feature selection show evidence of biological importance. Thus, the document classification approach has resulted in a simpler, more accurate and more interpretable classifier.
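The pipeline named in the abstract (n-gram counts, chi-square feature selection, Naïve Bayes) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the authors' implementation: the abstract does not state the n-gram lengths, the number of selected features, or the Naïve Bayes variant, so the sketch uses 1- and 2-grams, a presence-based 2x2 chi-square score, a multinomial model with add-one smoothing, and made-up toy sequences with hypothetical labels "A" and "B".

```python
import math
from collections import Counter

def ngram_counts(seq, n_values=(1, 2)):
    """Count overlapping n-grams (the 'words' of a sequence 'document')."""
    counts = Counter()
    for n in n_values:
        for i in range(len(seq) - n + 1):
            counts[seq[i:i + n]] += 1
    return counts

def chi_square(feature, docs, labels, target):
    """2x2 chi-square between presence of `feature` and membership in `target`."""
    a = b = c = d = 0
    for doc, label in zip(docs, labels):
        present = feature in doc
        if present and label == target:
            a += 1
        elif present:
            b += 1
        elif label == target:
            c += 1
        else:
            d += 1
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom

def select_features(docs, labels, k):
    """Keep the k n-grams with the highest chi-square score for any class."""
    vocab = set().union(*docs)
    classes = set(labels)
    score = {f: max(chi_square(f, docs, labels, c) for c in classes) for f in vocab}
    # Ties are broken alphabetically so the selection is deterministic.
    return set(sorted(vocab, key=lambda f: (-score[f], f))[:k])

def train_nb(docs, labels, features):
    """Multinomial Naive Bayes with add-one smoothing over the selected n-grams."""
    classes = sorted(set(labels))
    prior = {c: math.log(labels.count(c) / len(labels)) for c in classes}
    totals = {c: Counter() for c in classes}
    for doc, label in zip(docs, labels):
        for f, cnt in doc.items():
            if f in features:
                totals[label][f] += cnt
    loglik = {}
    for c in classes:
        denom = sum(totals[c].values()) + len(features)
        loglik[c] = {f: math.log((totals[c][f] + 1) / denom) for f in features}
    return prior, loglik

def predict(model, doc):
    """Return the class with the highest log-posterior for one n-gram count vector."""
    prior, loglik = model
    def score(c):
        return prior[c] + sum(cnt * loglik[c][f] for f, cnt in doc.items() if f in loglik[c])
    return max(prior, key=score)

# Toy data: invented sequences, not real GPCRs; labels are hypothetical.
train = [("AWWLAWW", "A"), ("WWAALWW", "A"), ("AKKLAKK", "B"), ("KKAALKK", "B")]
docs = [ngram_counts(s) for s, _ in train]
labels = [y for _, y in train]
selected = select_features(docs, labels, k=4)
model = train_nb(docs, labels, selected)
print(sorted(selected), predict(model, ngram_counts("LWWAWWA")))
```

In this toy setting the selected n-grams are the ones that occur in only one class, which mirrors, in a very simplified way, the abstract's point that the chosen features are open to inspection.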

Keywords:
Naive Bayes classifier; Artificial intelligence; Support vector machine; Subfamily; Feature selection; Pattern recognition (psychology); Classifier (UML); Machine learning; Computer science; Decision tree; Linear classifier; Biology; Genetics; Gene

Metrics

Cited By: 0
FWCI (Field Weighted Citation Impact): 0.00
Refs: 66
Citation Normalized Percentile: 0.01

Topics

Machine Learning in Bioinformatics (Life Sciences → Biochemistry, Genetics and Molecular Biology → Molecular Biology)
Biochemical and Structural Characterization (Life Sciences → Biochemistry, Genetics and Molecular Biology → Molecular Biology)
Receptor Mechanisms and Signaling (Life Sciences → Biochemistry, Genetics and Molecular Biology → Molecular Biology)