Wencan Zhu, Céline Lévy-Leduc, Nils Ternès
In bioinformatics, the rapid development of sequencing technology has made it possible to collect an increasing amount of omics data. Classification based on omics data is one of the central problems in biomedical research. However, omics data typically have a limited sample size but a high feature dimension, and it is assumed that only a few features (biomarkers) are active, i.e. informative for discriminating between categories. Identifying active biomarkers for classification has therefore become fundamental to omics data analysis. Focusing on binary classification, we propose an innovative feature selection method designed to handle high correlations between the biomarkers. Our method, WLogit, consists of whitening the design matrix to remove the correlations between biomarkers, then using a penalized criterion adapted to the logistic regression model to select features. Numerical experiments suggest that WLogit can identify almost all active biomarkers even when the biomarkers are highly correlated, a setting in which other methods fail, and that this leads to higher classification accuracy. The performance of WLogit is also evaluated on two publicly available datasets, where the resulting classifier outperforms other methods in terms of prediction accuracy. Our method is implemented in the WLogit R package, available from the Comprehensive R Archive Network (CRAN).
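To make the two steps named in the abstract concrete (whitening the design matrix, then fitting a penalized logistic regression), the following is a minimal Python sketch of the general idea. It is not the WLogit package and does not reproduce its exact criterion. The function names `whiten`, `soft_threshold` and `l1_logistic`, the ridge term added to the sample covariance, the plain L1 penalty on the whitened coefficients, the proximal-gradient solver, and the simulated data are all assumptions made for illustration.

```python
import numpy as np


def whiten(X, ridge=1e-3):
    """Center X and decorrelate its columns with Sigma^{-1/2}.

    Sigma is the sample covariance plus a small ridge term (an assumption
    made here so that the inverse square root exists when p > n).
    """
    Xc = X - X.mean(axis=0)
    n, p = Xc.shape
    sigma = Xc.T @ Xc / n + ridge * np.eye(p)
    eigval, eigvec = np.linalg.eigh(sigma)            # sigma is symmetric PD
    inv_sqrt = (eigvec * eigval ** -0.5) @ eigvec.T   # Sigma^{-1/2}
    return Xc @ inv_sqrt, inv_sqrt


def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1 (shrinks entries toward zero)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)


def l1_logistic(X, y, lam, n_iter=2000):
    """Proximal gradient descent for mean logistic loss + lam * ||beta||_1.

    The intercept is left unpenalized. y must contain 0/1 labels.
    """
    n, p = X.shape
    # 1 / (upper bound on the Lipschitz constant of the gradient)
    step = 4.0 * n / (np.linalg.norm(X, 2) ** 2 + n)
    beta, intercept = np.zeros(p), 0.0
    for _ in range(n_iter):
        prob = 1.0 / (1.0 + np.exp(-(intercept + X @ beta)))
        resid = prob - y
        intercept -= step * resid.mean()
        beta = soft_threshold(beta - step * (X.T @ resid) / n, step * lam)
    return beta, intercept


# Simulated example (hypothetical data): 10 highly correlated features,
# of which only the first two are active.
rng = np.random.default_rng(0)
n, p = 200, 10
cov = 0.9 * np.ones((p, p)) + 0.1 * np.eye(p)
X = rng.multivariate_normal(np.zeros(p), cov, size=n)
true_beta = np.zeros(p)
true_beta[:2] = [2.0, -2.0]
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-X @ true_beta))).astype(float)

X_tilde, inv_sqrt = whiten(X)                 # step 1: whitening
beta_tilde, b0 = l1_logistic(X_tilde, y, lam=0.05)   # step 2: penalized fit
beta_hat = inv_sqrt @ beta_tilde              # back to the original scale
print(np.round(beta_hat, 2))
```

Because the whitened design satisfies X_tilde @ beta_tilde = Xc @ (inv_sqrt @ beta_tilde), the last step maps the fitted coefficients back to the original features. In this simplified sketch the penalty acts on the whitened coefficients, so exact zeros in `beta_tilde` do not generally translate into exact zeros in `beta_hat`; how the published method ties the penalty to the original coefficients is not described in the abstract and is not reproduced here.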
Dietmar Zellner, Frieder Keller, Günter E. Zellner
Zhuanzhuan Ma, Zifei Han, Souparno Ghosh, Liucang Wu, Min Wang
Luca Insolia, Ana Kenney, Martina Calovi, Francesca Chiaromonte
Martin J. Wainwright, Pradeep Ravikumar, John Lafferty