The rapid advancement of high-throughput sequencing technologies has produced unprecedented amounts and types of omic data. Predicting clinical outcomes from genomic features such as gene expression, methylation, and genotypes is becoming increasingly important for individualized risk assessment and treatment. Associated with these genomic features is also a rich set of meta-features, such as functional annotations, pathway information, and knowledge from previous studies, that constitutes valuable additional information. Traditionally, such meta-feature information is used in a post-hoc manner to enhance model explainability. For example, after a model is fit, one can formally assess whether the selected gene features are enriched in particular metabolic pathways or gene ontology annotations. This kind of post-hoc analysis can provide biological insight and validation for a prediction model. In this dissertation, we propose novel methods that exploit genomic meta-features a priori rather than post hoc, to better identify important markers and improve prediction performance. We aim to address one central question: how can we predict an outcome of interest and identify relevant features while taking additional information about the features into account?

Since genomic data sets are typically high-dimensional, penalized regression methods are commonly used to select relevant features and build predictive models. Standard penalized regression applies a single penalty parameter to all features, ignoring structural differences or heterogeneity among the features. Motivated by this, we integrate meta-features into penalized regression by making the penalty parameters meta-feature-driven: the penalty parameters are modeled as a log-linear function of the meta-features and are estimated from the data using an approximate empirical Bayes approach.

This dissertation is structured as follows.
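To make the log-linear penalty model concrete, the following is a minimal toy sketch, not the dissertation's implementation. It assumes a meta-feature matrix Z (p features by q meta-features) and fixed hyperparameters alpha; in the proposed method alpha would be estimated by approximate empirical Bayes, but here it is set by hand for illustration. Feature-specific penalties lambda_j = exp(z_j' alpha) are then absorbed into a standard lasso fit via column rescaling.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)

# Toy data: n samples, p genomic features, q meta-features per feature.
n, p, q = 100, 50, 2
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:5] = 2.0                           # only the first 5 features matter
y = X @ beta + rng.standard_normal(n)

# Z: p x q meta-feature matrix, e.g. a pathway-membership indicator
# (column 0) plus an intercept (column 1).
Z = np.zeros((p, q))
Z[:5, 0] = 1.0
Z[:, 1] = 1.0

# Hypothetical hyperparameters; the method would estimate these from the
# data by empirical Bayes. A negative coefficient lowers the penalty for
# features in the annotated pathway.
alpha = np.array([-1.0, 0.0])

# Log-linear model for feature-specific penalties: lam_j = exp(z_j' alpha).
lam = np.exp(Z @ alpha)

# Weighted lasso via rescaling: penalizing |beta_j| with weight lam_j is
# equivalent to a standard lasso on the rescaled columns X_j / lam_j.
X_tilde = X / lam
fit = Lasso(alpha=0.01).fit(X_tilde, y)
beta_hat = fit.coef_ / lam               # map back to the original scale
```

The rescaling trick works because substituting w_j = lam_j * beta_j into the standard lasso objective recovers the feature-weighted penalty sum over lam_j * |beta_j|; features with informative meta-features receive smaller lam_j and are therefore penalized less.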
Chapter 1 introduces how penalized regression techniques can be used to solve high-dimensional data problems. Chapter 2 describes an empirical Bayes approach to selecting the penalty parameter(s) in penalized regression. Chapter 3 discusses our method for incorporating meta-features into LASSO linear regression. Chapter 4 is devoted to the optimization algorithms for marginal likelihood maximization. Chapter 5 extends the model to Ridge and Elastic-Net linear and logistic regression. Finally, Chapter 6 presents the R package we developed to implement our method.
Chubing Zeng, Duncan C. Thomas, Juan Pablo Lewinger