Journal Article

Predictor correlation impacts machine learning algorithms: implications for genomic studies

Kristin K. Nicodemus and James D. Malley

in Bioinformatics

Volume 25, issue 15, pages 1884-1890
Published in print August 2009 | ISSN: 1367-4803
Published online May 2009 | e-ISSN: 1460-2059 | DOI: http://dx.doi.org/10.1093/bioinformatics/btp331

Motivation: The advent of high-throughput genomics has produced studies with large numbers of predictors (e.g. genome-wide association, microarray studies). Machine learning algorithms (MLAs) are a computationally efficient way to identify phenotype-associated variables in high-dimensional data. There are important results from mathematical theory and numerous practical results documenting their value. One attractive feature of MLAs is that many operate in a fully multivariate environment, allowing for small-importance variables to be included when they act cooperatively. However, certain properties of MLAs under conditions common in genomic-related data have not been well-studied—in particular, correlations among predictors pose a problem.

Results: Using extensive simulation, we showed that considering correlation among predictors is crucial in making valid inferences using variable importance measures (VIMs) from three MLAs: random forest (RF), conditional inference forest (CIF) and Monte Carlo logic regression (MCLR). Using a case–control illustration, we showed that the RF VIMs, even permutation-based ones, were less able to detect association than other algorithms at effect sizes encountered in complex disease studies. This reduction occurred when ‘causal’ predictors were correlated with other predictors, and was sharpest when RF tree building used the Gini index. Indeed, RF Gini VIMs were biased under correlation, dependent on predictor correlation strength and number, and over-trained to random fluctuations in the data when tree terminal node size was small. Permutation-based VIM distributions were less variable for correlated predictors and were unbiased, and thus may be preferred when predictors are correlated. MLAs are a powerful tool for high-dimensional data analysis, but well-considered use of algorithms is necessary to draw valid conclusions.
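
As an illustration of what a permutation-based VIM measures, the following minimal Python sketch shuffles one predictor at a time and records the resulting drop in classification accuracy. It is not the authors' simulation and does not use RF, CIF or MCLR: the data-generating model, the sample size, the correlation strength and the stand-in classifier `toy_predict` are all hypothetical choices made only to show the mechanics on a causal predictor, a correlated non-causal predictor and an independent noise predictor.

```python
import random


def permutation_importance(predict, X, y, n_repeats=20, seed=0):
    """Mean drop in accuracy when each predictor column is shuffled in turn."""
    rng = random.Random(seed)

    def accuracy(rows):
        return sum(predict(r) == t for r, t in zip(rows, y)) / len(y)

    baseline = accuracy(X)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            # Shuffle column j only, breaking its link to the outcome
            # and to the other predictors.
            column = [row[j] for row in X]
            rng.shuffle(column)
            permuted = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - accuracy(permuted))
        importances.append(sum(drops) / n_repeats)
    return importances


# Hypothetical data: one 'causal' predictor, one predictor correlated with it,
# and one independent noise predictor. Only the causal predictor drives y.
rng = random.Random(42)
X, y = [], []
for _ in range(500):
    causal = rng.gauss(0, 1)
    correlated = causal + 0.5 * rng.gauss(0, 1)
    noise = rng.gauss(0, 1)
    X.append([causal, correlated, noise])
    y.append(int(causal + 0.5 * rng.gauss(0, 1) > 0))


def toy_predict(row):
    # Assumption: a fixed rule standing in for a fitted model that leans on
    # both correlated predictors and ignores the noise predictor.
    return int(0.5 * row[0] + 0.5 * row[1] > 0)


importances = permutation_importance(toy_predict, X, y)
for name, value in zip(["causal", "correlated", "noise"], importances):
    print(f"{name:>10}: {value:.3f}")
```

In this sketch the noise predictor receives an importance of exactly zero because the stand-in rule never reads it, while the values for the other two predictors depend on the simulated sample. A real analysis would replace `toy_predict` with a fitted forest and would typically evaluate accuracy on out-of-bag or held-out observations.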

Contact: kristin.nicodemus@well.ox.ac.uk

Supplementary information: Supplementary data are available at Bioinformatics online.

Journal Article.  4724 words.  Illustrated.

Subjects: Bioinformatics and Computational Biology
