Journal Article

Model-based clustering of array CGH data

Sohrab P. Shah, K-John Cheung, Nathalie A. Johnson, Guillaume Alain, Randy D. Gascoyne, Douglas E. Horsman, Raymond T. Ng and Kevin P. Murphy

in Bioinformatics

Volume 25, issue 12, pages i30-i38
Published in print June 2009 | ISSN: 1367-4803
Published online May 2009 | e-ISSN: 1460-2059 | DOI:

More Like This

Show all results sharing this subject:

  • Bioinformatics and Computational Biology


Show Summary Details


Motivation: Analysis of array comparative genomic hybridization (aCGH) data for recurrent DNA copy number alterations from a cohort of patients can yield distinct sets of molecular signatures or profiles. This can be due to the presence of heterogeneous cancer subtypes within a supposedly homogeneous population.

Results: We propose a novel statistical method for automatically detecting such subtypes or clusters. Our approach is model based: each cluster is defined in terms of a sparse profile, which contains the locations of unusually frequent alterations. The profile is represented as a hidden Markov model. Samples are assigned to clusters based on their similarity to the cluster's profile. We simultaneously infer the cluster assignments and the cluster profiles using an expectation maximization-like algorithm. We show, using a realistic simulation study, that our method is significantly more accurate than standard clustering techniques. We then apply our method to two clinical datasets. In particular, we examine previously reported aCGH data from a cohort of 106 follicular lymphoma patients, and discover clusters that are known to correspond to clinically relevant subgroups. In addition, we examine a cohort of 92 diffuse large B-cell lymphoma patients, and discover previously unreported clusters of biological interest which have inspired followup clinical research on an independent cohort.

Availability: Software and synthetic datasets are available at∼sshah/acgh as part of the CNA-HMMer package.


Supplementary information: Supplementary data are available at Bioinformatics online.

Journal Article.  6386 words.  Illustrated.

Subjects: Bioinformatics and Computational Biology

Users without a subscription are not able to see the full content. Please, subscribe or login to access all content.