Journal Article

<i>k</i>-link EST clustering: evaluating error introduced by chimeric sequences under different degrees of linkage

Lauren M. Bragg and Glenn Stone

in Bioinformatics

Volume 25, issue 18, pages 2302-2308
Published in print September 2009 | ISSN: 1367-4803
Published online July 2009 | e-ISSN: 1460-2059 | DOI: http://dx.doi.org/10.1093/bioinformatics/btp410

More Like This

Show all results sharing this subject:

  • Bioinformatics and Computational Biology

GO

Show Summary Details

Preview

Motivation: The clustering of expressed sequence tags (ESTs) is a crucial step in many sequence analysis studies that require a high level of redundancy. Chimeric sequences, while uncommon, can make achieving the optimal EST clustering a challenge. Single-linkage algorithms are particularly vulnerable to the effects of chimeras. To avoid chimera-facilitated erroneous merges, researchers using single-linkage algorithms are forced to use stringent sequence–similarity thresholds. Such thresholds reduce the sensitivity of the clustering algorithm.

Results: We introduce the concept of k-link clustering for EST data. We evaluate how clustering error rates vary over a range of linkage thresholds. Using k-link, we show that Type II error decreases in response to increasing the number of shared ESTs (ie. links) required. We observe a base level of Type II error likely caused by the presence of unmasked low-complexity or repetitive sequence. We find that Type I error increases gradually with increased linkage. To minimize the Type I error introduced by increased linkage requirements, we propose an extension to k-link which modifies the required number of links with respect to the size of clusters being compared.

Availability: The implementation of k-link is available under the terms of the GPL from http://www.bioinformatics.csiro.au/products.shtml. k-link is licensed under the GNU General Public License, and can be downloaded from http://www.bioinformatics.csiro.au/products.shtml. k-link is written in C++.

Contact: lauren.bragg@csiro.au

Supplementary information: Supplementary data are available at Bioinformatics online.

Journal Article.  5489 words.  Illustrated.

Subjects: Bioinformatics and Computational Biology

Users without a subscription are not able to see the full content. Please, subscribe or login to access all content.