Journal Article

Efficient estimation of pairwise distances between genomes

Mirjana Domazet-Lošo and Bernhard Haubold

in Bioinformatics

Volume 25, issue 24, pages 3221-3227
Published in print December 2009 | ISSN: 1367-4803
Published online October 2009 | e-ISSN: 1460-2059 | DOI:
Efficient estimation of pairwise distances between genomes

Show Summary Details


Motivation: Genome comparison is central to contemporary genomics and typically relies on sequence alignment. However, genome-wide alignments are difficult to compute. We have, therefore, recently developed an accurate alignment-free estimator of the number of substitutions per site based on the lengths of exact matches between pairs of sequences. The previous implementation of this measure requires n(n−1) suffix tree constructions and traversals, where n is the number of sequences analyzed. This does not scale well for large n.

Results: We present an algorithm to extract pairwise distances in a single traversal of a single suffix tree containing n sequences. As a result, the run time of the suffix tree construction phase of our algorithm is reduced from O(n2L) to O(nL), where L is the length of each sequence. We implement this algorithm in the program kr version 2 and apply it to 825 HIV genomes, 13 genomes of enterobacteria and the complete genomes of 12 Drosophila species. We show that, depending on the input dataset, the new program is at least 10 times faster than its predecessor.

Availability: Version 2 of kr can be tested via a web interface at It is written in standard C and its source code is available under the GNU General Public License from the same web site.


Supplementary informations: Supplementary data are available at Bioinformatics online.

Journal Article.  4616 words.  Illustrated.

Subjects: Bioinformatics and Computational Biology

Full text: subscription required

How to subscribe Recommend to my Librarian

Users without a subscription are not able to see the full content. Please, subscribe or login to access all content.