Journal Article

Gclust: trans-kingdom classification of proteins using automatic individual threshold setting

Naoki Sato

in Bioinformatics

Volume 25, issue 5, pages 599-605
Published in print March 2009 | ISSN: 1367-4803
Published online January 2009 | e-ISSN: 1460-2059 | DOI:


Motivation: Trans-kingdom protein clustering has remained difficult because of the large sequence divergence between eukaryotes and prokaryotes and the presence of transit sequences in organellar proteins. Large-scale protein clustering that includes such divergent organisms needs a heuristic that efficiently selects similar proteins by setting a proper homolog threshold for each protein. Here, a method is described that uses two similarity measures and organism count.

Results: The Gclust software constructs minimal homolog groups from all-against-all BLASTP results by single-linkage clustering. Major points include (i) estimation of the domain structure of proteins; (ii) exclusion of multi-domain proteins; (iii) explicit consideration of transit peptides; and (iv) heuristic estimation of a similarity threshold for homologs of each protein by the entropy-optimized organism count method. The resultant clusters were evaluated in light of the power law. The software was used to construct protein clusters for up to 95 organisms.
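
As a rough illustration of the single-linkage step only, the sketch below groups proteins from a list of pairwise hits with a union-find structure. It is not the Gclust implementation: the function name `single_linkage_clusters`, the `(query, subject, evalue)` tuple layout, and the single global `max_evalue` cutoff are all assumptions made for this example, whereas Gclust estimates an individual threshold for each protein and also handles domain structure and transit peptides, none of which is modeled here.

```python
from collections import defaultdict


def single_linkage_clusters(hits, max_evalue=1e-10):
    """Group proteins by single linkage over pairwise similarity hits.

    hits: iterable of (query, subject, evalue) tuples (hypothetical layout).
    max_evalue: one global cutoff, used here for simplicity; Gclust itself
    sets a threshold per protein, which this sketch does not attempt.
    """
    parent = {}

    def find(x):
        # Register unseen proteins as their own root, then walk to the root.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for query, subject, evalue in hits:
        # Register both proteins so weakly matching ones become singletons.
        find(query)
        find(subject)
        if evalue <= max_evalue:
            union(query, subject)

    groups = defaultdict(set)
    for protein in list(parent):
        groups[find(protein)].add(protein)

    # Largest clusters first, ties broken alphabetically.
    return sorted((sorted(g) for g in groups.values()),
                  key=lambda g: (-len(g), g))


# Toy data with made-up identifiers and E-values.
example_hits = [
    ("A1", "B1", 1e-50),
    ("B1", "C1", 1e-20),
    ("A2", "B2", 1e-30),
    ("A1", "A2", 1e-3),   # too weak to link the two groups
]
clusters = single_linkage_clusters(example_hits)
```

With the toy data, A1 and C1 end up in the same cluster through B1 even though they have no direct hit, which is the defining property of single linkage. It is also why the choice of threshold matters so much: one spurious link is enough to merge two otherwise unrelated groups.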

Availability: Software and data are available at


Supplementary information: Supplementary data are available at Bioinformatics online.

Journal Article.  5118 words.  Illustrated.

Subjects: Bioinformatics and Computational Biology
