Journal Article

A fast and automated solution for accurately resolving protein domain architectures

Corin Yeats, Oliver C. Redfern and Christine Orengo

in Bioinformatics

Volume 26, issue 6, pages 745-751
Published in print March 2010 | ISSN: 1367-4803
Published online January 2010 | e-ISSN: 1460-2059 | DOI: http://dx.doi.org/10.1093/bioinformatics/btq034
A fast and automated solution for accurately resolving protein domain architectures

More Like This

Show all results sharing this subject:

  • Bioinformatics and Computational Biology

GO

Show Summary Details

Preview

Motivation: Accurate prediction of the domain content and arrangement in multi-domain proteins (which make up >65% of the large-scale protein databases) provides a valuable tool for function prediction, comparative genomics and studies of molecular evolution. However, scanning a multi-domain protein against a database of domain sequence profiles can often produce conflicting and overlapping matches. We have developed a novel method that employs heaviest weighted clique-finding (HCF), which we show significantly outperforms standard published approaches based on successively assigning the best non-overlapping match (Best Match Cascade, BMC).

Results: We created benchmark data set of structural domain assignments in the CATH database and a corresponding set of Hidden Markov Model-based domain predictions. Using these, we demonstrate that by considering all possible combinations of matches using the HCF approach, we achieve much higher prediction accuracy than the standard BMC method. We also show that it is essential to allow overlapping domain matches to a query in order to identify correct domain assignments. Furthermore, we introduce a straightforward and effective protocol for resolving any overlapping assignments, and producing a single set of non-overlapping predicted domains.

Availability and implementation: The new approach will be used to determine MDAs for UniProt and Ensembl, and made available via the Gene3D website: http://gene3d.biochem.ucl.ac.uk/Gene3D/. The software has been implemented in C++ and compiled for Linux: source code and binaries can be found at: ftp://ftp.biochem.ucl.ac.uk/pub/gene3d_data/DomainFinder3/

Contact: yeats@biochem.ucl.ac.uk

Supplementary information: Supplementary data are available at Bioinformatics online.

Journal Article.  5379 words.  Illustrated.

Subjects: Bioinformatics and Computational Biology

Full text: subscription required

How to subscribe Recommend to my Librarian

Users without a subscription are not able to see the full content. Please, subscribe or login to access all content.