Journal Article

Reproducing the manual annotation of multiple sequence alignments using a SVM classifier

Christian Blouin, Scott Perry, Allan Lavell, Edward Susko and Andrew J. Roger

in Bioinformatics

Volume 25, issue 23, pages 3093-3098
Published in print December 2009 | ISSN: 1367-4803
Published online September 2009 | e-ISSN: 1460-2059 | DOI:



Motivation: Aligning protein sequences with the best possible accuracy requires sophisticated algorithms. Since the optimal alignment is not guaranteed to be the correct one, even the best alignment is expected to contain sites that do not respect the assumption of positional homology. Because formulating rules to identify these sites is difficult, it is common practice to remove them manually. Although considered necessary in some cases, manual editing is time-consuming and not reproducible. We present here an automated editing method based on the classification of ‘valid’ and ‘invalid’ sites.

Results: A support vector machine (SVM) classifier is trained to reproduce the decisions made during manual editing with an accuracy of 95.0%. This implies that manual editing can be made reproducible and applied to large-scale analyses. We further demonstrate that it is possible to retrain the classifier, or extend its training, by providing examples of multiple sequence alignment (MSA) annotation. Near-optimal training can be achieved with only 1000 annotated sites, or roughly three samples of protein sequence alignments.
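
As an illustration of the idea of classifying alignment sites as ‘valid’ or ‘invalid’, the sketch below trains a small linear soft-margin SVM on hand-labelled alignment columns. It is not the MANUEL implementation: the two features (gap fraction and share of the most common residue), the toy annotated columns, the function names, and the plain subgradient-descent trainer are all assumptions made for this example, since the abstract does not describe the features or the SVM configuration actually used.

```python
from collections import Counter


def column_features(column):
    """Map one alignment column (one character per sequence) to two toy features.

    Assumed features, not taken from the paper: the fraction of gaps and
    the share of sequences carrying the most common residue.
    """
    n = len(column)
    residues = [c for c in column if c != "-"]
    gap_fraction = (n - len(residues)) / n
    top = Counter(residues).most_common(1)
    conservation = top[0][1] / n if top else 0.0  # 0.0 for an all-gap column
    return (gap_fraction, conservation)


def train_linear_svm(samples, labels, lam=0.01, lr=0.05, epochs=4000):
    """Minimise L2-regularised hinge loss by full-batch subgradient descent."""
    w = [0.0, 0.0]
    b = 0.0
    n = len(samples)
    for _ in range(epochs):
        grad_w = [lam * w[0], lam * w[1]]
        grad_b = 0.0
        for x, y in zip(samples, labels):
            margin = y * (w[0] * x[0] + w[1] * x[1] + b)
            if margin < 1.0:  # only margin violators contribute to the hinge term
                grad_w[0] -= y * x[0] / n
                grad_w[1] -= y * x[1] / n
                grad_b -= y / n
        w = [w[0] - lr * grad_w[0], w[1] - lr * grad_w[1]]
        b -= lr * grad_b
    return w, b


def classify(column, w, b):
    """Label a column 'valid' or 'invalid' from the sign of the decision function."""
    x = column_features(column)
    score = w[0] * x[0] + w[1] * x[1] + b
    return "valid" if score >= 0 else "invalid"


# Hypothetical manual annotation: +1 = site kept by the editor, -1 = site removed.
ANNOTATED = [
    ("AAAA", 1), ("LLLL", 1), ("GGGA", 1), ("KKKR", 1),
    ("A--T", -1), ("-G--", -1), ("L-KD", -1), ("--P-", -1),
]

X = [column_features(col) for col, _ in ANNOTATED]
Y = [label for _, label in ANNOTATED]
W, B = train_linear_svm(X, Y)

for col in ("VVVV", "-A-S"):
    print(col, classify(col, W, B))
```

Retraining on a different editor's decisions would, under these assumptions, amount to replacing the annotated list and calling the trainer again. A real application would need far richer features and many more annotated sites than this toy set.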

Availability: This method is implemented in the software MANUEL, licensed under the GPL. A web-based application for single and batch jobs is available at


Supplementary information: Supplementary data are available at Bioinformatics online.

Journal Article.  4057 words.  Illustrated.

Subjects: Bioinformatics and Computational Biology
