learnMSA: learning and aligning large protein families

Becker, Felix; Stanke, Mario

doi:10.1093/gigascience/giac104


Please use this link when citing or linking to this document: https://nbn-resolving.org/urn:nbn:de:gbv:9-opus-109706


Background: The alignment of large numbers of protein sequences is a challenging task and its importance grows rapidly along with the size of biological datasets. State-of-the-art algorithms have a tendency to produce less accurate alignments with an increasing number of sequences. This is a fundamental problem since many downstream tasks rely on accurate alignments.

Results: We present learnMSA, a novel statistical learning approach of profile hidden Markov models (pHMMs) based on batch gradient descent. Fundamentally different from popular aligners, we fit a custom recurrent neural network architecture for (p)HMMs to potentially millions of sequences with respect to a maximum a posteriori objective and decode an alignment. We rely on automatic differentiation of the log-likelihood, and thus, our approach is different from existing HMM training algorithms like Baum–Welch. Our method does not involve progressive, regressive, or divide-and-conquer heuristics. We use uniform batch sampling to adapt to large datasets in linear time without the requirement of a tree. When tested on ultra-large protein families with up to 3.5 million sequences, learnMSA is both more accurate and faster than state-of-the-art tools. On the established benchmarks HomFam and BaliFam with smaller sequence sets, it matches state-of-the-art performance. All experiments were done on a standard workstation with a GPU.

Conclusions: Our results show that learnMSA does not share the counterintuitive drawback of many popular heuristic aligners, which can substantially lose accuracy when many additional homologs are input. LearnMSA is a future-proof framework for large alignments with many opportunities for further improvements.
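The quantity the abstract describes as being differentiated, the HMM log-likelihood, can be illustrated with a minimal sketch. The code below is not learnMSA's implementation: it uses a hypothetical two-state, two-symbol toy HMM rather than a profile HMM, all function names and parameter values are assumptions chosen for illustration, and it only evaluates the objective on a uniformly sampled batch. In an automatic differentiation framework, a function of this shape would be differentiated with respect to the parameters instead of being fitted with Baum–Welch.

```python
import math
import random


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) over a list of log-values."""
    m = max(values)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(v - m) for v in values))


def forward_log_likelihood(seq, log_init, log_trans, log_emit):
    """Log-likelihood of one non-empty sequence via the forward recursion."""
    n_states = len(log_init)
    # alpha[j] = log P(x_1..x_t, state_t = j)
    alpha = [log_init[j] + log_emit[j][seq[0]] for j in range(n_states)]
    for symbol in seq[1:]:
        alpha = [
            logsumexp([alpha[i] + log_trans[i][j] for i in range(n_states)])
            + log_emit[j][symbol]
            for j in range(n_states)
        ]
    return logsumexp(alpha)


def batch_log_likelihood(sequences, batch_size, params, rng):
    """Mean log-likelihood over a batch drawn uniformly without replacement."""
    batch = rng.sample(sequences, batch_size)
    return sum(forward_log_likelihood(s, *params) for s in batch) / batch_size


# Hypothetical toy parameters (assumption): 2 hidden states, alphabet {0, 1}.
log = math.log
LOG_INIT = [log(0.5), log(0.5)]
LOG_TRANS = [[log(0.9), log(0.1)], [log(0.2), log(0.8)]]
LOG_EMIT = [[log(0.7), log(0.3)], [log(0.1), log(0.9)]]
PARAMS = (LOG_INIT, LOG_TRANS, LOG_EMIT)

# Toy "dataset" of integer-encoded sequences (assumption).
SEQUENCES = [[0, 0, 1], [1, 1, 1, 0], [0, 1], [1, 0, 0, 0, 1]]

if __name__ == "__main__":
    rng = random.Random(0)
    print(batch_log_likelihood(SEQUENCES, 2, PARAMS, rng))
```

Because each step touches only a fixed-size batch, the cost per update does not depend on the total number of sequences, which is the property that uniform batch sampling exploits for very large families.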

Metadata
Author:	Felix Becker, Mario Stanke
URN:	urn:nbn:de:gbv:9-opus-109706
DOI:	https://doi.org/10.1093/gigascience/giac104
ISSN:	2047-217X
Parent Title (English):	GigaScience
Publisher:	Oxford University Press
Place of publication:	Oxford
Document Type:	Article
Language:	English
Date of first Publication:	2022/11/18
Release Date:	2024/04/11
Tag:	machine learning; multiple sequence alignment; profile hidden Markov model
Volume:	11
Article Number:	giac104
Page Number:	14
Faculties:	Mathematisch-Naturwissenschaftliche Fakultät / Institut für Mathematik und Informatik
Collections:	Other articles eligible for DFG funding
Licence:	Creative Commons Attribution 4.0 International

Open Access
