Please use this link to cite or link to this document: https://nbn-resolving.org/urn:nbn:de:gbv:9-opus-140440

Improvement of Eukaryotic Genome Annotation Methods with Extrinsic Evidence and Deep Learning

  • Accurate annotation of gene structures in eukaryotic genomes is important for understanding gene function and evolution, yet many eukaryotic genomes remain incompletely or inaccurately annotated. Large-scale initiatives such as the Earth BioGenome Project aim to sequence, catalog and characterize the genomes of all eukaryotic species over the coming years. This requires highly scalable and accurate genome annotation methods. One widely used tool for automated annotation of gene structures in eukaryotic genomes is the BRAKER pipeline, which has accumulated over 2300 citations since 2020. This cumulative dissertation presents methods that advance the state of the art in genome annotation by (A) integrating RNA-Seq and protein homology evidence within the BRAKER pipeline and (B) using deep learning to improve on shallow Hidden Markov Model (HMM)-based gene prediction methods. In the course of this thesis, three novel tools were developed, each of which improved on the state-of-the-art accuracy for gene prediction at the time of its publication. Firstly, the combiner tool TSEBRA is presented, which generates BRAKER annotations based on both RNA-Seq and protein evidence. Previously, BRAKER could use either RNA-Seq or protein evidence in a single run, but not both. TSEBRA merges the separate RNA-Seq-based (BRAKER1) and protein-based (BRAKER2) predictions into a combined gene set that is more accurate than either input. Secondly, the subsequently developed BRAKER3 pipeline is presented, which uses both RNA-Seq and protein evidence for training and prediction within BRAKER. BRAKER3 achieves state-of-the-art results across diverse eukaryotes and, in particular, significantly improves the annotation of large and complex genomes compared with its predecessors. Thirdly, Tiberius, a novel deep learning ab initio gene prediction method, is presented. It builds on recent advances by Holst et al. in 2023, which demonstrated that deep learning can improve ab initio gene prediction accuracy beyond the shallow HMM-based tools that had been state of the art for over 25 years. Tiberius combines sequence-to-sequence layers with a differentiable HMM, which allows end-to-end training. Benchmarking shows that Tiberius is more accurate than conventional ab initio tools such as AUGUSTUS and, notably, exceeds the accuracy of BRAKER3 on mammalian genomes without requiring extrinsic data such as RNA-Seq. Finally, the biological research applications and relevance of the developed tools are showcased with the results of the genome projects of the ground cherry (Prunus fruticosa), sour cherry (Prunus cerasus) and sunflower sea star (Pycnopodia helianthoides). In conclusion, BRAKER3 is currently the most accurate general-purpose genome annotation pipeline available and is suitable for a wide range of genome annotation projects. In addition, the development of Tiberius represents a paradigm shift towards deep learning methods for gene prediction, which have the potential to replace traditional shallow HMM-based approaches in genome annotation.

Metadata
Author: Lars Gabriel
URN:urn:nbn:de:gbv:9-opus-140440
Additional Title (German):Verbesserung eukaryotischer Genomeannotationsmethoden mit Hilfe von extrinsischer Informationen und Deep Learning
Referee:Prof. Dr. Mario Stanke, Prof. Dr. Giulio Formenti, Prof. Dr. Jill Wegrzyn
Advisor:Prof. Dr. Mario Stanke
Document Type:Doctoral Thesis
Language:English
Year of Completion:2025
Date of first Publication:2025/11/12
Granting Institution:Universität Greifswald, Mathematisch-Naturwissenschaftliche Fakultät
Date of final exam:2025/10/09
Release Date:2025/11/12
GND Keyword:Bioinformatik (bioinformatics); Maschinelles Lernen (machine learning)
Page Number:213
Faculties:Mathematisch-Naturwissenschaftliche Fakultät / Institut für Mathematik und Informatik
DDC class:500 Natural sciences and mathematics / 500 Natural sciences