## Institut für Mathematik und Informatik

We apply the charge simulation method (CSM) in order to compute the logarithmic capacity of compact sets consisting of (infinitely) many “small” components. In this setting, a single charge point suffices for each component. The resulting method is therefore significantly more efficient than methods based on discretizations of the boundaries (for example, our own method presented in Liesen et al. (Comput. Methods Funct. Theory 17, 689–713, 2017)), while maintaining a very high level of accuracy. We study properties of the linear algebraic systems that arise in the CSM and show how these systems can be solved efficiently using preconditioned iterative methods, where the matrix-vector products are computed using the fast multipole method. We illustrate the use of the method on generalized Cantor sets and the Cantor dust.
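As a rough illustration of the CSM collocation setup (a minimal sketch, not the paper's implementation, which uses fast multipole matrix-vector products and preconditioned iterative solvers), the following toy example estimates logarithmic capacities of disks. For a single disk of radius r, one charge at the center is exact (cap = r); for two disks, one charge point per component already gives a good estimate:

```python
import numpy as np

def csm_capacity(charges, collocation, w=1e6):
    """Least-squares CSM sketch: find charges q_j and the Robin constant
    t = log(capacity) such that
        sum_j q_j * log|z_i - c_j| = t   at collocation points z_i,
        sum_j q_j = 1   (enforced as a heavily weighted extra row).
    """
    A = np.log(np.abs(collocation[:, None] - charges[None, :]))
    m, n = A.shape
    M = np.vstack([np.hstack([A, -np.ones((m, 1))]),
                   np.append(w * np.ones(n), 0.0)])
    rhs = np.append(np.zeros(m), w)
    sol, *_ = np.linalg.lstsq(M, rhs, rcond=None)
    return sol[:n], np.exp(sol[-1])

th = 2j * np.pi * np.arange(16) / 16
r = 0.25

# One disk of radius r centered at 0: a single charge at the center is
# exact, and the computed capacity equals r.
q1_, cap1 = csm_capacity(np.array([0.0 + 0j]), r * np.exp(th))

# Two disks of radius r centered at -1 and +1: one charge point per
# component, collocation points on both boundaries.
centers = np.array([-1.0 + 0j, 1.0 + 0j])
coll = np.concatenate([c + r * np.exp(th) for c in centers])
q2_, cap2 = csm_capacity(centers, coll)  # symmetric charges, cap near 0.707
```

The one-charge-per-component idea is exactly what makes the method cheap for sets with very many small components: the linear system has one row block per component instead of a fine boundary discretization.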

Background
The alignment of large numbers of protein sequences is a challenging task and its importance grows rapidly along with the size of biological datasets. State-of-the-art algorithms have a tendency to produce less accurate alignments with an increasing number of sequences. This is a fundamental problem since many downstream tasks rely on accurate alignments.
Results
We present learnMSA, a novel statistical learning approach for profile hidden Markov models (pHMMs) based on batch gradient descent. Fundamentally different from popular aligners, we fit a custom recurrent neural network architecture for (p)HMMs to potentially millions of sequences with respect to a maximum a posteriori objective and decode an alignment. We rely on automatic differentiation of the log-likelihood, and thus our approach differs from existing HMM training algorithms like Baum–Welch. Our method does not involve progressive, regressive, or divide-and-conquer heuristics. We use uniform batch sampling to adapt to large datasets in linear time without requiring a tree. When tested on ultra-large protein families with up to 3.5 million sequences, learnMSA is both more accurate and faster than state-of-the-art tools. On the established benchmarks HomFam and BaliFam with smaller sequence sets, it matches state-of-the-art performance. All experiments were done on a standard workstation with a GPU.
Conclusions
Our results show that learnMSA does not share the counterintuitive drawback of many popular heuristic aligners, which can substantially lose accuracy when many additional homologs are input. LearnMSA is a future-proof framework for large alignments with many opportunities for further improvements.
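The quantity a gradient-based HMM trainer differentiates is the log-likelihood produced by the forward algorithm. A minimal numpy sketch of that computation (a toy 2-state, 2-symbol HMM with made-up parameters, not learnMSA's pHMM architecture), checked against brute-force enumeration of all state paths:

```python
import numpy as np
from itertools import product

def forward_loglik(log_init, log_trans, log_emit, obs):
    """Log-likelihood of an observation sequence under an HMM, computed
    with the forward algorithm in log space (logsumexp recursion).
    This scalar is what automatic differentiation would backpropagate
    through in a gradient-based trainer."""
    alpha = log_init + log_emit[:, obs[0]]
    for o in obs[1:]:
        # alpha_j <- logsum_i (alpha_i + log_trans[i, j]) + log_emit[j, o]
        alpha = np.logaddexp.reduce(alpha[:, None] + log_trans, axis=0) \
                + log_emit[:, o]
    return np.logaddexp.reduce(alpha)

init = np.log([0.6, 0.4])                    # initial state distribution
trans = np.log([[0.7, 0.3], [0.2, 0.8]])     # transition matrix
emit = np.log([[0.9, 0.1], [0.3, 0.7]])      # emission matrix
obs = [0, 1, 1, 0]
ll = forward_loglik(init, trans, emit, obs)

# Brute-force check: sum the probability of every state path explicitly.
total = sum(
    np.exp(init[p[0]]
           + sum(trans[p[i - 1], p[i]] for i in range(1, len(p)))
           + sum(emit[p[i], obs[i]] for i in range(len(p))))
    for p in product([0, 1], repeat=len(obs)))
```

Running the recursion in log space is what keeps training stable for long sequences, since path probabilities underflow quickly in linear space.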

The classical Buscher rules describe T-duality for metrics and B-fields in a topologically trivial setting. On the other hand, topological T-duality addresses aspects of non-trivial topology while neglecting metrics and B-fields. In this article, we develop a new unifying framework for both aspects.
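For orientation, the classical Buscher rules in adapted coordinates (θ the isometry direction, i, j the remaining directions; sign conventions vary between references) read:

```latex
\begin{aligned}
\tilde g_{\theta\theta} &= \frac{1}{g_{\theta\theta}}, &
\tilde g_{\theta i} &= \frac{B_{\theta i}}{g_{\theta\theta}}, &
\tilde B_{\theta i} &= \frac{g_{\theta i}}{g_{\theta\theta}}, \\
\tilde g_{ij} &= g_{ij} - \frac{g_{\theta i}\,g_{\theta j} - B_{\theta i}\,B_{\theta j}}{g_{\theta\theta}}, &
\tilde B_{ij} &= B_{ij} - \frac{g_{\theta i}\,B_{\theta j} - B_{\theta i}\,g_{\theta j}}{g_{\theta\theta}},
\end{aligned}
```

together with the one-loop dilaton shift φ̃ = φ − ½ log g_{θθ}. These formulas only make sense when the circle bundle is trivial, which is the gap the article's unifying framework addresses.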

Convolutional Neural Network-based image classification models are the current state of the art for image classification problems. However, obtaining and using such a model for a specific image classification problem presents several challenges in practice. To train the model, we need good hyperparameter values, such as the initial model weights or the learning rate, and finding these values is usually a non-trivial process. Another problem is that the training data is often class-imbalanced in practice, which usually has a negative impact on model training. Challenges arise not only when obtaining a Convolutional Neural Network-based model, but also when using it after training. The model might then be applied to images drawn from a data distribution different from the one the training data was drawn from. Such images are typically referred to as out-of-distribution samples. Unfortunately, Convolutional Neural Network-based image classification models typically fail without warning to predict the correct class for out-of-distribution samples, which is problematic when such a model is used in safety-critical applications. In my work, I examined whether information from the layers of a Convolutional Neural Network-based image classification model (pixels and activations) can be used to address all of these issues. As a result, I suggest a method for initializing the model weights based on image patches, a method for balancing a class-imbalanced dataset based on layer activations, and a method for detecting out-of-distribution samples, which is also based on layer activations. To test the proposed methods, I conducted extensive experiments using different datasets.
My experiments showed that layer information (pixels and activations) can indeed be used to address all of the aforementioned challenges when training and using Convolutional Neural Network-based image classification models.
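One common way to turn layer activations into an out-of-distribution score, shown here only as an illustration (this is a generic Mahalanobis-distance sketch on synthetic features, not the thesis's specific method), is to fit per-class Gaussians with a shared covariance on penultimate-layer activations and flag inputs far from every class:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in "penultimate-layer activations" for two in-distribution
# classes; in practice a trained CNN would supply these features.
a0 = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(200, 2))
a1 = rng.normal(loc=[4.0, 4.0], scale=0.5, size=(200, 2))

mu = np.stack([a0.mean(0), a1.mean(0)])                  # class means
cov = np.cov(np.vstack([a0 - mu[0], a1 - mu[1]]).T)       # shared covariance
prec = np.linalg.inv(cov)

def ood_score(x):
    """Smallest class-wise Mahalanobis distance; large => likely OOD."""
    d = x - mu  # (2, 2): difference of x to each class mean
    return min(float(v @ prec @ v) for v in d)

in_dist = ood_score(np.array([0.2, -0.1]))    # close to class 0
out_dist = ood_score(np.array([10.0, -8.0]))  # far from both classes
```

Thresholding this score yields an abstain/accept decision, which is the kind of warning the abstract notes CNN classifiers otherwise do not give.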

Phylogenetic (i.e., leaf-labeled) trees play a fundamental role in evolutionary research. A typical problem is to reconstruct such trees from data like DNA alignments (whose columns are often referred to as characters), and a simple optimization criterion for such reconstructions is maximum parsimony. It is generally assumed that this criterion works well for data in which state changes are rare. In the present manuscript, we prove that each binary phylogenetic tree T with n ≥ 20k leaves is uniquely defined by the set Ak (T), which consists of all characters with parsimony score k on T. This can be considered as a promising first step toward showing that maximum parsimony as a tree reconstruction criterion is justified when the number of changes in the data is relatively small.
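The parsimony score of a single character on a tree can be computed with Fitch's algorithm; a minimal sketch for rooted binary trees given as nested tuples (an illustration of the score used above, not the manuscript's proof machinery):

```python
# Fitch's algorithm: the minimum number of state changes (parsimony
# score) needed to realize a leaf labeling on a rooted binary tree.
# A tree is a nested tuple of leaf names; `character` maps each leaf
# name to its state.
def fitch(tree, character):
    score = 0
    def sets(node):
        nonlocal score
        if isinstance(node, str):            # leaf: singleton state set
            return {character[node]}
        left, right = (sets(child) for child in node)
        if left & right:
            return left & right              # intersection: no change here
        score += 1                            # disjoint: one extra change
        return left | right
    sets(tree)
    return score

# On ((A,B),(C,D)), the character A=C=0, B=D=1 needs two changes.
tree = (("A", "B"), ("C", "D"))
score = fitch(tree, {"A": "0", "B": "1", "C": "0", "D": "1"})
```

In the notation of the abstract, A_k(T) collects exactly those characters for which this score equals k on the tree T.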

We consider Walsh’s conformal map from the exterior of a compact set E ⊆ ℂ onto a lemniscatic domain. If E is simply connected, the lemniscatic domain is the exterior of a circle, while if E has several components, the lemniscatic domain is the exterior of a generalized lemniscate and is determined by the logarithmic capacity of E and by the exponents and centers of the generalized lemniscate. For general E, we characterize the exponents in terms of the Green’s function of E^c. Under additional symmetry conditions on E, we also locate the centers of the lemniscatic domain. For polynomial pre-images E = P^{-1}(Ω) of a simply connected infinite compact set Ω, we explicitly determine the exponents in the lemniscatic domain and derive a set of equations to determine the centers of the lemniscatic domain. Finally, we present several examples where we explicitly obtain the exponents and centers of the lemniscatic domain, as well as the conformal map.
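For context, a minimal statement of Walsh's theorem for a compact set E of positive capacity with n components: there is a conformal map

```latex
\Phi : \widehat{\mathbb{C}} \setminus E \to L, \qquad
L = \Bigl\{\, w \in \widehat{\mathbb{C}} :
      \prod_{j=1}^{n} |w - a_j|^{m_j} > \mu \,\Bigr\},
\qquad m_j > 0, \quad \sum_{j=1}^{n} m_j = 1,
```

normalized by Φ(z) = z + O(1/z) as z → ∞, where μ = cap(E). The a_j and m_j are the centers and exponents of the lemniscatic domain referred to in the abstract.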

Background
The Earth Biogenome Project has rapidly increased the number of available eukaryotic genomes, but most released genomes continue to lack annotation of protein-coding genes. In addition, no transcriptome data is available for some genomes.
Results
Various gene annotation tools have been developed but each has its limitations. Here, we introduce GALBA, a fully automated pipeline that utilizes miniprot, a rapid protein-to-genome aligner, in combination with AUGUSTUS to predict genes with high accuracy. Accuracy results indicate that GALBA is particularly strong in the annotation of large vertebrate genomes. We also present use cases in insects, vertebrates, and a land plant. GALBA is fully open source and available as a docker image for easy execution with Singularity in high-performance computing environments.
Conclusions
Our pipeline addresses the critical need for accurate gene annotation in newly sequenced genomes, and we believe that GALBA will greatly facilitate genome annotation for diverse organisms.

Background
Seed finding is an important initial phase of arguably most homology search and alignment methods, such as those required for genome alignments. The seed finding step is crucial for curbing the runtime, as potential alignments are restricted to and anchored at the sequence position pairs that constitute the seeds. To identify seeds, it is good practice to use sets of spaced seed patterns, a method that locally compares two sequences and requires exact matches at certain positions only.
Results
We introduce a new method for filtering alignment seeds that we call geometric hashing. Geometric hashing achieves a high specificity by combining non-local information from different seeds using a simple hash function that requires only a constant and small amount of additional time per spaced seed. Geometric hashing was tested on the task of finding homologous positions in the coding regions of human and mouse genome sequences. In this test, the number of false positives decreased about a million-fold compared to sets of spaced seeds alone, while a very high sensitivity was maintained.
Conclusions
An additional geometric hashing filtering phase could improve the runtime, the accuracy, or both for programs that address various homology search and alignment tasks.
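Spaced-seed matching, the input that a filter like geometric hashing would then refine, can be sketched in a few lines (an illustration of spaced seeds in general, not the paper's geometric-hashing filter): a pattern such as "11011" demands equality at the `1` ("care") positions only, so windows may still match despite mismatches at don't-care positions.

```python
from collections import defaultdict

def spaced_seed_matches(s1, s2, pattern="11011"):
    """All position pairs (i, j) where s1 and s2 agree at every
    'care' (=1) position of the spaced-seed pattern."""
    care = [k for k, c in enumerate(pattern) if c == "1"]
    w = len(pattern)
    index = defaultdict(list)          # hash the spaced keys of s2
    for j in range(len(s2) - w + 1):
        index[tuple(s2[j + k] for k in care)].append(j)
    return [(i, j)
            for i in range(len(s1) - w + 1)
            for j in index[tuple(s1[i + k] for k in care)]]

# ACGTA vs ACCTA differ at position 2, a don't-care slot of "11011",
# so the windows at (0, 0) still form a seed.
matches = spaced_seed_matches("ACGTACGT", "ACCTACGA")
```

Each such (i, j) pair is a candidate anchor; a subsequent filtering phase decides which candidates are worth extending into alignments.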

Statistical Methods and Applications for Biomarker Discovery Using Large Scale Omics Data Set
(2023)

This thesis focuses on identifying genetic factors associated with human kidney disease progression; three articles are presented. Article I describes the identification of loci associated with the urinary albumin-to-creatinine ratio (UACR) through trans-ethnic, European-ancestry-specific, and diabetes-specific meta-analyses. An approximate conditional analysis was performed to identify additional independent UACR-associated variants within the identified loci. The genome-wide significance level of α = 5×10⁻⁸ was used for both the primary GWAS association and the conditional analyses. However, unlike primary association tests, conditional tests are limited to specific genomic regions surrounding the primary GWAS index signals rather than being applied on a genome-wide scale.
In article II, we hypothesized that the application of α = 5×10⁻⁸ is overly strict and results in a loss of power. To address this issue, we developed a quasi-adaptive method within a weighted hypothesis testing framework. This method spends the overall type I error budget (α = 0.05) by assigning less conservative SNP-specific α-thresholds for selecting secondary signals in the conditional analysis. Through simulation studies and power analyses, we demonstrate that the quasi-adaptive method outperforms the established criterion α = 5×10⁻⁸ as well as the equal weighting scheme (the Šidák correction). Furthermore, our method performs well when applied to real datasets and can potentially reveal previously undetected secondary signals in existing data.
In article III, we extended the quasi-adaptive method to identify plausible multiple independent signals at each locus (a secondary signal, a tertiary signal, a quaternary signal, and beyond) and applied it to a publicly available GWAS meta-analysis to detect additional independent eGFR-associated signals. The improved quasi-adaptive method successfully identified additional novel replicated independent SNPs that would have gone undetected under the overly conservative genome-wide significance level of α = 5×10⁻⁸. Colocalization analysis based on the novel independent signals identified potentially functional genes across the kidney and other tissues.
Overall, these articles contribute to the understanding of genetic factors associated with human kidney disease progression and provide novel methods for identifying secondary and multiple independent signals in conditional GWAS analyses.
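The core idea of weighted hypothesis testing can be sketched very simply (a generic weighted-Bonferroni illustration, not the thesis's quasi-adaptive weighting scheme): the overall type I error α = 0.05 is split across m SNPs with nonnegative weights averaging to 1, so SNPs with more prior evidence receive less conservative thresholds while the total error budget is unchanged.

```python
import numpy as np

alpha, m = 0.05, 1000
weights = np.ones(m)
weights[:10] = 50.0            # hypothetical prior evidence for 10 SNPs
weights *= m / weights.sum()   # renormalize: weights average to 1

# Weighted Bonferroni: SNP-specific thresholds that sum to alpha.
thresholds = weights * alpha / m

# Equal-weight Sidak threshold, the comparison scheme from article II.
sidak = 1 - (1 - alpha) ** (1 / m)
```

With these made-up weights, the 10 up-weighted SNPs get thresholds well above the uniform Šidák cutoff, the rest slightly below it, and the thresholds still sum to α, so the family-wise error budget is respected.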