Institut für Mathematik und Informatik
Background
The alignment of large numbers of protein sequences is a challenging task and its importance grows rapidly along with the size of biological datasets. State-of-the-art algorithms have a tendency to produce less accurate alignments with an increasing number of sequences. This is a fundamental problem since many downstream tasks rely on accurate alignments.
Results
We present learnMSA, a novel statistical learning approach of profile hidden Markov models (pHMMs) based on batch gradient descent. Fundamentally different from popular aligners, we fit a custom recurrent neural network architecture for (p)HMMs to potentially millions of sequences with respect to a maximum a posteriori objective and decode an alignment. We rely on automatic differentiation of the log-likelihood, and thus, our approach is different from existing HMM training algorithms like Baum–Welch. Our method does not involve progressive, regressive, or divide-and-conquer heuristics. We use uniform batch sampling to adapt to large datasets in linear time without the requirement of a tree. When tested on ultra-large protein families with up to 3.5 million sequences, learnMSA is both more accurate and faster than state-of-the-art tools. On the established benchmarks HomFam and BaliFam with smaller sequence sets, it matches state-of-the-art performance. All experiments were done on a standard workstation with a GPU.
Conclusions
Our results show that learnMSA does not share the counterintuitive drawback of many popular heuristic aligners, which can substantially lose accuracy when many additional homologs are input. LearnMSA is a future-proof framework for large alignments with many opportunities for further improvements.
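The log-likelihood that the abstract describes as the target of automatic differentiation can be illustrated with the standard forward recursion. The following is a minimal sketch for a generic discrete HMM in log space, not the learnMSA architecture itself; the function names and the two-state toy model are assumptions made for illustration only.

```python
import math

def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    if m == -math.inf:
        return -math.inf
    return m + math.log(sum(math.exp(v - m) for v in values))

def forward_log_likelihood(obs, log_init, log_trans, log_emit):
    """Log-likelihood of an observation sequence under a discrete HMM.

    obs       -- non-empty list of symbol indices
    log_init  -- log_init[s]: log-probability of starting in state s
    log_trans -- log_trans[r][s]: log-probability of moving from r to s
    log_emit  -- log_emit[s][x]: log-probability that state s emits symbol x
    """
    n_states = len(log_init)
    # Initialisation: start in state s and emit the first symbol.
    alpha = [log_init[s] + log_emit[s][obs[0]] for s in range(n_states)]
    # Recursion: sum over all predecessor states, then emit the next symbol.
    for x in obs[1:]:
        alpha = [
            logsumexp([alpha[r] + log_trans[r][s] for r in range(n_states)])
            + log_emit[s][x]
            for s in range(n_states)
        ]
    # Termination: sum over all final states.
    return logsumexp(alpha)

# Hypothetical two-state, two-symbol toy model (values chosen arbitrarily).
toy_init = [math.log(p) for p in (0.6, 0.4)]
toy_trans = [[math.log(p) for p in row] for row in ((0.7, 0.3), (0.4, 0.6))]
toy_emit = [[math.log(p) for p in row] for row in ((0.9, 0.1), (0.2, 0.8))]
toy_loglik = forward_log_likelihood([0, 1, 1], toy_init, toy_trans, toy_emit)
```

Because every step is a differentiable composition of sums, products, and logarithms, a recursion of this form can be handed to an automatic differentiation framework, which is the general idea the abstract contrasts with Baum-Welch training.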
The classical Buscher rules describe T-duality for metrics and B-fields in a topologically trivial setting. On the other hand, topological T-duality addresses aspects of non-trivial topology while neglecting metrics and B-fields. In this article, we develop a new unifying framework for both aspects.
Convolutional Neural Network-based image classification models are the current state-of-the-art for solving image classification problems. However, obtaining and using such a model to solve a specific image classification problem presents several challenges in practice. To train the model, we need to find good hyperparameter values for training, such as initial model weights or learning rate. However, finding these values is usually a non-trivial process. Another problem is that the training data used for model training is often class-imbalanced in practice. This usually has a negative impact on model training. However, not only is it challenging to obtain a Convolutional Neural Network-based model, but also to use the model after model training. After training, the model might be applied to images that were drawn from a data distribution that is different from the data distribution the training data was drawn from. These images are typically referred to as out-of-distribution samples. Unfortunately, Convolutional Neural Network-based image classification models typically fail to predict the correct class for out-of-distribution samples without warning, which is problematic when such a model is used for safety-critical applications. In my work, I examined whether information from the layers of a Convolutional Neural Network-based image classification model (pixels and activations) can be used to address all of these issues. As a result, I suggest a method for initializing the model weights based on image patches, a method for balancing a class-imbalanced dataset based on layer activations, and a method for detecting out-of-distribution samples, which is also based on layer activations. To test the proposed methods, I conducted extensive experiments using different datasets. 
My experiments showed that layer information (pixels and activations) can indeed be used to address all of the aforementioned challenges when training and using Convolutional Neural Network-based image classification models.
Phylogenetic (i.e., leaf-labeled) trees play a fundamental role in evolutionary research. A typical problem is to reconstruct such trees from data like DNA alignments (whose columns are often referred to as characters), and a simple optimization criterion for such reconstructions is maximum parsimony. It is generally assumed that this criterion works well for data in which state changes are rare. In the present manuscript, we prove that each binary phylogenetic tree T with n ≥ 20k leaves is uniquely defined by the set A_k(T), which consists of all characters with parsimony score k on T. This can be considered a promising first step toward showing that maximum parsimony as a tree reconstruction criterion is justified when the number of changes in the data is relatively small.
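The parsimony score of a single character, which decides whether that character belongs to the set of characters with score k, can be computed with Fitch's algorithm. The sketch below assumes a rooted binary tree encoded as nested pairs of leaf names; this encoding and the function name are illustrative choices, not taken from the manuscript. The score does not depend on where an unrooted tree is rooted.

```python
def parsimony_score(tree, character):
    """Fitch parsimony score of one character on a rooted binary tree.

    tree      -- a leaf name (str) or a pair (left_subtree, right_subtree)
    character -- dict mapping each leaf name to its state
    """
    changes = 0

    def state_set(node):
        nonlocal changes
        if isinstance(node, str):
            return {character[node]}
        left, right = (state_set(child) for child in node)
        common = left & right
        if common:
            return common
        changes += 1  # disjoint child sets force one state change
        return left | right

    state_set(tree)
    return changes

# Hypothetical four-leaf tree ((a,b),(c,d)) and two binary characters.
toy_tree = (("a", "b"), ("c", "d"))
score_split = parsimony_score(toy_tree, {"a": 0, "b": 0, "c": 1, "d": 1})
score_mixed = parsimony_score(toy_tree, {"a": 0, "b": 1, "c": 0, "d": 1})
```

In this toy case the first character agrees with the split between the two cherries and needs one change, while the second conflicts with it and needs two.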
We consider Walsh’s conformal map from the exterior of a compact set E ⊆ C onto a lemniscatic domain. If E is simply connected, the lemniscatic domain is the exterior of a circle, while if E has several components, the lemniscatic domain is the exterior of a generalized lemniscate and is determined by the logarithmic capacity of E and by the exponents and centers of the generalized lemniscate. For general E, we characterize the exponents in terms of the Green’s function of E^c. Under additional symmetry conditions on E, we also locate the centers of the lemniscatic domain. For polynomial pre-images E = P^{-1}(Ω) of a simply-connected infinite compact set Ω, we explicitly determine the exponents in the lemniscatic domain and derive a set of equations to determine the centers of the lemniscatic domain. Finally, we present several examples where we explicitly obtain the exponents and centers of the lemniscatic domain, as well as the conformal map.
Background
The Earth Biogenome Project has rapidly increased the number of available eukaryotic genomes, but most released genomes continue to lack annotation of protein-coding genes. In addition, no transcriptome data is available for some genomes.
Results
Various gene annotation tools have been developed but each has its limitations. Here, we introduce GALBA, a fully automated pipeline that utilizes miniprot, a rapid protein-to-genome aligner, in combination with AUGUSTUS to predict genes with high accuracy. Accuracy results indicate that GALBA is particularly strong in the annotation of large vertebrate genomes. We also present use cases in insects, vertebrates, and a land plant. GALBA is fully open source and available as a docker image for easy execution with Singularity in high-performance computing environments.
Conclusions
Our pipeline addresses the critical need for accurate gene annotation in newly sequenced genomes, and we believe that GALBA will greatly facilitate genome annotation for diverse organisms.
Background
Seed finding is an important initial phase of arguably most homology search and alignment methods, such as those required for genome alignments. The seed finding step is crucial to curb the runtime, as potential alignments are restricted to and anchored at the sequence position pairs that constitute the seed. To identify seeds, it is good practice to use sets of spaced seed patterns, a method that locally compares two sequences and requires exact matches at certain positions only.
Results
We introduce a new method for filtering alignment seeds that we call geometric hashing. Geometric hashing achieves a high specificity by combining non-local information from different seeds using a simple hash function that only requires a constant and small amount of additional time per spaced seed. Geometric hashing was tested on the task of finding homologous positions in the coding regions of human and mouse genome sequences. In this test, the number of false positives was decreased about a million-fold relative to sets of spaced seeds while maintaining a very high sensitivity.
Conclusions
An additional geometric hashing filtering phase could improve the run-time, accuracy or both of programs for various homology-search-and-align tasks.
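The spaced seed step that geometric hashing filters can be sketched as follows. This example covers only the seed finding described in the Background: a pattern of '1' (must match) and '0' (ignored) positions is slid over both sequences, and position pairs that agree at all '1' positions are reported. The function names, the pattern, and the toy sequences are assumptions; the geometric hashing filter itself is not shown because the abstract does not specify its hash function.

```python
from collections import defaultdict

def spaced_kmers(seq, pattern):
    """Yield (position, key) for every window of seq under a spaced seed pattern.

    pattern -- string of '1' (must match) and '0' (ignored) characters
    """
    care = [i for i, c in enumerate(pattern) if c == "1"]
    for pos in range(len(seq) - len(pattern) + 1):
        yield pos, "".join(seq[pos + i] for i in care)

def seed_hits(seq_a, seq_b, pattern):
    """Return all position pairs (i, j) where seq_a and seq_b agree at the '1' positions."""
    index = defaultdict(list)
    for pos, key in spaced_kmers(seq_a, pattern):
        index[key].append(pos)
    return [
        (i, j)
        for j, key in spaced_kmers(seq_b, pattern)
        for i in index.get(key, [])
    ]

# Hypothetical toy input: the pattern ignores the third position of each window.
toy_hits = seed_hits("ACGTACGT", "ACCTACGA", "1101")
```

Here the pair (0, 0) is reported although the two windows differ at their third position, which is the tolerance to mismatches that distinguishes spaced seeds from contiguous k-mers.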
Statistical Methods and Applications for Biomarker Discovery Using Large Scale Omics Data Set
(2023)
This thesis focuses on identifying genetic factors associated with human kidney disease progression, with three articles presented. Article I describes the identification of loci associated with the urinary albumin-to-creatinine ratio (UACR) through trans-ethnic, European-ancestry-specific, and diabetes-specific meta-analyses. An approximate conditional analysis was performed to identify additional independent UACR-associated variants within identified loci. The genome-wide significance level of α = 5×10⁻⁸ is used for both primary GWAS association and conditional analyses. However, unlike primary association tests, conditional tests are limited to specific genomic regions surrounding primary GWAS index signals rather than being applied on a genome-wide scale.
In article II, we hypothesized that the application of α = 5×10⁻⁸ is overly strict and results in a loss of power. To address this issue, we developed a quasi-adaptive method within a weighted hypothesis testing framework. This method exploits the type I error level (α = 0.05) by providing less conservative SNP-specific α-thresholds to select secondary signals in conditional analysis. Through simulation studies and power analyses, we demonstrate that the quasi-adaptive method outperforms the established criterion α = 5×10⁻⁸ as well as the equal weighting scheme (the Šidák correction). Furthermore, our method performs well when applied to real datasets and can potentially reveal previously undetected secondary signals in existing data.
In article III, we extended our quasi-adaptive method to identify plausible multiple independent signals at each locus (a secondary signal, a tertiary signal, a fourth signal, and beyond) and applied it to a publicly available GWAS meta-analysis to detect additional independent eGFR-associated signals. The improved quasi-adaptive method successfully identified additional novel replicated independent SNPs that would have gone undetected under the overly conservative genome-wide significance level of α = 5×10⁻⁸. Colocalization analysis based on the novel independent signals identified potentially functional genes across the kidney and other tissues.
Overall, these articles contribute to the understanding of genetic factors associated with human kidney disease progression and provide novel methods for identifying secondary and multiple independent signals in conditional GWAS analyses.
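The equal weighting scheme mentioned for article II, the Šidák correction, and one common way to generalize it to unequal weights can be written down in a few lines. This is a sketch under stated assumptions: the weighted form below is a generic weighted Šidák rule for independent tests, not the thesis's quasi-adaptive weighting, and the number of 200 SNPs in a region is a hypothetical value.

```python
def sidak_threshold(alpha, m):
    """Per-test threshold that keeps the family-wise error rate at alpha
    over m independent tests (equal weighting)."""
    return 1.0 - (1.0 - alpha) ** (1.0 / m)

def weighted_sidak_thresholds(alpha, weights):
    """Per-test thresholds for non-negative weights that sum to 1.

    Assumed generalization: alpha_i = 1 - (1 - alpha) ** w_i, so that the
    product of (1 - alpha_i) over all tests equals 1 - alpha.
    """
    total = sum(weights)
    if abs(total - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return [1.0 - (1.0 - alpha) ** w for w in weights]

# Hypothetical region with 200 SNPs tested in a conditional analysis.
regional_threshold = sidak_threshold(0.05, 200)
genome_wide_threshold = 5e-8
```

With these illustrative numbers the regional threshold is on the order of 10⁻⁴, several orders of magnitude less strict than 5×10⁻⁸, which illustrates why restricting the correction to the tested region can recover power.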
Gram-negative bacteria secrete lipopolysaccharides (LPS), which trigger a host immune response involving the secretion of proinflammatory cytokines. These cytokines include TNF-α and IFN-γ, which induce the production of indoleamine 2,3-dioxygenase (IDO). IDO production is increased during severe sepsis and septic shock, and high IDO levels are associated with increased mortality. IDO catalyzes the degradation of tryptophan (TRP) to kynurenine (KYN) along the kynurenine pathway (KP). KYN is further degraded to kynurenic acid (KYNA). Increased IDO levels are accompanied by increased levels of KYNA, which are associated with immunoparalysis.
Owing to its central role, the KP is a potential target for therapeutic intervention. The degradation of TRP to KYN by IDO has been targeted with 1-methyltryptophan (1-MT), which is assumed to inhibit IDO. The survival of septic mice treated with 1-MT increased compared to mice not treated with 1-MT. The levels of downstream metabolites such as KYN and KYNA were expected to decrease. Surprisingly, an increase in KYNA after 1-MT administration was reported in healthy mice and pigs. These unexpected metabolite alterations after 1-MT administration, and the mode of action of 1-MT, have not been the focus of recent research. Hence, there is no explanation for why KYNA increased while KYN did not change.
This thesis aims to postulate a possible degradation pathway of 1-MT along the KP with the help of ordinary differential equation (ODE) systems. In addition, the developed ODE models were used to assess the ability of 1-MT to inhibit IDO in vivo. To this end, multiple ODE models were developed, including a model of the KP and extensions for LPS administration and for 1-MT administration. Moreover, seven ODE models were developed, each considering a possible degradation pathway of 1-MT. The most likely degradation pathway was combined with the ODE model of LPS administration, including the inhibitory effects of 1-MT.
These models consist of several coupled equations describing the dynamics of the KP. For each component of the KP, one equation describes its change over time. Equations for TRP, KYN, KYNA, and quinolinic acid (QUIN) were developed, and the change in serotonin (SER) was also included. Together, these metabolites belong to the TRP metabolism: TRP is degraded to SER and to KYN, which is further degraded to KYNA and QUIN. Every degradation step is catalyzed by an enzyme. Therefore, Michaelis-Menten (MM) equations were used, employing the Michaelis constant Km and the maximal degradation velocity Vmax. To reduce the complexity of parameter estimation, the Km values of the different enzymes were fixed to literature values. The remaining parameters of the equations were determined so that the trajectories of the calculated metabolite levels match the data. The parameters of the different models were determined in this way. To propose a degradation pathway of 1-MT leading to increased KYNA levels, seven models were developed and compared. The most likely model was extended to test whether the inhibitory effects of 1-MT on IDO can be determined.
Three different approaches were used to determine the ODE model parameters for the different hypotheses of 1-MT degradation. In the first approach, ODE model parameters were fixed to values fitted to an independent data set. In the second approach, parameters were fitted to a subset of the data set, which was used for simulations of the different hypotheses. In the third approach, ODE model parameters were calculated 100 times without fixed parameters. The parameter set yielding trajectories of the TRP metabolites with the smallest distance to the data was assumed to be the most likely. The ODE model parameters were fitted to data measured in pigs. Two different experimental models delivered the data used in this thesis. The first experimental model activates IDO by LPS administration in pigs. The second combines IDO activation by LPS with the administration of 1-MT in pigs.
According to approach 1, the most likely hypothesis was the degradation of 1-MT to KYNA and TRP. For the second data set, the most likely one was the direct degradation of 1-MT to KYNA. With approach 2, the most likely degradation pathways were the combination of all degradation pathways and the degradation of 1-MT to TRP and of TRP to KYNA. With approach 3, the most likely explanation for the KYNA increase was the direct degradation of 1-MT to KYNA. In summary, the three approaches most frequently favored hypothesis 2, the direct degradation of 1-MT to KYNA. A cell-free assay validated this result. This experiment combined 1-MT or TRP with or without the enzyme kynurenine aminotransferase (KAT). KAT had already been shown to degrade TRP directly to KYNA. The levels of TRP, KYN, and KYNA were measured. The highest KYNA levels were obtained in the assay adding KAT to 1-MT, corresponding to hypothesis 2. The models describing the inhibitory effects of 1-MT revealed that the model without inhibitory effects of 1-MT on IDO was more likely for all three approaches.
The correctness of hypothesis 2 has to be confirmed by further in vitro experiments. It also has to be investigated which reactions promote the degradation of 1-MT to KYNA. The lack of inhibitory effects of 1-MT on IDO, as determined by the in silico ODE models, aligns with previous research, which showed that the saturation of 1-MT was too low, e.g. in pigs, to inhibit IDO efficiently.
In this study, the first possible degradation pathway of 1-MT along the KP is proposed. The reliability of the results depends on the quality of the experimental data and on the season in which the data were measured. Moreover, the results vary between the different approaches to parameter fitting. Different parameter-fitting approaches have to be included in the analysis to gain more evidence for the correctness of the results.
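The Michaelis-Menten rate law used for each enzymatic step of the models described above can be illustrated with a reduced two-step chain TRP → KYN → KYNA. This is a minimal sketch under strong simplifications: it omits SER, QUIN, LPS, and 1-MT, integrates with explicit Euler steps, and uses arbitrary parameter values that are neither literature values nor fitted results from the thesis. All function and parameter names are hypothetical.

```python
def mm_rate(substrate, vmax, km):
    """Michaelis-Menten rate v = Vmax * S / (Km + S)."""
    return vmax * substrate / (km + substrate)

def simulate_chain(trp0, kyn0, kyna0, params, t_end, dt=0.01):
    """Integrate TRP -> KYN -> KYNA with explicit Euler steps.

    params -- dict with vmax1, km1 (TRP -> KYN) and vmax2, km2 (KYN -> KYNA);
              all values used here are illustrative, not fitted.
    Returns a list of (time, TRP, KYN, KYNA) tuples.
    """
    trp, kyn, kyna = trp0, kyn0, kyna0
    trajectory = [(0.0, trp, kyn, kyna)]
    steps = int(round(t_end / dt))
    for step in range(1, steps + 1):
        v1 = mm_rate(trp, params["vmax1"], params["km1"])  # TRP consumed
        v2 = mm_rate(kyn, params["vmax2"], params["km2"])  # KYN consumed
        trp += -v1 * dt
        kyn += (v1 - v2) * dt
        kyna += v2 * dt
        trajectory.append((step * dt, trp, kyn, kyna))
    return trajectory

# Hypothetical parameters and initial concentrations in arbitrary units.
toy_params = {"vmax1": 1.0, "km1": 0.5, "vmax2": 0.8, "km2": 0.3}
toy_run = simulate_chain(2.0, 0.0, 0.0, toy_params, t_end=5.0)
```

Fitting in the sense of the abstract would then mean adjusting the Vmax values, with Km held fixed, until simulated trajectories such as `toy_run` lie as close as possible to the measured metabolite levels.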
Tafazzin is an acyltransferase with key functions in the remodeling of the mitochondrial phospholipid cardiolipin (CL) by exchanging single fatty acid species in CL. Tafazzin-mediated CL remodeling determines the actual CL composition and has been implicated in mitochondrial morphology and function. Thus, any deficiency of tafazzin leads to an altered fatty acid composition of CL, which is directly associated with impaired mitochondrial respiration and ATP production. Mutations in the tafazzin-encoding gene TAZ are the cause of the severe X-linked genetic disease Barth syndrome (BTHS).
Previous work provided first hints of a link between CL composition and subsequent limitations in cellular ATP levels, which may contribute to the restriction of growth. However, in C6 cells ATP levels remained unaltered due to compensatory activation of glycolysis. Moreover, it has been demonstrated that similar substantial changes in CL composition result from knocking down either cardiolipin synthase (CRLS) or TAZ. This has also been shown in C6 glioma cells. Most notably, only the knockdown of TAZ, but not that of CRLS, compromised proliferation of C6 glioma cells. Therefore, a CL-independent role of TAZ in regulating cell proliferation is postulated.
This study aimed to investigate the link between the lack of tafazzin and cellular proliferation in more detail to gain first insights into the underlying mechanisms.
The results of the current study demonstrate that tafazzin knockout in C6 glioma cells changes global gene expression, as shown by transcriptome analysis using the Clarion S rat Affymetrix microarray. Out of a total of 22,076 genes detected, 1,099 genes were differentially expressed in C6 knockout cells, being up- or downregulated ≥2-fold or ≥4-fold. Furthermore, expression of selected target genes was validated using RT-qPCR. We hypothesised that the TAZ-dependent changes in gene expression are mediated by PPAR transcription factors. According to the Eukaryotic Promoter Database (EPD), the selected target genes exhibited at least one putative binding site for the PPARG and PPARA transcription factors. However, pioglitazone and LG100268, synthetic ligands of PPARG and RXR, did not show any effect on the changes in gene expression in C6 TAZ cells. Oxylipins, another class of cellular lipids, were found in significantly higher amounts in C6 TAZ cells than in C6 cells, which makes them candidates for mediating cellular effects and regulating gene expression via PPARs. The computational tool CiiiDER was used for the prediction of transcription factor binding sites. The transcription factors enriched in TAZ-regulated genes were found to be HOXA5 and PAX2, binding sites of which could be detected in 100 % of TAZ-regulated genes (>2-fold). By applying IPA to the differentially expressed genes, we identified lipid metabolism, and the cholesterol superpathway in particular, as the most affected pathway in C6 TAZ cells. This pathway consists of 20 genes, all of which (20/20) appeared to be differentially regulated in C6 TAZ cells. Of these 20 genes, 4 were selected for further validation by RT-qPCR. IPA also made it possible to identify the upstream regulators that might be responsible for the differential expression of genes in tafazzin-deficient C6 cells. 
Expression of some of the identified genes, namely ACACA, HMGCR, FASN, and ACSL1, 3, and 5, was decreased through the predicted activation and inhibition of these regulators. Furthermore, we analysed the cellular cholesterol content in C6 and C6 TAZ cells (with or without Δ5 and FL). In C6 cells, cholesterol is present mostly in its free form. C6 TAZ cells contain more cholesterol than C6 cells. However, C6 TAZ cells expressing Δ5 or FL showed lower cholesterol levels.
Previous work established that knockout of tafazzin in C6 cells decreased cell proliferation in the absence of any changes in ATP content. To understand this phenomenon, a senescence-associated β-galactosidase assay was performed in C6 and C6 TAZ cells. C6 TAZ cells showed an increased percentage of β-gal-positive cells compared to C6 cells. Moreover, the senescence-associated secretory phenotype (SASP), represented by e.g. CXCL1, IL6, and IL1α, was determined using RT-qPCR. Gene expression of these SASP factors was significantly upregulated in C6 TAZ cells.
Several human tafazzin isoforms exist due to alternative splicing. However, whether these isoforms differ in function, and in CL remodelling activity or specificity in particular, is unknown. The purpose of this work was to determine whether specific isoforms, such as the human isoform lacking exon 5 (Δ5), rat full-length tafazzin (FL), and enzymatically dead full-length tafazzin (H69L), can restore the wild-type phenotype in terms of CL composition, cellular proliferation, and gene expression profile. In the second part, it was demonstrated that expression of Δ5 restores CL composition in C6 TAZ cells to some extent and that expression of rat full-length tafazzin restores it completely, which is naturally linked to the restoration of mitochondrial respiration. As expected, a comparable restoration of CL composition could not be seen after re-expressing an enzymatically dead full-length rat TAZ (H69L; TAZ Mut). Furthermore, re-expression of TAZ Mut largely failed to reverse the alterations in gene expression, whereas re-expression of TAZ FL and the Δ5 isoform reversed gene expression to a larger extent. Moreover, only rat full-length TAZ was able to restore the proliferation rate. Surprisingly, the expression of Δ5 in C6 TAZ cells did not promote proliferation to the level of the wild type. The different effects of Δ5 and FL on CL composition and cell proliferation point to specific and in part non-enzymatic functions of tafazzin isoforms, but this certainly requires further analysis.