OPUS 4 | Search

learnMSA: learning and aligning large protein families (2022)

Background: The alignment of large numbers of protein sequences is a challenging task and its importance grows rapidly along with the size of biological datasets. State-of-the-art algorithms have a tendency to produce less accurate alignments with an increasing number of sequences. This is a fundamental problem since many downstream tasks rely on accurate alignments. Results: We present learnMSA, a novel statistical learning approach of profile hidden Markov models (pHMMs) based on batch gradient descent. Fundamentally different from popular aligners, we fit a custom recurrent neural network architecture for (p)HMMs to potentially millions of sequences with respect to a maximum a posteriori objective and decode an alignment. We rely on automatic differentiation of the log-likelihood, and thus, our approach is different from existing HMM training algorithms like Baum–Welch. Our method does not involve progressive, regressive, or divide-and-conquer heuristics. We use uniform batch sampling to adapt to large datasets in linear time without the requirement of a tree. When tested on ultra-large protein families with up to 3.5 million sequences, learnMSA is both more accurate and faster than state-of-the-art tools. On the established benchmarks HomFam and BaliFam with smaller sequence sets, it matches state-of-the-art performance. All experiments were done on a standard workstation with a GPU. Conclusions: Our results show that learnMSA does not share the counterintuitive drawback of many popular heuristic aligners, which can substantially lose accuracy when many additional homologs are input. LearnMSA is a future-proof framework for large alignments with many opportunities for further improvements.
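The abstract above refers to the log-likelihood of sequences under an HMM as the quantity that is differentiated. As a rough illustration only, the following sketch computes that log-likelihood with the scaled forward algorithm for a small, plain HMM. It is not learnMSA's implementation: there is no profile structure, no prior, and no gradient step, and the function name and all parameter values are hypothetical.

```python
import math


def log_likelihood(obs, init, trans, emit):
    """Log-probability of an observation sequence under a plain HMM.

    init[s]     : probability of starting in state s
    trans[r][s] : probability of moving from state r to state s
    emit[s][x]  : probability that state s emits symbol x
    Uses the forward recursion with per-step rescaling to avoid underflow.
    """
    n = len(init)
    alpha = [init[s] * emit[s][obs[0]] for s in range(n)]
    total = 0.0
    for symbol in obs[1:]:
        scale = sum(alpha)
        total += math.log(scale)          # accumulate the log of the scale factor
        alpha = [a / scale for a in alpha]
        alpha = [
            sum(alpha[r] * trans[r][s] for r in range(n)) * emit[s][symbol]
            for s in range(n)
        ]
    return total + math.log(sum(alpha))


# Hypothetical two-state model over the alphabet {A, B}.
init = [0.6, 0.4]
trans = [[0.7, 0.3], [0.2, 0.8]]
emit = [{"A": 0.9, "B": 0.1}, {"A": 0.2, "B": 0.8}]
example = log_likelihood("ABBA", init, trans, emit)
```

A gradient-based trainer would treat `init`, `trans` and `emit` as parameters and maximize this value summed over a batch of sequences; that step is left out here.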

Defining Binary Phylogenetic Trees Using Parsimony (2022)

Fischer, Mareike

Phylogenetic (i.e., leaf-labeled) trees play a fundamental role in evolutionary research. A typical problem is to reconstruct such trees from data like DNA alignments (whose columns are often referred to as characters), and a simple optimization criterion for such reconstructions is maximum parsimony. It is generally assumed that this criterion works well for data in which state changes are rare. In the present manuscript, we prove that each binary phylogenetic tree T with n ≥ 20k leaves is uniquely defined by the set A_k(T), which consists of all characters with parsimony score k on T. This can be considered as a promising first step toward showing that maximum parsimony as a tree reconstruction criterion is justified when the number of changes in the data is relatively small.
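To make the notion of a parsimony score concrete, here is a minimal sketch of the standard Fitch algorithm, followed by a brute-force enumeration of all characters with score k on a small tree. Several things are assumptions made for illustration and are not taken from the manuscript: the tree encoding as nested tuples, the arbitrary rooting (the Fitch score does not depend on where the root is placed), the restriction to two-state characters, and the names `fitch_score` and `characters_with_score`.

```python
from itertools import product


def fitch_score(tree, states):
    """Parsimony score of one character on a rooted binary tree (Fitch).

    tree   : nested 2-tuples with leaf names (str) at the tips
    states : dict mapping each leaf name to its character state
    """
    def visit(node):
        if isinstance(node, str):                 # leaf: its own state, no change
            return {states[node]}, 0
        left, right = node
        left_set, left_cost = visit(left)
        right_set, right_cost = visit(right)
        common = left_set & right_set
        if common:                                # children agree: no extra change
            return common, left_cost + right_cost
        return left_set | right_set, left_cost + right_cost + 1

    return visit(tree)[1]


def characters_with_score(tree, leaves, k):
    """All two-state characters on `leaves` whose parsimony score on `tree` is k."""
    result = []
    for assignment in product("01", repeat=len(leaves)):
        character = dict(zip(leaves, assignment))
        if fitch_score(tree, character) == k:
            result.append(character)
    return result


# Hypothetical five-leaf example tree.
leaves = ["a", "b", "c", "d", "e"]
tree = ((("a", "b"), "c"), ("d", "e"))
score = fitch_score(tree, {"a": "0", "b": "0", "c": "1", "d": "1", "e": "1"})
a_1 = characters_with_score(tree, leaves, 1)
```

The enumeration is exponential in the number of leaves and is only meant to show what the set looks like on a toy tree.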

Walsh’s Conformal Map onto Lemniscatic Domains for Polynomial Pre-images I (2022)

Schiefermayr, Klaus ; Sète, Olivier

We consider Walsh’s conformal map from the exterior of a compact set E ⊆ C onto a lemniscatic domain. If E is simply connected, the lemniscatic domain is the exterior of a circle, while if E has several components, the lemniscatic domain is the exterior of a generalized lemniscate and is determined by the logarithmic capacity of E and by the exponents and centers of the generalized lemniscate. For general E, we characterize the exponents in terms of the Green’s function of E^c. Under additional symmetry conditions on E, we also locate the centers of the lemniscatic domain. For polynomial pre-images E = P^{-1}(Ω) of a simply-connected infinite compact set Ω, we explicitly determine the exponents in the lemniscatic domain and derive a set of equations to determine the centers of the lemniscatic domain. Finally, we present several examples where we explicitly obtain the exponents and centers of the lemniscatic domain, as well as the conformal map.

Global, highly specific and fast filtering of alignment seeds (2022)

Ebel, Matthis ; Migliorelli, Giovanna ; Stanke, Mario

Background: An important initial phase of arguably most homology search and alignment methods, such as those required for genome alignments, is seed finding. The seed finding step is crucial to curb the runtime as potential alignments are restricted to and anchored at the sequence position pairs that constitute the seed. To identify seeds, it is good practice to use sets of spaced seed patterns, a method that locally compares two sequences and requires exact matches at certain positions only. Results: We introduce a new method for filtering alignment seeds that we call geometric hashing. Geometric hashing achieves a high specificity by combining non-local information from different seeds using a simple hash function that only requires a constant and small amount of additional time per spaced seed. Geometric hashing was tested on the task of finding homologous positions in the coding regions of human and mouse genome sequences. Thereby, the number of false positives was decreased about a million-fold over sets of spaced seeds while maintaining a very high sensitivity. Conclusions: An additional geometric hashing filtering phase could improve the run-time, accuracy or both of programs for various homology-search-and-align tasks.
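The spaced seed idea mentioned above (exact matches required at certain positions only) can be sketched as follows. This shows only plain spaced-seed matching between two sequences; it does not implement the geometric hashing filter, whose details are not given in the abstract. The function name, the pattern and the example sequences are hypothetical.

```python
def spaced_seed_hits(seq_a, seq_b, pattern):
    """Return position pairs (i, j) where seq_a[i:] and seq_b[j:] agree at
    every '1' position of the pattern; '0' positions are ignored."""
    care = [k for k, c in enumerate(pattern) if c == "1"]
    span = len(pattern)

    # Index every window of seq_a by the letters at the "care" positions.
    index = {}
    for i in range(len(seq_a) - span + 1):
        key = "".join(seq_a[i + k] for k in care)
        index.setdefault(key, []).append(i)

    # Look up every window of seq_b in that index.
    hits = []
    for j in range(len(seq_b) - span + 1):
        key = "".join(seq_b[j + k] for k in care)
        for i in index.get(key, []):
            hits.append((i, j))
    return hits


# Pattern "1101": positions 0, 1 and 3 must match, position 2 is free.
hits = spaced_seed_hits("ACGTACGT", "ACCTACGA", "1101")
```

In this toy input the pair (4, 0) is a hit that lies on a different diagonal from the other two, which is the kind of isolated seed that a later filtering phase would aim to discard.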

Relevance and Regulation of Alternative Splicing in Plant Heat Stress Response: Current Understanding and Future Directions (2022)

Rosenkranz, Remus R. E. ; Ullrich, Sarah ; Löchli, Karin ; Simm, Stefan ; Fragkostefanakis, Sotirios

Alternative splicing (AS) is a major mechanism for gene expression in eukaryotes, increasing proteome diversity but also regulating transcriptome abundance. High temperatures have a strong impact on the splicing profile of many genes and therefore AS is considered an integral part of the heat stress response. While many studies have established a detailed description of the diversity of the RNAome under heat stress in different plant species and stress regimes, little is known about the underlying mechanisms that control this temperature-sensitive process. AS is mainly regulated by the activity of splicing regulators. Changes in the abundance of these proteins through transcription and AS, post-translational modifications and interactions with exonic and intronic cis-elements and core elements of the spliceosomes modulate the outcome of pre-mRNA splicing. As a major part of pre-mRNAs are spliced co-transcriptionally, the chromatin environment along with the RNA polymerase II elongation play a major role in the regulation of pre-mRNA splicing under heat stress conditions. Despite its importance, our understanding of the regulation of heat stress sensitive AS in plants is limited. In this review, we summarize the current status of knowledge on the regulation of AS in plants under heat stress conditions. We discuss possible implications of different pathways based on results from non-plant systems to provide a perspective for researchers who aim to elucidate the molecular basis of AS under high temperatures.

Detection of Low MAP Shedder Prevalence in Large Free-Stall Dairy Herds by Repeated Testing of Environmental Samples and Pooled Milk Samples (2022)

Wichert, Annika ; Kasbohm, Elisa ; Einax, Esra ; Wehrend, Axel ; Donat, Karsten

Simple Summary: Paratuberculosis is a disease which affects ruminants worldwide. Many countries have implemented certification and monitoring systems to control the disease, particularly in dairy herds. Monitoring herds certified as paratuberculosis non-suspect is an important component of paratuberculosis herd certification programs. The challenge is to detect the introduction or reintroduction of the infectious agent as early as possible with reasonable efforts but high certainty. In our study, we evaluated different low-cost testing schemes in herds where the share of infected animals was low, resulting in a low within-herd prevalence of animals shedding the bacterium that causes paratuberculosis in their feces. The test methods used were repeated pooled milk samples and fecal samples from the barn environment. Our study showed that numerous repetitions of different samples are necessary to monitor such herds with sufficiently high certainty. In the case of herds with a very low prevalence, our study showed that a combination of different sampling approaches is required. Abstract: An easy-to-use and affordable surveillance system is crucial for paratuberculosis control. The use of environmental samples and milk pools has been proven to be effective for the detection of Mycobacterium avium subsp. paratuberculosis (MAP)-infected herds, but not for monitoring dairy herds certified as MAP non-suspect. We aimed to evaluate methods for the repeated testing of large dairy herds with a very low prevalence of MAP shedders, using different sets of environmental samples or pooled milk samples, collected monthly over a period of one year in 36 herds with known MAP shedder prevalence. Environmental samples were analyzed by bacterial culture and fecal PCR, and pools of 25 and 50 individual milk samples were analyzed by ELISA for MAP-specific antibodies. We estimated the cumulative sensitivity and specificity for up to twelve sampling events by adapting a Bayesian latent class model and taking into account the between- and within-test correlation. Our study revealed that at least seven repeated samplings of feces from the barn environment are necessary to achieve a sensitivity of 95% in herds with a within-herd shedder prevalence of at least 2%. The detection of herds with a prevalence of less than 2% is more challenging and, in addition to numerous repetitions, requires a combination of different samples.
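The idea of cumulative sensitivity over repeated sampling events can be illustrated with a deliberately simplified calculation. The sketch below assumes that sampling events are independent and share one per-event sensitivity, which the study explicitly does not assume (it models between- and within-test correlation in a Bayesian latent class model). The function names and the input values 0.4 and 0.5 are hypothetical and are not results from the study.

```python
import math


def cumulative_sensitivity(per_event_sensitivity, n_events):
    """Probability that at least one of n independent sampling events
    detects an infected herd (simplifying independence assumption)."""
    return 1.0 - (1.0 - per_event_sensitivity) ** n_events


def events_needed(per_event_sensitivity, target):
    """Smallest number of independent events whose cumulative sensitivity
    reaches the target; both arguments must lie strictly between 0 and 1."""
    if not (0.0 < per_event_sensitivity < 1.0 and 0.0 < target < 1.0):
        raise ValueError("sensitivity and target must be in (0, 1)")
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - per_event_sensitivity))


# Hypothetical per-event sensitivity of 0.4 and a 95% target.
n = events_needed(0.4, 0.95)
reached = cumulative_sensitivity(0.4, n)
```

Positive correlation between repeated samples from the same herd would make the true cumulative sensitivity lower than this independence formula suggests, which is one reason a correlation-aware model is used instead.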

Application of a maximal-clique based community detection algorithm to gut microbiome data reveals driver microbes during influenza A virus infection (2022)

Bhar, Anirban ; Gierse, Laurin Christopher ; Meene, Alexander ; Wang, Haitao ; Karte, Claudia ; Schwaiger, Theresa ; Schröder, Charlotte ; Mettenleiter, Thomas C. ; Urich, Tim ; Riedel, Katharina ; Kaderali, Lars

Influenza A Virus (IAV) infection followed by bacterial pneumonia often leads to hospitalization and death in individuals from high risk groups. Following infection, IAV triggers the process of viral RNA replication which in turn disrupts the healthy gut microbial community, while the gut microbiota plays an instrumental role in protecting the host by evolving colonization resistance. Although the underlying mechanisms of IAV infection have been unraveled, the underlying complex mechanisms evolved by gut microbiota in order to induce host immune response following IAV infection remain elusive. In this work, we developed a novel Maximal-Clique based Community Detection algorithm for Weighted undirected Networks (MCCD-WN) and compared its performance with other existing algorithms using three sets of benchmark networks. Moreover, we applied our algorithm to gut microbiome data derived from fecal samples of both healthy and IAV-infected pigs over a sequence of time-points. The results we obtained from the real-life IAV dataset unveil the role of the microbial families Ruminococcaceae, Lachnospiraceae, Spirochaetaceae and Prevotellaceae in the gut microbiome of the IAV-infected cohort. Furthermore, the additional integration of metaproteomic data enabled not only the identification of microbial biomarkers, but also the elucidation of their functional roles in protecting the host following IAV infection. Our network analysis reveals a fast recovery of the infected cohort after the second IAV infection and provides insights into crucial roles of Desulfovibrionaceae and Lactobacillaceae families in combating Influenza A Virus infection. Source code of the community detection algorithm can be downloaded from https://github.com/AniBhar84/MCCD-WN.
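Maximal cliques are the building block named in the algorithm's title. The sketch below enumerates them with the basic Bron-Kerbosch recursion on a small undirected graph. It is not the MCCD-WN algorithm: edge weights are ignored, no communities are formed from the cliques, and the node labels and helper names are hypothetical. The authors' own code is at the repository linked above.

```python
def build_adjacency(edges):
    """Turn a list of undirected edges (u, v) into a dict of neighbor sets."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    return adjacency


def maximal_cliques(adjacency):
    """Enumerate all maximal cliques (basic Bron-Kerbosch, no pivoting)."""
    cliques = []

    def expand(current, candidates, excluded):
        if not candidates and not excluded:
            cliques.append(frozenset(current))    # nothing can extend it
            return
        for v in list(candidates):
            expand(current | {v}, candidates & adjacency[v], excluded & adjacency[v])
            candidates = candidates - {v}         # v is fully explored
            excluded = excluded | {v}

    expand(set(), set(adjacency), set())
    return cliques


# Two triangles sharing the node "C".
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"), ("C", "E"), ("D", "E")]
cliques = maximal_cliques(build_adjacency(edges))
```

A clique-based community method would then merge or score these cliques, for example using edge weights; how MCCD-WN does this is not described in the abstract.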

Elementary Fractal Geometry. 2. Carpets Involving Irrational Rotations (2022)

Bandt, Christoph ; Mekhontsev, Dmitry

Self-similar sets with the open set condition, the linear objects of fractal geometry, have been considered mainly for crystallographic data. Here we introduce new symmetry classes in the plane, based on rotation by irrational angles. Examples without characteristic directions, with strong connectedness and small complexity, were found in a computer-assisted search. They are surprising since the rotations are given by rational matrices, and the proof of the open set condition usually requires integer data. We develop a classification of self-similar sets by symmetry class and algebraic numbers. Examples are given for various quadratic number fields.

Computer Vision for Detection of Body Posture and Behavior of Red Foxes (2022)

Schütz, Anne K. ; Krause, E. Tobias ; Fischer, Mareike ; Müller, Thomas ; Freuling, Conrad M. ; Conraths, Franz J. ; Homeier-Bachmann, Timo ; Lentz, Hartmut H. K.

Simple Summary: Monitoring animal behavior provides an indicator of their health and welfare. For this purpose, video surveillance is an important method to get an unbiased insight into behavior, as animals often show different behavior in the presence of humans. However, manual analysis of video data is costly and time-consuming. For this reason, we present a method for automated analysis using computer vision, a method for teaching the computer to see like a human. In this study, we use computer vision to detect red foxes and their body posture (lying, sitting, or standing). With these data we are able to monitor the animals, determine their activity, and identify their behavior. Abstract: The behavior of animals is related to their health and welfare status. The latter plays a particular role in animal experiments, where continuous monitoring is essential for animal welfare. In this study, we focus on red foxes in an experimental setting and study their behavior. Although animal behavior is a complex concept, it can be described as a combination of body posture and activity. To measure body posture and activity, video monitoring can be used as a non-invasive and cost-efficient tool. While it is possible to analyze the video data resulting from the experiment manually, this method is time-consuming and costly. We therefore use computer vision to detect and track the animals over several days. The detector is based on a neural network architecture. It is trained to detect red foxes and their body postures, i.e., ‘lying’, ‘sitting’, and ‘standing’. The trained algorithm has a mean average precision of 99.91%. The combination of activity and posture results in nearly continuous monitoring of animal behavior. Furthermore, the detector is suitable for real-time evaluation. In conclusion, evaluating the behavior of foxes in an experimental setting using computer vision is a powerful tool for cost-efficient real-time monitoring.

Geometric T-Duality (2022)

Kunath, Malte Arthur

From a physicist's point of view, T-duality is a relation connecting string theories on different spacetimes. Mathematically speaking, T-duality should be a symmetric relation on the space of toroidal string backgrounds. Such a background consists of: a smooth manifold M; a torus bundle E over M - the total space modelling spacetime; a Riemannian metric g on E - modelling the field of gravity; a U(1)-bundle gerbe G with connection over E - modelling the Kalb-Ramond field. But as of now no complete model for T-duality exists. The three most notable approaches for T-duality are given by the differential approaches by Buscher in the form of the Buscher rules and by Bouwknegt, Evslin and Mathai in the form of T-duality with H-flux on the one hand, and by the topological approach given by Bunke, Rumpf and Schick which is known as topological T-duality. In this thesis we combine these different approaches to form the first model for T-duality over complete geometric toroidal string backgrounds and we will introduce an example for this geometric T-duality inspired by the Hopf bundle.

Discovering Latent Structure in High-Dimensional Healthcare Data: Toward Improved Interpretability (2022)

Becker, Ann-Kristin

This cumulative thesis describes contributions to the field of interpretable machine learning in the healthcare domain. Three research articles are presented that lie at the intersection of biomedical and machine learning research. They illustrate how incorporating latent structure can provide a valuable compression of the information hidden in complex healthcare data. Methodologically, this thesis gives an overview of interpretable machine learning and the discovery of latent structure, including clusters, latent factors, graph structure, and hierarchical structure. Different workflows are developed and applied to two main types of complex healthcare data (cohort study data and time-resolved molecular data). The core result builds on Bayesian networks, a type of probabilistic graphical model. On the application side, we provide accurate predictive or discriminative models focusing on relevant medical conditions, related biomarkers, and their interactions.

Modeling spatial patterns of survival, space use and recovery probability (2022)

Schirmer, Saskia

Spatial variation in survival has individual fitness consequences and influences population dynamics. It proximately and ultimately impacts space use including migratory connectivity. Therefore, knowing spatial patterns of survival is crucial to understand demography of migrating animals. Extracting information on survival and space use from observation data, in particular dead recovery data, requires explicitly identifying the observation process. The main aim of this work is to establish a modeling framework which allows estimating spatial variation in survival, migratory connectivity and observation probability using dead recovery data. We provide some biological background on survival and migration and a short methodological overview of how similar situations are modeled in the literature. Afterwards, we provide REML-like estimators for discrete space and show identifiability of all three parameters using the characteristics of the multinomial distribution. Moreover, we formulate a model in continuous space using mixed binomial point processes. The continuous model assumes a constant recovery probability over space. To drop this strict assumption, we develop an optimization procedure combining the discrete and continuous space model. Therefore, we use penalized M-splines. In simulation studies we demonstrate the performance of the estimators for all three model approaches. Furthermore, we apply the models to real-world data sets of European robins (Erithacus rubecula) and ospreys (Pandion haliaetus). We discuss how this study can be embedded in the framework of animal movement and the capture-mark-recapture/recovery methodology. It can be seen as a contribution and an extension to distance sampling, local stationary everyday movement and dispersal. We emphasize the importance of having a mathematically clearly formulated modeling framework for applied methods. Moreover, we comment on model assumptions and their limits. In the future, it would be appealing to extend this framework to the full annual cycle and carry-over effects.

12 search hits