Convolutional Neural Network-based image classification models are the current state of the art for solving image classification problems. However, obtaining and using such a model for a specific image classification problem presents several challenges in practice. To train the model, good hyperparameter values, such as the initial model weights or the learning rate, must be found, which is usually a non-trivial process. Another problem is that the training data used for model training is often class-imbalanced in practice, which usually has a negative impact on model training. The challenges do not end once a model has been obtained: after training, the model may be applied to images drawn from a data distribution different from the one the training data was drawn from. Such images are typically referred to as out-of-distribution samples. Unfortunately, Convolutional Neural Network-based image classification models typically fail to predict the correct class for out-of-distribution samples without warning, which is problematic when such a model is used in safety-critical applications. In my work, I examined whether information from the layers of a Convolutional Neural Network-based image classification model (pixels and activations) can be used to address all of these issues. As a result, I suggest a method for initializing the model weights based on image patches, a method for balancing a class-imbalanced dataset based on layer activations, and a method for detecting out-of-distribution samples, which is also based on layer activations. To test the proposed methods, I conducted extensive experiments on different datasets.
My experiments showed that layer information (pixels and activations) can indeed be used to address all of the aforementioned challenges when training and using Convolutional Neural Network-based image classification models.
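The abstract does not detail how layer activations are used for out-of-distribution detection. One common activation-based scheme, sketched here purely as an illustration (this is an assumption, not necessarily the thesis method), models the per-class mean activations of a late layer and flags samples whose activation vector is far from every class mean:

```python
# Sketch of activation-based OOD detection (an illustrative scheme,
# not necessarily the exact method of the thesis): flag a sample as
# out-of-distribution if its late-layer activation vector is far from
# every class-wise mean activation observed on the training data.
from math import dist  # Euclidean distance, Python >= 3.8

def class_means(activations, labels):
    """Mean activation vector per class label."""
    sums, counts = {}, {}
    for a, y in zip(activations, labels):
        s = sums.setdefault(y, [0.0] * len(a))
        for i, v in enumerate(a):
            s[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in s] for y, s in sums.items()}

def is_ood(activation, means, threshold):
    """True if the sample is farther than `threshold` from all class means."""
    return min(dist(activation, m) for m in means.values()) > threshold

# Toy example: two tight clusters of "in-distribution" activations.
train_acts = [[0.0, 0.0], [0.2, 0.0], [5.0, 5.0], [5.2, 5.0]]
train_labels = [0, 0, 1, 1]
means = class_means(train_acts, train_labels)

print(is_ood([0.1, 0.1], means, threshold=1.0))    # close to class 0
print(is_ood([10.0, -3.0], means, threshold=1.0))  # far from both classes
```

In practice the threshold would be calibrated on held-out in-distribution data, e.g. to a fixed false-alarm rate.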
Age is the single biggest risk factor for most major human diseases. As such, understanding the intricate molecular changes that drive biological aging holds great promise for slowing the onset of systemic diseases and thereby increasing the effective health-span in modern societies.
This thesis explores several computational approaches to capture and analyze the molecular biological alterations triggered by intrinsic and extrinsic aging using skin as a model tissue to deliver genes and pathways as potential targets for intervention strategies.
Publication 1 demonstrates the utility of multi-omics data integration strategies for aging research, leading to the identification of four latent aging phases in skin tissue through an integrated cluster analysis of gene expression and DNA methylation data. The four phases improved the detection of molecular aging signals and were shown to be associated with the sunbathing habits of the test subjects. Deeper analysis revealed extensive non-linear alterations in various biological pathways, particularly at the transition into the fourth aging phase, coinciding with menopause, with potentially wide-reaching functional implications. Publication 2 describes the development of a novel type of age clock that provides a new level of interpretability by embedding biological pathway information in the architecture of an artificial neural network. The clock not only generates meaningful biological age estimates from gene expression data, but also allows simultaneous monitoring of the aging states of various biological processes through the activations of intermediate neurons. Analyses of the inner workings of the clock revealed a widespread impact of aging on the global pathway landscape. Simulation experiments using the transcriptomic clock recapitulated known functional aging gene associations and allowed deciphering of the pathways by which accelerated aging conditions such as chronic sun exposure and Hutchinson-Gilford progeria syndrome exert their effects. Publication 3 further explores the molecular alterations caused by the pro-aging effector UV irradiation in the skin. The multi-omics data analysis of repetitively irradiated skin revealed signs of the immediate acquisition of aging- and cancer-related epigenetic signatures and concurrent widespread transcriptional changes across various biological processes.
Investigations into the varying resilience to irradiation between subjects revealed prognostic biomarker signatures capable of predicting individual UV tolerances, with accuracies far surpassing the traditional Fitzpatrick classification scheme. Further analysis of the transcripts and pathways associated with UV tolerance identified a form of melanin-independent DNA damage protection in individuals with higher innate UV resilience.
Together, the approaches and findings described in this thesis explore several new angles to advance our understanding of aging processes and external drivers of aging such as UV irradiation in the human skin and deliver new insight on target genes and pathways involved.
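The pathway-embedded clock of Publication 2 constrains network connectivity so that each intermediate neuron corresponds to one biological pathway and only receives input from that pathway's member genes. A minimal sketch of such a masked layer follows; the gene sets, weights and expression values are invented for illustration:

```python
# Minimal sketch of a pathway-constrained network layer, as used in
# interpretable "age clock" architectures: each hidden neuron stands
# for one pathway and only sees the expression of that pathway's
# member genes, so its activation can be read as that pathway's aging
# state. Gene sets, weights and values below are invented.
def pathway_layer(expression, pathways, weights):
    """expression: gene -> expression value; pathways: name -> gene
    list; weights: name -> {gene: weight}. Returns one activation per
    pathway neuron."""
    activations = {}
    for name, genes in pathways.items():
        z = sum(weights[name][g] * expression[g] for g in genes)
        activations[name] = max(0.0, z)  # ReLU non-linearity
    return activations

expr = {"TP53": 1.0, "CDKN2A": 2.0, "COL1A1": 0.5}
pathways = {"senescence": ["TP53", "CDKN2A"], "matrix": ["COL1A1"]}
weights = {"senescence": {"TP53": 0.5, "CDKN2A": 1.0},
           "matrix": {"COL1A1": -2.0}}
print(pathway_layer(expr, pathways, weights))
```

A final linear layer over these pathway activations would then produce the age estimate, while the activations themselves remain inspectable per pathway.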
Background
Previous work has focused on speckle-tracking echocardiography (STE)-derived global longitudinal and circumferential peak strain as potentially superior prognostic markers compared with left ventricular ejection fraction (LVEF). However, the value of the regional distribution and the respective orientation of left ventricular wall motion (quantified as strain and derived from STE) for survival prediction has not been investigated yet. Moreover, most recent studies on risk stratification in primary and secondary prevention do not use neural networks for outcome prediction.
Purpose
To evaluate the performance of neural networks for predicting all-cause mortality with different model inputs in a moderate-sized general population cohort.
Methods
All participants of the second cohort of the population-based Study of Health in Pomerania (SHIP-TREND-0) without prior cardiovascular disease (CVD; acute myocardial infarction, cardiac surgery/intervention, heart failure and stroke) and with transthoracic echocardiography exams were followed for all-cause mortality from baseline examination (2008-2012) until 2019.
A novel deep neural network architecture, ‘nnet-Surv-rcsplines’, which extends the Royston-Parmar cubic-splines survival model to neural networks, was proposed and applied to predict all-cause mortality from STE-derived global and/or regional myocardial longitudinal, circumferential, transverse, and radial strain in addition to the components of the ESC SCORE model. The models were evaluated by the 8.5-year area under the receiver operating characteristic curve (AUROC) and the (scaled) Brier score [(S)BS] and compared to the SCORE model adjusted for mortality rates in Germany in 2010.
Results
In total, 3858 participants (53 % female, median age 51 years) were followed for a median time of 8.4 (95 % CI 8.3-8.5) years. Application of ‘nnet-Surv-rcsplines’ to the components of the ESC SCORE model alone resulted in the best discriminatory performance (AUROC 0.9 [0.86-0.91]) and the lowest prediction error (SBS 21 [18-23] %). The prediction error was significantly lower (p < 0.001) than that of the original SCORE model (SBS 11 [9.5-13] %), while discrimination did not differ significantly. There was no difference in (S)BS (p = 0.66) when global circumferential and longitudinal strain were added to the model. Solely including STE data resulted in an informative (AUROC 0.71 [0.69-0.74]; SBS 3.6 [2.8-4.6] %) but worse (p < 0.001) model performance than when the sociodemographic and instrumental biomarkers were also considered.
Conclusion
Regional myocardial strain distribution contains prognostic information for predicting all-cause mortality in a primary prevention sample of subjects without CVD. Still, an incremental prognostic value of the STE parameters was not demonstrated. Applying neural networks to available traditional risk factors in primary prevention may improve outcome prediction compared to standard statistical approaches and lead to better treatment decisions.
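The scaled Brier score used to compare the models above relates a model's Brier score to that of a null model predicting the overall event rate, so a higher SBS means a lower prediction error. A minimal sketch (omitting the censoring adjustments that real survival data would additionally require):

```python
# Minimal sketch of the (scaled) Brier score. BS is the mean squared
# difference between predicted event probability and observed outcome;
# SBS = 1 - BS / BS_null compares the model with a null model that
# predicts the overall event rate for everyone. Censoring adjustments
# (needed for real survival data) are omitted here.
def brier(probs, outcomes):
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)

def scaled_brier(probs, outcomes):
    rate = sum(outcomes) / len(outcomes)          # null model: event rate
    bs_null = brier([rate] * len(outcomes), outcomes)
    return 1.0 - brier(probs, outcomes) / bs_null

outcomes = [1, 0, 0, 1]        # observed events within the horizon (toy data)
probs = [0.8, 0.2, 0.1, 0.7]   # model-predicted event probabilities
print(round(scaled_brier(probs, outcomes), 3))
```

An SBS of 0 means no improvement over the null model; 1 would be a perfect prediction.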
A large amount of research data has become available since the outbreak of the
COVID-19 pandemic in 2019. Connecting this data is essential for understanding
the SARS-CoV-2 virus and for the fight against the pandemic.
Amongst biological and biomedical research data, computational models targeting
COVID-19 have been emerging, and their number is growing constantly. They are a
central part of the field of Systems Biology, which aims to understand the
mechanisms and behaviour of biological systems. Model predictions help to
understand the mechanisms of the novel coronavirus and the life-threatening
disease it causes.
Both biomedical research data and modelling data regarding COVID-19 have
previously been stored in separate domain-specific graph databases. MaSyMoS,
short for Management System for Models and Simulations, is a graph database for
storing simulation studies of biological and biochemical systems. The CovidGraph
project integrates research data regarding COVID-19 and the coronavirus family
from various data resources into a knowledge graph.
In this thesis, we integrate simulation models from MaSyMoS, including models
targeting COVID-19, into the CovidGraph. To this end, we present a concept for
the integration of simulation studies and their linkage through ontology terms
and reference publications in the CovidGraph. Ultimately, we connect data from
the field of systems biology with biomedical research data in a graph database.
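The linkage concept can be illustrated with a small in-memory triple store. All node identifiers and relation names below are invented placeholders; the actual CovidGraph is a Neo4j property graph queried with Cypher:

```python
# Illustrative in-memory sketch of the linkage concept: imported
# simulation models become reachable from existing knowledge-graph
# nodes through shared ontology terms and reference publications.
# All identifiers and relation names are invented placeholders.
edges = set()  # (source, relation, target) triples

def link(source, relation, target):
    edges.add((source, relation, target))

def targets(node, relation):
    """All nodes reachable from `node` via `relation`."""
    return {t for s, r, t in edges if s == node and r == relation}

# A simulation model imported from MaSyMoS ...
link("model:M1", "ANNOTATED_WITH", "term:viral_life_cycle")
link("model:M1", "CITES", "paper:P1")
# ... and a pre-existing CovidGraph node citing the same publication:
link("publication_cluster:C7", "CITES", "paper:P1")

# The shared reference publication connects the model to the rest of
# the graph:
shared = targets("model:M1", "CITES") & targets("publication_cluster:C7", "CITES")
print(shared)
```

In the real graph database, the same idea would be expressed by `MERGE`-ing shared publication and ontology-term nodes so that both data domains attach to them.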
In this doctoral thesis, algorithms are presented that are designed for the investigation of the mesopause region within the upper Mesosphere and Lower Thermosphere (MLT). Photochemical models are proposed and applied to represent the oxygen airglow and the oxygen photochemistry in the MLT. Atomic oxygen, O, in the ground state, O(3P), is of special interest because it is a reactive trace gas actively contributing to the Earth's airglow. The retrievals of O(3P) concentrations, [O(3P)], are based on the nightglow time series of the green line emission measured remotely, as in the first part of this thesis, and on individual profiles of multiple nightglow emissions of O and molecular oxygen (O2) measured in situ, as in the second part of this thesis. To process the complete spectral time series measured by the satellite-borne instrument SCIAMACHY (SCanning Imaging Absorption spectroMeter for Atmospheric CHartographY), an intricate set of algorithms is developed and applied with the regularized total least squares minimization approach to estimate a set of optimal regularization parameters and to retrieve a corresponding set of vertical Volume Emission Rate (VER) profiles. Furthermore, these algorithms take emissions of other origins and the Earth's shape into account. Considering unidentified states of O2, the established photochemical models are adjusted, resulting in two model modifications. Both model modifications are employed to retrieve the [O(3P)] time series on the basis of the VER time series in the MLT. The model input parameters vary in the atmosphere, which motivated proposing these two model modifications and employing several available sources of the input parameters. One semi-empirical model, one general circulation model and the satellite-borne instrument SABER (Sounding of the Atmosphere using Broadband Emission Radiometry) are employed as sources of the reference [O(3P)] and of the input parameter time series.
The SABER instrument employed as a source of the input parameters is preferred according to the comparison of the retrieved and reference [O(3P)] time series. To study the impact of the 11-year solar cycle on O(3P) in the MLT, an algorithm is developed and applied with the Levenberg-Marquardt algorithm to estimate the optimal fit parameters step-wise. The results of the O(3P) sensitivity analysis obtained with respect to the solar activity forcing at the 11-year and 27-day time scales and the lunar gravitational forcing agree with the reference model simulations. The hypothesis regarding vertical shifts between different Meinel bands, at least partly caused by quenching of the hydroxyl radical (OH*) with O(3P), is confirmed experimentally. Based on the conclusion drawn in the first part of this thesis that the data sets' self-consistency is high for the averaged SABER and SCIAMACHY measurements, a comprehensive set of available data with a higher level of self-consistency is employed in the second part of this thesis. Multiple airglow emissions measured in situ during four campaigns are employed to propose the Multiple Airglow Chemistry (MAC) model. The processed emissions are the Herzberg I, Chamberlain, Atmospheric and Infrared Atmospheric band emissions of O2 and the green line emission of O. Considering all widely known and additionally complemented reactions, the MAC model is proposed to represent the oxygen airglow and the oxygen photochemistry in the MLT. The presented MAC model is based on the hypothesis of Slanger et al. (2004) stating that higher excited states of O2 are coupled with each other through vibronic de-excitation caused by collisions among molecules of this group of O2 states in the MLT. This hypothesis is modified by excluding the singlet Herzberg state of O2 from the group of O2 states considered by Slanger et al. (2004).
The MAC calculations are carried out sequentially, starting with the higher excited O2 states, to provide the retrieved output concentrations of these O2 states as input concentrations to the next calculation steps. The final step is based only on the concentrations of all species, whereas each of the earlier steps is based on a corresponding VER profile in addition to the input concentrations. The oxygen photochemistry in the MLT is represented by all species considered at the final step, which makes it possible to adopt the MAC reactions in a general circulation model. Four modifications of the MAC model, i.e. including or excluding the triplet Herzberg states of O2 and including or excluding ozone and odd hydrogen (hydrogen, OH* and the hydroperoxy radical), lead to negligible differences in the retrieved [O(3P)] profiles. Based on the MAC calculations verified and validated on the basis of the four rocket campaigns, one of the effective modifications of the MAC model (excluding the triplet Herzberg states of O2, ozone and odd hydrogen) is further reduced to the most effective modification. This implies that for the [O(3P)] retrieval, only the O2 Atmospheric band emission, the temperature and the concentrations of molecular nitrogen (N2) and O2 are required. Calculations carried out using the most effective modification of the MAC model are verified and validated on the basis of self-consistent in situ measurements obtained simultaneously. The MAC model enables identifying the precursors of (1) the three lowest O2 valence states and (2) the second excited O state, responsible for (1) the Atmospheric and Infrared Atmospheric band emissions of O2 and (2) the green line emission of O, respectively. In particular, the singlet Herzberg state of O2 is identified as the major precursor of the second excited O state resulting in the green line emission. A potential focus of further research is the extension of the MAC model with vibrationally excited states of O2 and ionized species.
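The regularized retrieval balances the fit to the measured emissions against a damping term controlled by a regularization parameter. A toy example of ordinary Tikhonov-regularized least squares in pure Python illustrates the role of that parameter (the thesis uses a regularized *total* least squares scheme with an estimated optimal parameter, which this sketch does not reproduce):

```python
# Toy illustration of regularized least squares, as used conceptually
# in the VER retrieval: minimize ||A x - b||^2 + lam * ||x||^2 via the
# normal equations (A^T A + lam I) x = A^T b. The thesis uses a
# regularized *total* least squares approach; this ordinary Tikhonov
# version only illustrates the effect of the regularization parameter.
def tikhonov_2x2(A, b, lam):
    """Solve (A^T A + lam*I) x = A^T b for a two-unknown problem."""
    M = [[sum(A[k][i] * A[k][j] for k in range(len(A)))
          + (lam if i == j else 0.0)
          for j in range(2)] for i in range(2)]
    v = [sum(A[k][i] * b[k] for k in range(len(A))) for i in range(2)]
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    return [(v[0] * M[1][1] - v[1] * M[0][1]) / det,
            (v[1] * M[0][0] - v[0] * M[1][0]) / det]

A = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy forward model
b = [1.0, 2.0, 3.1]                        # noisy measurements
print(tikhonov_2x2(A, b, lam=0.0))   # unregularized fit
print(tikhonov_2x2(A, b, lam=10.0))  # strong damping shrinks the solution
```

Choosing the regularization parameter well (too small: noise amplification; too large: oversmoothed profiles) is exactly the trade-off the thesis's optimal-parameter estimation addresses.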
Approaches to the Analysis of Proteomics and Transcriptomics Data based on Statistical Methodology
(2014)
Recent developments in genomics and molecular biology have led to the generation of an enormous amount of complex data of different origin. This is demonstrated by the number of published results from microarray experiments in Gene Expression Omnibus, which grew at an exponential pace over the last decade. The challenge of interpreting these vast amounts of data from different technologies led to the development of new methods in the fields of computational biology and bioinformatics. Researchers often want to represent biological phenomena in the most detailed and comprehensive way. However, due to technological limitations and other factors like limited resources, this is not always possible. On the one hand, more detailed and comprehensive research generates data of high complexity that is very often difficult to approach analytically, yet it gives bioinformatics a chance to draw more precise and deeper conclusions. On the other hand, for low-complexity tasks the data distribution is known and we can fit a mathematical model. To infer from this mathematical model, researchers can use well-known and standard methodologies. In return for using standard methodologies, the biological questions we are answering might not unveil the whole complexity of the biological meaning. Nowadays it is standard that a biological study involves the generation of large amounts of data that need to be analyzed with statistical inference. Sometimes the data challenge researchers with a low-complexity task that can be performed with standard and popular methodologies, as in Proteomic analysis of mouse oocytes reveals 28 candidate factors of the "reprogrammome". There, we established a protocol for proteomics data that involves preprocessing of the raw data and conducting Gene Ontology overrepresentation analysis utilizing the hypergeometric distribution.
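The hypergeometric overrepresentation test mentioned above can be sketched in a few lines: given a background of N genes, K of which carry a GO term, the p-value for observing k term members in a study set of n genes is the hypergeometric upper-tail probability (a minimal version; real pipelines add multiple-testing correction):

```python
# Minimal sketch of GO overrepresentation testing with the
# hypergeometric distribution: with N background genes, K of them
# annotated to a term, and a study set of n genes containing k term
# members, the p-value is P(X >= k) for X ~ Hypergeometric(N, K, n).
# Real pipelines additionally correct for multiple testing.
from math import comb

def hypergeom_pvalue(N, K, n, k):
    """Upper-tail probability P(X >= k)."""
    total = comb(N, n)
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(K, n) + 1)) / total

# 1000 background genes, 50 annotated with the term, and a study set
# of 20 genes of which 5 carry the annotation:
p = hypergeom_pvalue(N=1000, K=50, n=20, k=5)
print(f"{p:.2e}")
```

A small p-value indicates that the term is overrepresented in the study set relative to chance.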
In cases where the data complexity is high and there are no published frameworks a researcher could follow, randomization can be an approach to exploit. In two studies, The mouse oocyte proteome escapes maternal aging and CellFateScout - a bioinformatics tool for elucidating small molecule signaling pathways that drive cells in a specific direction, we showed how randomization can be performed for distinct complex tasks. In The mouse oocyte proteome escapes maternal aging, we constructed a random sample of semantic similarity scores between the oocyte transcriptome and random transcriptome subsets of oocyte proteome size. Thereby, we could assess whether the proteome is representative of the transcriptome. Further, we established a novel framework for Gene Ontology overrepresentation that involves randomization testing. Every Gene Ontology term is tested for whether randomly reassigning all gene labels of belonging or not belonging to this term will decrease the overall expression level in this term. In CellFateScout - a bioinformatics tool for elucidating small molecule signaling pathways that drive cells in a specific direction, we validated CellFateScout against other well-known bioinformatics tools. We asked whether our plugin is able to predict small molecule effects better in terms of expression signatures. For this, we constructed a protocol that uses randomization testing. We assess whether the small molecule effect, described as a (set of) active signaling pathways as detected by our plugin or other bioinformatics tools, is significantly closer to known small molecule targets than a random path.
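The label-reassignment tests described above follow the generic randomization-test recipe: compare an observed statistic with its distribution under random relabeling. A sketch with invented data and a simple mean-expression statistic (the studies' actual statistics differ):

```python
# Generic randomization-test sketch, in the spirit of the label-
# reassignment tests described above: the observed statistic is
# compared with its distribution under random reassignment of the
# term/non-term labels. Data and statistic are invented here.
import random

def permutation_pvalue(in_term, background, n_perm=10_000, seed=1):
    """Fraction of random gene sets of the same size whose mean
    expression is at least the observed mean of the term genes."""
    rng = random.Random(seed)
    pool = in_term + background
    observed = sum(in_term) / len(in_term)
    hits = 0
    for _ in range(n_perm):
        sample = rng.sample(pool, len(in_term))
        if sum(sample) / len(in_term) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # add-one to avoid p = 0

term_expression = [5.1, 4.8, 6.0]              # genes annotated to the term
other_expression = [1.0, 1.2, 0.9, 1.1, 5.0, 0.8]  # remaining genes
print(permutation_pvalue(term_expression, other_expression))
```

The add-one correction keeps the estimated p-value strictly positive, which is the standard convention for Monte-Carlo permutation tests.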
The processes underlying life are highly complex. They are largely carried out by proteins, which play a central role in the morphological structure and diversity as well as in the specificity of the capabilities of the various cell types. However, proteins do not act in isolation but by interacting with each other or with other molecules in the cell (DNA, metabolites, signaling molecules, etc.). If this network of finely tuned interactions falls out of balance, this can be a cause of disease. Knowledge of misregulated interactions can help to better understand the disease in question and to intervene against it. This dissertation deals with the identification of such differentially regulated interactions. In the course of this work, a method named ExprEssence was developed, which identifies those interactions in a protein-protein interaction network that differ most strongly between two compared states (e.g. diseased versus healthy). The aim is to reduce the network to the essential differences between the two states under investigation. To this end, gene expression or proteome data of the two states are integrated into the existing network. From these data, the strength/frequency of occurrence of the individual interactions of the network is estimated. The interactions whose interaction strengths differ most strongly between the considered states are retained; the remaining interactions are discarded. This yields a reduced subnetwork consisting of those interactions that are most strongly differentially regulated. These interactions and their proteins are candidates for explaining the biological differences between the considered states at the molecular level.
The method was applied to various biological questions and compared with other, similar methods. In the investigation of the differences between success and failure of chemotherapeutic breast cancer treatment, for example, it could be shown that the subnetwork created with ExprEssence is more strongly related to the mechanisms already known to be relevant for treatment success than the methods ExprEssence was compared with. Furthermore, an interaction possibly relevant for treatment success that had not been considered in this context before was identified in the subnetwork. Its importance could be further substantiated in the experimental follow-up. Another focus of the work was the investigation of the interactome of a specialized cell type of the kidney, the podocyte. This cell type is essential for the filtration competence of the kidney. An interaction network with interactions specifically relevant to the podocyte did not exist so far. Therefore, a podocyte-specific protein-protein interaction network was compiled from scientific publications and made publicly available. Gene expression data of various kinds, for example from podocytes at different developmental stages or in cell culture, were integrated into the network and analyzed with ExprEssence. For example, it could be shown that the dedifferentiation of podocytes kept in culture is not the reverse of the differentiation they previously underwent. In addition to ExprEssence, further software was developed that extends the applicability of ExprEssence: MovieMaker and ExprEsSector. MovieMaker visualizes the transitions between the considered states in a more comprehensible way. ExprEsSector builds the union and intersection networks of ExprEssence subnetworks.
In this way, for example, changes from the normal state that are common to different diseases can be identified. If a therapeutic approach that acts on a misregulated interaction already exists for one disease, and if this interaction is differentially regulated in the same way in another disease, it can be examined whether this therapy can be transferred to the second disease. In addition to the presentation and discussion of the results obtained, methodological drawbacks are also addressed. Strategies are presented for minimizing negative influences as far as possible, or for taking them into account when evaluating the results. In view of the ever faster growing amount of biological data, extracting the essential information from it has become a major challenge. With ExprEssence and the extensions MovieMaker and ExprEsSector, the integrative approach of combining information from different sources was implemented in a concept for identifying state-relevant molecular mechanisms in an intuitively accessible form.
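The core ExprEssence idea can be sketched as follows, assuming for illustration that an interaction's strength is estimated as the product of its two partners' expression levels and scored as a log-ratio between states; this is a simplification, not necessarily the exact ExprEssence score:

```python
# Sketch of the ExprEssence idea: estimate each interaction's strength
# from the expression of its two partners in each state and keep the
# interactions whose strength changes most between the states. The
# log-ratio of expression products used here is a simplifying
# assumption, not necessarily the exact ExprEssence score.
from math import log2

def differential_interactions(network, expr_a, expr_b, top=2):
    """network: list of (geneX, geneY) edges; expr_a / expr_b: gene ->
    expression in state A / state B. Returns the `top` edges with the
    largest absolute change in estimated interaction strength."""
    scored = []
    for x, y in network:
        score = log2((expr_b[x] * expr_b[y]) / (expr_a[x] * expr_a[y]))
        scored.append(((x, y), score))
    scored.sort(key=lambda e: abs(e[1]), reverse=True)
    return scored[:top]

network = [("A", "B"), ("B", "C"), ("C", "D")]
healthy = {"A": 2.0, "B": 2.0, "C": 1.0, "D": 4.0}   # toy state A
disease = {"A": 8.0, "B": 4.0, "C": 1.0, "D": 1.0}   # toy state B
print(differential_interactions(network, healthy, disease))
```

The retained edges form the reduced subnetwork of most strongly differentially regulated interactions; discarding the rest is what shrinks the network to the essential differences between the states.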