In this thesis, we elaborate on Bayesian changepoint analysis, focusing on three major topics: approximate sampling via MCMC, exact inference, and uncertainty quantification. Modeling questions are discussed throughout. Our findings are illustrated by several changepoint examples, with a focus on well-log drilling data.
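The exact-inference side of changepoint analysis can be illustrated with a minimal sketch (a simplification for illustration, not the machinery developed in the thesis): for Gaussian data with known noise level and a single changepoint under a uniform prior, the posterior over the changepoint location can be enumerated directly, here with the segment means profiled out by their averages.

```python
import numpy as np

def changepoint_posterior(y, sigma=1.0):
    """Posterior over a single changepoint location tau under a uniform
    prior and Gaussian noise with known sigma; segment means are profiled
    out by their averages (illustrative simplification)."""
    n = len(y)
    logpost = np.full(n, -np.inf)
    for tau in range(1, n):                      # split: y[:tau] | y[tau:]
        left, right = y[:tau], y[tau:]
        logpost[tau] = -(np.sum((left - left.mean()) ** 2)
                         + np.sum((right - right.mean()) ** 2)) / (2 * sigma ** 2)
    logpost -= logpost.max()                     # stabilize the exponentiation
    post = np.exp(logpost)
    return post / post.sum()

# Synthetic series with a mean shift halfway through
rng = np.random.default_rng(0)
y = np.concatenate([rng.normal(0.0, 1.0, 50), rng.normal(3.0, 1.0, 50)])
post = changepoint_posterior(y)
print(post.argmax())  # most probable changepoint location, near 50
```

With a mean shift of three noise standard deviations, the posterior concentrates sharply around the true changepoint; for weaker signals the spread of this distribution is exactly the kind of uncertainty quantification the abstract refers to.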
Aiming at the goal of individualized medicine, this dissertation develops a generic methodology to individualize risk factors and phenotypes via urinary metabolomic data. As metabolomic data can be seen as a holistic representation of the metabolism of an organism at a certain point in time, they contain information not only about current lifestyle factors such as diet and smoking but also about latent genetic traits. Utilizing this integrative attribute, the dissertation delivers a metric for biological age (the metabolic age score) which was shown to be informative beyond chronological age in three independent samples. It was associated with a broad range of age-related comorbidities in two large population-based cohorts, predicted mortality independently of classical risk factors and, moreover, predicted weight loss following bariatric surgery in a small sample of severely obese individuals.
Building on this work, the dissertation constructed a definitional framework justifying the procedure underlying the metabolic age score, delivering a general framework for the construction of individualized phenotypes and thereby an operationalization of individualization in statistical terms. Conceptualizing individualization as the process of differentiating individuals who show the same phenotype despite different underlying biological traits, it was shown formally that the prediction error of a statistical model approximating a phenotype is always informative about the underlying biology beyond the phenotype, provided the predictors fulfill certain statistical requirements. Thus, the prediction error facilitates the meaningful differentiation of individuals showing the same phenotype. The definitional framework presented here is not restricted to any particular kind of data and is therefore applicable to a broad range of medical research questions.
However, when utilizing metabolomic data, technical factors, data preprocessing, and pre-analytic features introduce unwanted variance into the statistical modeling. Thus, it is unclear whether predictive models like the metabolic age score are stable enough for clinical application. The third part of this doctoral thesis provides two statistical criteria for deciding which normalization method for removing dilution variance from urinary metabolome data performs best with respect to the erroneous variance introduced by the different methods, thereby aiding the minimization of biologically irrelevant variance in metabolomic analyses.
In conclusion, this doctoral thesis developed a general, applicable, definitional framework for the construction of individualized phenotypes, demonstrated the value of the methodology for clinical phenotypes on metabolomic data, and, along the way, improved the statistical treatment of urinary data with respect to dilution correction.
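The dilution problem mentioned above can be made concrete with probabilistic quotient normalization (PQN), one commonly used dilution correction for urinary metabolome data; this sketch illustrates the kind of method being compared, not the decision criteria the thesis proposes.

```python
import numpy as np

def pqn_normalize(X):
    """Probabilistic quotient normalization of a samples x metabolites
    matrix of positive intensities: estimate each sample's dilution
    factor as the median ratio to a median reference spectrum, then
    divide it out. (Illustrative sketch of one standard method.)"""
    X = np.asarray(X, dtype=float)
    reference = np.median(X, axis=0)          # median spectrum over samples
    quotients = X / reference                 # per-feature ratios
    dilution = np.median(quotients, axis=1)   # robust per-sample factor
    return X / dilution[:, None]
```

If every sample is the same true spectrum scaled by an unknown dilution factor, PQN recovers identical rows up to a common scale; the criteria in the thesis ask how much erroneous variance such a correction itself introduces on real data.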
Psychiatric disorders are highly heritable, but the underlying molecular mechanisms are largely unknown or poorly understood. For many disorders, candidate genes have been proposed that are either biologically motivated or based on large genome-wide association studies (GWAS). In this work, different approaches are presented to investigate the impact of genetic risk factors for major psychiatric disorders in the general population. These genetic risk variants include single nucleotide polymorphisms associated with schizophrenia or major depression and were analyzed using whole-genome information in polygenic scores or candidate marker analysis in gene-environment (GxE) studies. Genetic data from SHIP-0 and SHIP-TREND were used to calculate a polygenic risk score for schizophrenia. The association between this genetic score and brain alterations was examined in three independent samples (SHIP-2, SHIP-TREND and BIG), revealing no evidence of a common genetic basis for schizophrenia and brain structure. These results are in line with other studies that also failed to find a genetic overlap. The same polygenic scores were used in a PheWAS analysis in SHIP-0, where an inverse association with migraine was found. This association could be attributed to NMDA receptor activation via D-serine at the glutamatergic synapse. To assess the impact of environmental factors on the path from genes to phenotype, gene-environment interaction analyses were performed. A significant interaction was observed between rs7305115 (TPH2) and rs25531 (5-HTTLPR) and childhood abuse on the current depression score in SHIP-LEGEND and SHIP-TREND. In summary, genetic variants associated with major psychiatric disorders can exhibit pleiotropic effects on common phenotypes in the general population.
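At its core, a polygenic score is a weighted sum of risk-allele dosages over many SNPs, with GWAS effect sizes as weights. A minimal sketch (the actual SHIP pipeline involves quality control, clumping, and p-value thresholding not shown here):

```python
import numpy as np

def polygenic_score(dosages, betas):
    """score_i = sum_j beta_j * g_ij, where g_ij is the risk-allele
    dosage (0, 1, or 2) of person i at SNP j and beta_j is the GWAS
    effect size of SNP j. (Sketch of the generic computation only.)"""
    dosages = np.asarray(dosages, dtype=float)  # shape (n_people, n_snps)
    betas = np.asarray(betas, dtype=float)      # shape (n_snps,)
    return dosages @ betas

# Two hypothetical people, three hypothetical SNPs
g = np.array([[0, 1, 2],
              [2, 2, 0]])
b = np.array([0.10, -0.05, 0.20])
scores = polygenic_score(g, b)
print(scores)  # scores 0.35 and 0.10
```

Such scores are then tested for association with phenotypes, as in the brain-imaging and PheWAS analyses described above.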
Approaches to the Analysis of Proteomics and Transcriptomics Data based on Statistical Methodology
(2014)
Recent developments in genomics and molecular biology have led to the generation of an enormous amount of complex data of different origin. This is demonstrated by the number of published results from microarray experiments in Gene Expression Omnibus, which has grown at an exponential pace over the last decade. The challenge of interpreting these vast amounts of data from different technologies has led to the development of new methods in the fields of computational biology and bioinformatics. Researchers often want to represent biological phenomena in the most detailed and comprehensive way. However, due to technological limitations and other factors such as limited resources, this is not always possible. On the one hand, more detailed and comprehensive research generates data of high complexity that are often difficult to approach analytically, yet give bioinformatics a chance to draw more precise and deeper conclusions. On the other hand, for low-complexity tasks the data distribution is known and a mathematical model can be fitted; to infer from this model, researchers can use well-known, standard methodologies. In return for using standard methodologies, the biological questions being answered might not unveil the whole complexity of the biological meaning. Nowadays it is standard that a biological study involves the generation of large amounts of data that need to be analyzed with statistical inference. Sometimes the data present researchers with a low-complexity task that can be performed with standard and popular methodologies, as in "Proteomic analysis of mouse oocytes reveals 28 candidate factors of the 'reprogrammome'". There, we established a protocol for proteomics data that involves preprocessing of the raw data and conducting Gene Ontology overrepresentation analysis utilizing the hypergeometric distribution.
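The hypergeometric overrepresentation test mentioned here is standard: given N genes in the universe, K of them annotated to a GO term, and a selection of n genes of which k carry the annotation, the p-value is the hypergeometric upper tail. A minimal self-contained version (the published protocol includes preprocessing steps not shown):

```python
from math import comb

def hypergeom_overrep_p(k, K, n, N):
    """P(X >= k) for a hypergeometric draw: N genes in the universe,
    K annotated to the GO term, n genes selected, k of them annotated."""
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(K, n) + 1)) / comb(N, n)

# Toy numbers: all 5 selected genes annotated, 5 of 10 universe genes annotated
p = hypergeom_overrep_p(5, 5, 5, 10)
print(p)  # 1/252, about 0.004
```

A small p-value indicates the GO term is overrepresented in the selection relative to chance.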
In cases where the data complexity is high and there are no published frameworks a researcher could follow, randomization can be an approach to exploit. In two studies, "The mouse oocyte proteome escapes maternal aging" and "CellFateScout - a bioinformatics tool for elucidating small molecule signaling pathways that drive cells in a specific direction", we showed how randomization can be performed for distinct complex tasks. In "The mouse oocyte proteome escapes maternal aging" we constructed a random sample of semantic similarity scores between the oocyte transcriptome and random transcriptome subsets of oocyte proteome size. From this we could assess whether the proteome is representative of the transcriptome. Further, we established a novel framework for Gene Ontology overrepresentation analysis that involves randomization testing: every Gene Ontology term is tested for whether randomly reassigning the gene labels of belonging or not belonging to this term decreases the overall expression level in this term. In "CellFateScout - a bioinformatics tool for elucidating small molecule signaling pathways that drive cells in a specific direction" we validated CellFateScout against other well-known bioinformatics tools. We asked whether our plugin is able to predict small molecule effects better in terms of expression signatures. For this, we constructed a protocol that uses randomization testing: we assess whether the small molecule effect, described as a set of active signaling pathways detected by our plugin or other bioinformatics tools, is significantly closer to known small molecule targets than a random path.
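The randomization idea for GO terms described above can be sketched as a label-permutation test (a simplified illustration of the principle, not the published framework): compare the total expression of the genes annotated to a term against the totals obtained when the term membership labels are reassigned at random.

```python
import numpy as np

def go_term_permutation_test(expression, in_term, n_perm=2000, seed=0):
    """One-sided permutation test: is the total expression of genes
    labeled as belonging to a GO term higher than expected when the
    membership labels are reassigned at random?"""
    rng = np.random.default_rng(seed)
    expression = np.asarray(expression, dtype=float)
    in_term = np.asarray(in_term, dtype=bool)
    observed = expression[in_term].sum()
    k = int(in_term.sum())
    null = np.array([expression[rng.permutation(len(expression))[:k]].sum()
                     for _ in range(n_perm)])
    # add-one correction keeps the p-value strictly positive
    return (1 + np.sum(null >= observed)) / (1 + n_perm)
```

If the term's genes are strongly expressed, almost no random relabeling reaches the observed total and the p-value is small; terms whose expression is indistinguishable from random labelings yield large p-values.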
We introduce a multi-step machine learning approach and use it to classify data from EEG-based brain-computer interfaces. This approach works very well for high-dimensional EEG data. First, all features are divided into subgroups and linear discriminant analysis is used to obtain a score for each subgroup. The procedure is then applied to subgroups of the resulting scores and iterated until only one score remains, which is used for classification. In this way we avoid estimating the high-dimensional covariance matrix of all features. We investigate the classification performance with special attention to the small-sample-size case. For the normal model, we study the asymptotic error rate as the dimension p and the sample size n tend to infinity. This indicates how to define the sizes of the subgroups at each step. In addition, we present a theoretical error bound for the spatio-temporal normal model with separable covariance matrix, which results in a recommendation on how subgroups should be formed for this kind of data. Finally, techniques such as wavelets and independent component analysis are used to extract features from EEG-based brain-computer interface data.
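The iteration described above can be sketched as follows (an illustrative two-class implementation with a fixed subgroup size; the thesis derives how subgroup sizes should actually be chosen):

```python
import numpy as np

def lda_score(X, y):
    """Fisher LDA score for two classes: project onto w = Sw^{-1}(mu1 - mu0),
    with a small ridge term so low-dimensional subgroups stay well-posed."""
    X0, X1 = X[y == 0], X[y == 1]
    Sw = np.atleast_2d(np.cov(X0, rowvar=False) + np.cov(X1, rowvar=False))
    Sw = Sw + 1e-6 * np.eye(X.shape[1])
    w = np.linalg.solve(Sw, X1.mean(axis=0) - X0.mean(axis=0))
    return X @ w

def multistep_lda(X, y, group_size=4):
    """Multi-step idea: split the features into subgroups, replace each
    subgroup by its LDA score, and iterate on the scores until a single
    score remains. Only small covariance matrices are ever estimated."""
    Z = X
    while Z.shape[1] > 1:
        scores = [lda_score(Z[:, i:i + group_size], y)
                  for i in range(0, Z.shape[1], group_size)]
        Z = np.column_stack(scores)
    return Z[:, 0]
```

Each step only requires covariance matrices of size `group_size`, which is the point of the construction: the full p x p covariance matrix of all features is never estimated, a decisive advantage when the sample size is small relative to p.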
Parsimonious Histograms
(2010)
The dissertation is concerned with the construction of data-driven histograms. Histograms are the most elementary density estimators of all, yet they require the specification of the number and width of the bins. This thesis provides two new construction methods delivering adaptive histograms in which the required parameters are determined automatically. Both methods follow the principle of parsimony, i.e. the histograms are solutions of predetermined optimization problems; in both cases, though under different aspects, the number of bins is minimized. The dissertation presents the algorithms that solve the optimization problems and illustrates them with a number of numerical experiments. Important properties of the estimators are shown. Finally, the newly developed methods are compared with standard methods in an extensive simulation study: using synthetic samples of different sizes and distributions, the histograms are evaluated by dedicated performance criteria. As one main result, the proposed methods yield histograms with considerably fewer bins and with an excellent ability to detect peaks.
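The parsimony principle can be illustrated with a simple penalized-likelihood bin selector for equal-width histograms: the log-likelihood rewards fit, a per-bin penalty rewards few bins. This is an illustration of the trade-off in the spirit of the abstract, using a simple AIC-like penalty, not one of the thesis's two optimization problems.

```python
import numpy as np

def penalized_histogram_bins(x, max_bins=50):
    """Choose the number of equal-width bins k by maximizing
    log-likelihood minus (k - 1): each extra bin must pay for itself
    in improved fit. (Simple illustrative parsimony criterion.)"""
    x = np.asarray(x, dtype=float)
    n = len(x)
    best_k, best_score = 1, -np.inf
    for k in range(1, max_bins + 1):
        counts, edges = np.histogram(x, bins=k)
        width = edges[1] - edges[0]
        nz = counts[counts > 0]
        loglik = np.sum(nz * np.log(nz / (n * width)))  # histogram density fit
        score = loglik - (k - 1)                         # parsimony penalty
        if score > best_score:
            best_k, best_score = k, score
    return best_k
```

On a uniform sample this criterion selects very few bins, while on a well-separated bimodal sample it spends bins where structure exists, which mirrors the evaluation reported in the thesis: few bins, but reliable peak detection.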