
Approaches to the Analysis of Proteomics and Transcriptomics Data based on Statistical Methodology

  • Recent developments in genomics and molecular biology have led to the generation of an enormous amount of complex data of diverse origin. This is illustrated by the number of published microarray results in the Gene Expression Omnibus, which has grown exponentially over the last decade. The challenge of interpreting these vast amounts of data from different technologies has driven the development of new methods in computational biology and bioinformatics. Researchers often want to represent biological phenomena in the most detailed and comprehensive way. However, technological limitations and other factors such as limited resources mean that this is not always possible. On the one hand, more detailed and comprehensive research generates highly complex data that are often difficult to approach analytically, but that give bioinformatics the chance to draw more precise and deeper conclusions. On the other hand, for low-complexity tasks the data distribution is known and a mathematical model can be fitted; researchers can then use well-known, standard methodologies for inference from this model. The price of using standard methodologies is that the biological questions answered may not reveal the full complexity of the underlying biology. Nowadays it is standard for a biological study to generate large amounts of data that need to be analyzed by statistical inference. Sometimes the data pose a low-complexity task that can be handled with standard, popular methodologies, as in "Proteomic analysis of mouse oocytes reveals 28 candidate factors of the 'reprogrammome'". There, we established a protocol for proteomics data that involves preprocessing the raw data and conducting a Gene Ontology overrepresentation analysis based on the hypergeometric distribution.
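As an illustration of the hypergeometric overrepresentation test mentioned above, the following is a minimal sketch and not the protocol used in the thesis. The function name, its parameters, and the example counts are assumptions made for illustration. It computes the probability of seeing at least the observed number of term-annotated genes in a gene list drawn without replacement from a background population.

```python
from math import comb


def hypergeom_overrep_pvalue(population_size, annotated_in_population,
                             sample_size, annotated_in_sample):
    """Upper-tail hypergeometric p-value P(X >= annotated_in_sample).

    population_size         -- number of background genes (N)
    annotated_in_population -- background genes annotated to the term (K)
    sample_size             -- number of genes in the list of interest (n)
    annotated_in_sample     -- list genes annotated to the term (k)
    All names are hypothetical; they are not taken from the thesis.
    """
    N, K = population_size, annotated_in_population
    n, k = sample_size, annotated_in_sample
    if not (0 <= K <= N and 0 <= n <= N and 0 <= k <= min(K, n)):
        raise ValueError("inconsistent counts")
    # Number of ways to draw n genes from N.
    total = comb(N, n)
    # Sum the probabilities of drawing k, k+1, ..., min(K, n) annotated genes.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(K, n) + 1))
    return tail / total


# Made-up example: 40 of 2000 background genes carry the term,
# and 8 of the 100 genes in the list of interest carry it as well.
p_value = hypergeom_overrep_pvalue(2000, 40, 100, 8)
print(p_value)
```

In practice such p-values are computed for many Gene Ontology terms at once, so a multiple-testing correction would normally follow; that step is omitted here.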
In cases where the data complexity is high and there is no published framework to follow, randomization can be a useful approach. In two studies, "The mouse oocyte proteome escapes maternal aging" and "CellFateScout - a bioinformatics tool for elucidating small molecule signaling pathways that drive cells in a specific direction", we showed how randomization can be performed for distinct complex tasks. In "The mouse oocyte proteome escapes maternal aging", we constructed a random sample of semantic similarity scores between the oocyte transcriptome and random transcriptome subsets of the same size as the oocyte proteome. From this, we could calculate whether the proteome is representative of the transcriptome. Further, we established a novel framework for Gene Ontology overrepresentation that involves randomization testing: for every Gene Ontology term, we test whether randomly reassigning the labels that mark genes as belonging or not belonging to the term decreases the overall expression level in that term. In "CellFateScout - a bioinformatics tool for elucidating small molecule signaling pathways that drive cells in a specific direction", we validated CellFateScout against other well-known bioinformatics tools. We asked whether our plugin predicts small molecule effects better in terms of expression signatures. For this, we constructed a protocol that uses randomization testing: we assess whether the small molecule effect, described as a (set of) active signaling pathway(s) detected by our plugin or by other bioinformatics tools, is significantly closer to known small molecule targets than a random path.
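The label-reassignment test for a single Gene Ontology term described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the framework implemented in the thesis. The function name, the use of the mean as the "overall expression level", the number of permutations, and the toy data are all assumptions.

```python
import random


def term_permutation_pvalue(expression, in_term, n_permutations=10000, seed=0):
    """One-sided randomization p-value for one term (hypothetical helper).

    expression -- expression value per gene
    in_term    -- True/False per gene: does the gene belong to the term?
    Assumption: the term-level statistic is the mean expression of its genes.
    """
    rng = random.Random(seed)
    values = list(expression)
    labels = list(in_term)
    if len(values) != len(labels):
        raise ValueError("expression and in_term must have equal length")
    term_size = sum(1 for member in labels if member)
    if term_size == 0 or term_size == len(values):
        raise ValueError("term must contain some, but not all, genes")

    observed = sum(v for v, member in zip(values, labels) if member) / term_size

    at_least_as_high = 0
    for _ in range(n_permutations):
        # Drawing term_size genes without replacement is equivalent to
        # randomly reassigning the member / non-member labels.
        random_mean = sum(rng.sample(values, term_size)) / term_size
        if random_mean >= observed:
            at_least_as_high += 1

    # Add-one correction so the estimated p-value is never exactly zero.
    return (at_least_as_high + 1) / (n_permutations + 1)


# Toy data: the three term genes are the most highly expressed ones.
toy_expression = [10, 9, 8, 1, 1, 1, 1, 1, 1, 1]
toy_membership = [True, True, True] + [False] * 7
print(term_permutation_pvalue(toy_expression, toy_membership, n_permutations=2000))
```

A small p-value indicates that random label assignments rarely reach the observed expression level of the term, that is, reassigning the labels typically decreases it.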
  • Nowadays it is standard for biological studies to generate large amounts of data that are analyzed with inferential statistics. The statistical analysis of biological data at adequate resolution is presented through three thematic blocks: "Proteomic analysis of mouse oocytes reveals 28 candidate factors of the 'reprogrammome'", "The mouse oocyte proteome escapes maternal aging", and "CellFateScout - a bioinformatics tool for elucidating small molecule signaling pathways that drive cells in a specific direction". Diverse approaches such as statistical tests, randomization, and linear mixed models are used for the detailed analysis of proteomics and transcriptomics data, and criteria are presented for selecting statistical approaches with regard to the complexity of the data.

Author: Marcin Siatkowski
Title Additional (English): Approaches to the Analysis of Proteomics and Transcriptomics Data based on Statistical Methodology
Title Additional (German): Statistik-basierte Ansätze zur Analyse von Proteomik- und Transkriptomik-Daten
Advisor: Prof. Dr. Eckhardt Wolf, Prof. Dr. Volkmar Liebscher
Document Type: Doctoral Thesis
Date of Publication (online): 2014/04/24
Granting Institution: Ernst-Moritz-Arndt-Universität, Mathematisch-Naturwissenschaftliche Fakultät (until 31.05.2018)
Date of final exam: 2014/04/11
Release Date: 2014/04/24
Tag: bioinformatics, randomization, statistics
GND Keyword: aging, bioinformatics, drug, mRNA, microarray, oocyte, protein, proteome, randomization, reprogramming, small molecule, statistics
Faculties: Mathematisch-Naturwissenschaftliche Fakultät / Institut für Mathematik und Informatik
DDC class: 000 Computer science, information and general works / 000 Computer science, knowledge and systems / 004 Data processing and computer science
MSC-Classification: 62-XX STATISTICS / 62Pxx Applications [See also 90-XX, 91-XX, 92-XX] / 62P10 Applications to biology and medical sciences