Tools and techniques for assessing functional relevance of genomic loci


Thesis Type: Doctorate

Institution: Middle East Technical University, Faculty of Engineering, Department of Computer Engineering, Turkey

Thesis Approval Date: 2017

Student: BURÇAK OTLU SARITAŞ

Supervisor: TOLGA CAN

Abstract:

Genomic studies identify genomic loci representing genetic variations, transcription factor (TF) occupancy, or histone modifications through next-generation sequencing (NGS) technologies. Interpreting these loci requires evaluating them against known genomic and epigenomic annotations. In this thesis, we develop tools and techniques to assess the functional relevance of a set of genomic intervals. Toward this goal, we first introduce the Genomic Loci ANnotation and Enrichment Tool (GLANET), a comprehensive annotation and enrichment analysis tool. The input to GLANET is a set of genomic intervals. GLANET annotates these loci and performs enrichment analysis on them using a rich library that includes: (i) gene-centric regions that encompass their non-coding neighborhood, (ii) a large collection of regulatory regions from ENCODE, and (iii) gene sets derived from pathways. As a key feature, users can easily extend this library with new gene sets and genomic intervals. GLANET implements a sampling-based enrichment test that can account for the GC content and/or mappability biases inherent to NGS technologies; this test shows high statistical power and a well-controlled Type-I error rate. Other key features of GLANET include assessing the impact of single nucleotide variants on TF binding sites when the input consists only of SNPs, and gene set enrichment analysis that is not only exon-based but also regulation-based, which considers the introns and proximal regions of the genes in a gene set. GLANET also allows joint enrichment analysis for TF binding sites and KEGG pathways. With this option, users can evaluate whether the input set is enriched concurrently for binding sites of TFs and for the genes within a KEGG pathway. This joint enrichment analysis provides a detailed functional interpretation of the input loci.
As a second contribution, we design novel data-driven computational experiments for assessing the power and Type-I error of enrichment procedures. These experiments enable detailed quantitative comparisons of GLANET with other tools. Our results on these experiments showcase GLANET’s unique capabilities as well as its robustness, speed, and accuracy.
Finally, as a third contribution, we present an efficient algorithmic solution for finding common overlapping intervals over n interval sets. Our strategy first constructs one segment tree for each interval set and then converts each segment tree into an indexed segment tree forest by cutting the tree at a certain depth. Experiments on real data show that this data structure decreases the search time. This novel representation also enables parallel computations on each segment tree in the forest. We also extend this solution to the problem of finding at least k common overlapping intervals over n interval sets. The tools and techniques developed herein will hopefully expedite genomic research and help improve our understanding of the molecular biology of the cell and the mechanisms underlying diseases.
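
To make the interval-overlap problem concrete, the following is a minimal sketch of a baseline for it. It is not the indexed segment tree forest described in the abstract; it is a plain sweep-line pass over interval endpoints, a swapped-in technique chosen because it is short and easy to check. Several things here are assumptions made for illustration: intervals are half-open `[start, end)` integer pairs with `start < end`, "at least k common overlapping intervals" is read as "regions covered by at least k of the n interval sets", and the names `merge` and `regions_in_at_least_k` are hypothetical.

```python
from typing import List, Tuple

# Assumed convention: half-open integer intervals [start, end) with start < end.
Interval = Tuple[int, int]


def merge(intervals: List[Interval]) -> List[Interval]:
    """Merge overlapping or touching intervals within a single set,
    so that each set contributes at most 1 to the coverage depth."""
    merged: List[Interval] = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def regions_in_at_least_k(sets: List[List[Interval]], k: int) -> List[Interval]:
    """Return maximal regions covered by at least k of the interval sets.
    Setting k = len(sets) yields the regions common to all sets."""
    events: List[Tuple[int, int]] = []
    for intervals in sets:
        for start, end in merge(intervals):
            events.append((start, 1))   # a set starts covering here
            events.append((end, -1))    # a set stops covering here
    # Tuples sort by position first; at equal positions -1 sorts before +1,
    # so closings are processed before openings.
    events.sort()

    result: List[Interval] = []
    depth = 0
    region_start = 0
    for pos, delta in events:
        previous = depth
        depth += delta
        if previous < k <= depth:
            region_start = pos          # coverage just reached k sets
        elif depth < k <= previous:
            # Coverage just dropped below k sets; emit the region,
            # joining it to the previous one if the two touch.
            if result and result[-1][1] == region_start:
                result[-1] = (result[-1][0], pos)
            else:
                result.append((region_start, pos))
    return result


example_sets = [
    [(1, 10), (20, 30)],
    [(5, 15), (25, 35)],
    [(8, 12), (40, 50)],
]
common_to_all = regions_in_at_least_k(example_sets, 3)   # [(8, 10)]
in_two_or_more = regions_in_at_least_k(example_sets, 2)  # [(5, 12), (25, 30)]
```

The sweep sorts all endpoints once, so it answers a single batch query simply. It does not offer the repeated fast lookups or the per-tree parallelism that the abstract attributes to the segment tree forest.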