Investigation of the impacts of linkage disequilibrium on SNP selection studies


Thesis Type: Master's

Institution Where the Thesis Was Conducted: Middle East Technical University, Faculty of Engineering, Department of Industrial Engineering, Turkey

Thesis Approval Date: 2015

Student: EKİN KANTAR ÖZÇIRPAN

Advisor: CEM İYİGÜN

Abstract:

Many genome-wide association studies (GWAS) have tried to reveal the relation between SNPs and complex diseases. GWAS also produce large amounts of data covering relations among SNPs, phenotypes, and diseases, and many algorithms have been used to extract the desired information from these data. In this study, an algorithm with a key step based on linkage disequilibrium (LD) was constructed to eliminate redundant information from high-dimensional data and to find disease-related SNPs in GWAS more effectively. The algorithm developed in this study was tested on a prostate cancer data set downloaded from dbGaP. The web tool SNAP (SNP Annotation and Proxy Search) was used to obtain the SNPs in the LD region, which was determined by a threshold on r²; this threshold was set to 0.5. After obtaining a modified version of the original data set based on LD, we used Fisher's Combination Method to obtain a combined p-value for each SNP in this data set. Then, using the SNPnexus database, we tried to identify disease-related SNPs from both the original and the modified data sets, and the results on the two data sets were evaluated relative to each other. After eliminating the redundant data, we applied the SNPnexus analysis again, and the results showed that we were able to reach the desired genes using approximately half of the SNPs. In addition, the random forest algorithm was run on the data set of SNPs with individual p-values and on the modified data set of SNPs with combined p-values, and the outputs of the two runs were compared. A further aim of this study was to identify the most important regulatory SNPs (rSNPs) in GWAS.
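The r² threshold mentioned above can be illustrated with a minimal sketch. It assumes the standard definition of r² for two biallelic loci, computed from haplotype and allele frequencies. The thesis obtained its LD proxies from SNAP rather than computing r² directly, so the function names and inputs here are hypothetical and only show what the 0.5 cutoff means.

```python
def ld_r_squared(p_ab, p_a, p_b):
    """r^2 between two biallelic loci (standard definition, assumed here).

    p_ab: frequency of the haplotype carrying allele A and allele B
    p_a, p_b: frequencies of allele A and allele B
    """
    d = p_ab - p_a * p_b  # LD coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("r^2 is undefined when a locus is monomorphic")
    return d * d / denom


def in_ld(p_ab, p_a, p_b, threshold=0.5):
    """True if the pair meets the r^2 threshold (0.5 in the abstract)."""
    return ld_r_squared(p_ab, p_a, p_b) >= threshold
```

For example, `in_ld(0.5, 0.5, 0.5)` describes two perfectly correlated loci, while `in_ld(0.25, 0.5, 0.5)` describes two independent ones.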
Based on the data set modified using LD, we focused on non-coding SNPs, which are located in non-coding regions, across the whole genome. In conclusion, the number of important regulatory SNPs found from the modified data set is much higher than the number found using the original data set. The conclusion expected from this thesis is that studies on the prioritization of disease-related SNPs are affected by linkage disequilibrium (LD).
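As an illustration of the combination step named in the abstract, the following is a minimal sketch of Fisher's method in its textbook form. The abstract does not say which p-values were grouped for each SNP, so the grouping (for instance, a SNP together with its LD proxies) is an assumption, and `fisher_combined_p` is a hypothetical helper name. Fisher's method also assumes independent tests, which holds only approximately for SNPs in LD.

```python
import math


def fisher_combined_p(p_values):
    """Combine p-values with Fisher's method (textbook form).

    The statistic X = -2 * sum(ln p_i) follows a chi-square distribution
    with 2k degrees of freedom under the null, where k = len(p_values).
    """
    if not p_values:
        raise ValueError("need at least one p-value")
    if any(not (0.0 < p <= 1.0) for p in p_values):
        raise ValueError("p-values must lie in (0, 1]")

    k = len(p_values)
    half_stat = -sum(math.log(p) for p in p_values)  # equals X / 2

    # For even degrees of freedom 2k, the chi-square upper tail has the
    # closed form exp(-X/2) * sum_{i=0}^{k-1} (X/2)^i / i!
    term = 1.0
    total = 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return math.exp(-half_stat) * total
```

A call such as `fisher_combined_p([0.05, 0.05])` gives a combined p-value smaller than either input, reflecting the joint evidence from the two tests.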