Thesis Type: Postgraduate
Institution Of The Thesis: Middle East Technical University, Turkey
Approval Date: 2020
Student: Fatma Cankara
Consultant: NURCAN TUNÇBAĞAbstract:
Whole-genome and exome sequencing studies have indicated that genomic variations may cause deleterious effects on protein functionality via various mechanisms. Single nucleotide variations that alter the protein sequence, and thus, the structure and the function, namely non-synonymous SNPs (nsSNP), are associated with many genetic diseases in human. The current rate of manually annotating the reported nsSNPs cannot catch up with the rate of producing new sequencing data. To aid this process, automated computational approaches are being developed and applied on the unknown data. In this study, we propose a new methodology to collect and organize the information related to the effects of nsSNPs at the amino acid sequence level from various biological databases and to utilize this information in a supervised machine-learning based system to predict the function disrupting capacities of mutations with unknown consequences. For this, 157,138 annotated mutation data points (89,363 deleterious and 67,775 neutral) were collected from multiple resources such as UniProt, ClinVar and Protein Mutant Database. For each mutation data point, a feature vector was constructed using protein 3-D structure information and site-specific feature annotations in the UniProt database. The information about the spatial proximity of the reported mutations to these protein features were also incorporated to the feature vector. The system was trained with these feature vectors and their respective labels in a supervised fashion using random forest, where the ultimate aim was to construct a model that classifies unknown mutations either as deleterious or neutral. The prediction model was evaluated in detail to observe the contribution of different feature types to the prediction success. The finalized model displayed a satisfactory performance (AUROC:0.86, precision: 0.77, recall 0:90, accuracy: 0.78, F1-score: 0.83 and MCC: 0.54) on the independent test dataset. Besides, the performance of the proposed model was compared to the widely used variant effect predictors in the literature, over standard benchmark datasets. As future work, we plan to conduct a case study over interesting prediction examples and to validate our results via literature-based information. Finally, we plan to construct a ready-to-use command line based variant effect prediction tool and to share it with the research community over an open access data repository. We believe that this system will be complementary to the well-known methods in the literature and its incorporation to ensemble-based tools will increase the performance of the state-of-the-art in variant effect prediction.