An extension to GOPred to annotate Swiss-Prot and Trembl sequences for all gene ontology categories and EC numbers


Thesis Type: Master's

Institution: Middle East Technical University, Faculty of Engineering, Department of Computer Engineering, Turkey

Thesis Approval Date: 2015

Student: AHMET SÜREYYA RİFAİOĞLU

Co-Supervisors: MEHMET VOLKAN ATALAY, RENGÜL ATALAY

Abstract:

Traditional protein function annotation methods cannot keep pace with the exponentially growing number of proteins with known sequences. For this reason, protein function prediction has become an important research area. In this thesis, an improved version of the GOPred method is applied to the protein function prediction problem. GOPred combines the SPMap, Blast-kNN and Pepstats methods, which are subsequence-based, similarity-based and feature-based, respectively. The previous version of GOPred classified proteins functionally based on 300 molecular function Gene Ontology (GO) terms. In this study, the improved system is trained for 514 molecular function, 2909 biological process and 438 cellular component GO terms. The system is also applied to the functional prediction of enzymes based on 851 Enzyme Commission (EC) numbers. A hierarchical evaluation of predictions is proposed to give reliable predictions for EC numbers. In addition, a new method is used to calculate an optimal decision threshold for each functional term, and only predictions whose scores are over the corresponding threshold are reported. Performance is measured separately for each functional term, and the per-term results are averaged to evaluate the system. GO term prediction results show that the performance of our system is better for the prediction of multi-functional proteins. To the best of our knowledge, this is the best performance achieved for EC number prediction in the literature. The improved system is tested on about 58 million TrEMBL proteins to compare its predictions with those of the reference systems that provide annotations for the TrEMBL database, namely EMBL, HAMAP, PDB, PIR, PIRNR and RuleBase. The results show that most of the predictions given by our system are consistent with those given by the other systems.
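
The per-term thresholding described in the abstract can be illustrated with a minimal Python sketch. The abstract does not state which criterion the thesis optimizes, so maximizing F1 on held-out scores is an assumption made here for illustration only. The function names (`f1_at_threshold`, `best_threshold`, `thresholds_per_term`, `predict_terms`), the data layout, and the example scores are hypothetical and are not taken from GOPred.

```python
from typing import Dict, List, Sequence, Tuple


def f1_at_threshold(scores: Sequence[float], labels: Sequence[bool],
                    threshold: float) -> float:
    """F1 score when every score >= threshold is called positive."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(scores: Sequence[float], labels: Sequence[bool]) -> float:
    """Return the observed score that maximizes F1 when used as a cutoff.

    Assumption: F1 is the selection criterion; the thesis may use another.
    Ties are resolved in favor of the lowest candidate threshold.
    """
    if not scores:
        raise ValueError("at least one validation score is required")
    candidates = sorted(set(scores))
    return max(candidates, key=lambda t: f1_at_threshold(scores, labels, t))


def thresholds_per_term(
    validation: Dict[str, Tuple[List[float], List[bool]]]
) -> Dict[str, float]:
    """Compute one threshold per functional term (GO term or EC number)."""
    return {term: best_threshold(s, y) for term, (s, y) in validation.items()}


def predict_terms(protein_scores: Dict[str, float],
                  thresholds: Dict[str, float]) -> List[str]:
    """Report only the terms whose score reaches that term's own threshold."""
    return sorted(t for t, s in protein_scores.items() if s >= thresholds[t])


# Made-up validation scores and labels for two GO terms (illustrative only).
validation = {
    "GO:0003824": ([0.9, 0.8, 0.4, 0.3, 0.1], [True, True, False, True, False]),
    "GO:0005515": ([0.7, 0.2], [True, False]),
}
thresholds = thresholds_per_term(validation)
predicted = predict_terms({"GO:0003824": 0.35, "GO:0005515": 0.5}, thresholds)
```

Using a separate cutoff per term, rather than one global cutoff, lets a term with well-separated scores keep a strict threshold while a harder term keeps a more permissive one. In this toy data the same protein is reported for one term and not for the other even though its score for the rejected term is higher.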