A classification system for the problem of protein subcellular localization


Thesis Type: Master's

Institution Where the Thesis Was Conducted: Orta Doğu Teknik Üniversitesi (Middle East Technical University), Faculty of Engineering, Department of Computer Engineering, Turkey

Thesis Approval Date: 2007

Student: GÖKÇEN ALAY

Co-Supervisors: TOLGA CAN, MEHMET VOLKAN ATALAY

Abstract:

This study focuses on predicting the subcellular localization of proteins. Subcellular localization information is important for protein function annotation, which is a fundamental problem in computational biology. To address this problem, we built a classification system with two main parts: a predictor based on a feature mapping technique that extracts biologically meaningful information from protein sequences, and a client/server architecture for searching and predicting subcellular localizations. In the first part of the thesis, we describe a feature mapping technique based on frequent patterns. Frequent patterns in a protein sequence dataset are identified using a search technique based on the Apriori property, and the distribution of these patterns over a new sample is used as the feature vector for classification. The effect of several feature selection methods on classification performance is investigated, and the best-performing one is applied. The method is assessed on the subcellular localization prediction problem with four compartments (endoplasmic reticulum (ER) targeted, cytosolic, mitochondrial, and nuclear), using the same dataset as P2SL. Our method raises the overall accuracy from the 81.96% obtained by P2SL to 91.71%. In the second part of the thesis, a client/server architecture based on Simple Object Access Protocol (SOAP) technology is designed and implemented; it provides a user-friendly interface for accessing protein subcellular localization predictions. The client is a Cytoscape plug-in used for functional enrichment of biological networks. Rather than using subcellular localization information for individual proteins in isolation, this plug-in lets biologists analyze a set of genes/proteins from a systems-level view.
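
To make the frequent-pattern feature mapping more concrete, the following is a minimal sketch and not the thesis implementation. It assumes that a pattern is a contiguous substring of amino-acid letters, that support is the fraction of training sequences containing the pattern, and that the feature vector holds raw (overlapping) occurrence counts; the abstract does not specify any of these details. The function names `frequent_patterns` and `feature_vector`, the parameters `min_support` and `max_len`, and the toy sequences are hypothetical. The search uses the Apriori (a priori) property: a pattern of length k can be frequent only if both of its length k-1 sub-patterns are frequent.

```python
from collections import Counter


def frequent_patterns(sequences, min_support, max_len):
    """Level-wise (Apriori-style) search for frequent contiguous substrings.

    Assumption: support = fraction of sequences that contain the pattern.
    Returns a dict mapping each frequent pattern to its support.
    """
    n = len(sequences)
    if n == 0:
        return {}

    # Level 1: single residues, counted once per sequence.
    counts = Counter()
    for seq in sequences:
        counts.update(set(seq))
    current = {p for p, c in counts.items() if c / n >= min_support}
    frequent = {p: counts[p] / n for p in current}

    k = 1
    while current and k < max_len:
        k += 1
        counts = Counter()
        for seq in sequences:
            seen = set()
            for i in range(len(seq) - k + 1):
                cand = seq[i:i + k]
                # Apriori pruning: keep a candidate only if both of its
                # (k-1)-length sub-patterns were frequent at the previous level.
                if cand[:-1] in current and cand[1:] in current:
                    seen.add(cand)
            counts.update(seen)
        current = {p for p, c in counts.items() if c / n >= min_support}
        frequent.update({p: counts[p] / n for p in current})
    return frequent


def feature_vector(sequence, patterns):
    """Map a sequence to overlapping occurrence counts of each pattern."""
    return [
        sum(1 for i in range(len(sequence) - len(p) + 1)
            if sequence.startswith(p, i))
        for p in patterns
    ]


# Toy example with made-up sequences (not data from the thesis).
train = ["MKVLA", "MKVTA", "AMKVL"]
pats = frequent_patterns(train, min_support=1.0, max_len=3)
ordered = sorted(pats, key=lambda p: (len(p), p))
print(ordered)                            # ['A', 'K', 'M', 'V', 'KV', 'MK', 'MKV']
print(feature_vector("MKVMKV", ordered))  # [0, 2, 2, 2, 2, 2, 2]
```

In a full pipeline, vectors like these would be passed through a feature selection step and then to a classifier; the abstract does not name the selection methods or classifier used, so they are left out of this sketch.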