Prediction of protein-protein interactions from sequence using evolutionary relations of proteins and species


Tezin Türü: Yüksek Lisans

Tezin Yürütüldüğü Kurum: Orta Doğu Teknik Üniversitesi, Mühendislik Fakültesi, Bilgisayar Mühendisliği Bölümü, Türkiye

Tezin Onay Tarihi: 2009

Öğrenci: TACETTİN DOĞACAN GÜNEY

Danışman: TOLGA CAN

Özet:

Prediction of protein-protein interactions is an important part in understanding the biological processes in a living cell. There are completely sequenced organisms that do not yet have experimentally verified protein-protein interaction networks. For such organisms, we can not generally use a supervised method, where a portion of the protein-protein interaction network is used as training set. Furthermore, for newly-sequenced organisms, many other data sources, such as gene expression data and gene ontology annotations, that are used to identify protein-protein interaction networks may not be available. In this thesis work, our aim is to identify and cluster likely protein-protein interaction pairs using only sequence of proteins and evolutionary information. We use a protein’s phylogenetic profile because the co-evolutionary pressure hypothesis suggests that proteins with similar phylogenetic profiles are likely to interact. We also divide phylogenetic profile into smaller profiles based on the evolutionary lines. These divided profiles are then used to score the similarity between all possible protein pairs. Since not all profile groups have the same number of elements, it is a difficult task to assess the similarity between such pairs. We show that many commonly used measures do not work well and that the end result greatly depends on the type of the similarity measure used. We also introduce a novel similarity measure. The resulting dense putative interaction network contains many false-positive interactions, therefore we apply the Markov Clustering algorithm to cluster the protein-protein interaction network and filter out the weaker edges. The end result is a set of clusters where proteins within the clusters are likely to be functionally linked and to interact. While this method does not perform as well as supervised methods, it has the advantage of not requiring a training set and being able to work only using sequence data and evolutionary information. So it can be used as a first step in identifying protein-protein interactions in newly-sequenced organisms.