Similarity search in protein sequence databases using metric access methods


Cetintas A., Sacan A., Toroslu İ. H.

5th International Conference on Bioinformatics and Computational Biology 2013, BICoB 2013, Honolulu, HI, United States of America, 4-6 March 2013, pp.131-136

  • Publication Type: Conference Paper / Full Text
  • Volume:
  • City: Honolulu, HI
  • Country: United States of America
  • Page Numbers: pp.131-136

Abstract

The rapid growth of biological sequence data, driven by advances in high-throughput sequencing, together with the increasing complexity of hypothesis-driven exploration of these data, which requires a massive number of similarity queries, calls for new approaches to managing and analyzing sequence databases. Representing sequences in a metric space is well suited to similarity search and makes several sophisticated metric-indexing techniques available. In this work, we provide a thorough survey and analysis of the application of metric access methods to similarity search in protein sequence databases. We develop a framework that supports different metric space indexing methods and use a non-redundant sequence database to benchmark them in terms of the number of distance computations incurred and the computation time required during the database compilation and query phases. The parameters of each method are optimized on a subset of the experimental conditions. We demonstrate that Onion-Tree, a hybrid metric access method, performs best in both the index-building and querying phases for the protein database investigated and scales well to large databases, incurring distance computations with 0.5% of the database sequences per query.
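To make the benchmark criterion concrete, the following minimal sketch shows how a metric index can answer a range query while computing far fewer distances than a linear scan. It is not Onion-Tree and not the authors' framework: it is a plain pivot table that prunes candidates with the triangle inequality. The names `edit_distance`, `CountingMetric`, and `PivotIndex` are hypothetical, and unit-cost edit distance is assumed here only because it is a simple true metric on strings; the abstract does not state which distance the paper uses.

```python
from typing import Callable, List, Sequence


def edit_distance(a: str, b: str) -> int:
    """Unit-cost Levenshtein distance (assumed metric for illustration)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute
        prev = cur
    return prev[-1]


class CountingMetric:
    """Wraps a distance function and counts how often it is evaluated."""

    def __init__(self, fn: Callable[[str, str], int]) -> None:
        self.fn = fn
        self.calls = 0

    def __call__(self, a: str, b: str) -> int:
        self.calls += 1
        return self.fn(a, b)


class PivotIndex:
    """Hypothetical pivot-table index; a simple stand-in for a metric access method."""

    def __init__(self, db: Sequence[str], pivots: Sequence[str],
                 dist: Callable[[str, str], int]) -> None:
        self.db = list(db)
        self.pivots = list(pivots)
        self.dist = dist
        # Build phase: one distance per (sequence, pivot) pair.
        self.table = [[dist(s, p) for p in self.pivots] for s in self.db]

    def range_query(self, q: str, radius: int) -> List[str]:
        """Return all database sequences within `radius` of `q`."""
        q_to_p = [self.dist(q, p) for p in self.pivots]
        hits = []
        for s, row in zip(self.db, self.table):
            # Triangle inequality: |d(q,p) - d(s,p)| <= d(q,s).
            # If the lower bound exceeds the radius, s cannot be a hit.
            if any(abs(qp - sp) > radius for qp, sp in zip(q_to_p, row)):
                continue
            if self.dist(q, s) <= radius:
                hits.append(s)
        return hits
```

Resetting `calls` to zero after construction and reading it after `range_query` gives the per-query number of distance computations, which is the kind of quantity the abstract reports as a fraction of the database size. How much pruning a real method achieves depends on the data, the metric, and the choice of pivots.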