Approximate similarity search in genomic sequence databases using landmark-guided embedding


Sacan A., TOROSLU İ. H.

24th IEEE International Conference on Data Engineering / 1st International Workshop on Secure Semantic Web, Cancun, Mexico, 7-12 April 2008, pp. 498-499

  • Publication Type: Conference Paper / Full-Text Paper
  • DOI: 10.1109/icdew.2008.4498343
  • City of Publication: Cancun
  • Country of Publication: Mexico
  • Pages: pp. 498-499
  • Affiliated with Middle East Technical University: Yes

Abstract

Similarity search in sequence databases is of paramount importance in bioinformatics research. As genomic databases grow, similarity search of proteins in these databases becomes a bottleneck in large-scale studies, calling for more efficient methods of content-based retrieval. In this study, we present a metric-preserving, landmark-guided embedding approach that represents sequences in the vector domain to allow efficient indexing and similarity search. We analyze various properties of the embedding and show that the approximation provided by the embedded representation is sufficient to obtain biologically relevant results. The approximate representation is shown to provide a speed-up of several orders of magnitude in similarity search compared to the exact representation, while maintaining comparable search accuracy.
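
The abstract does not spell out how the embedding is constructed, so the following is only a minimal sketch of the general idea behind landmark-based embedding, not the authors' method. It assumes plain Levenshtein edit distance as the sequence metric, a hand-picked list of landmark sequences, and the Chebyshev (maximum-coordinate) distance in the embedded space; the function names `edit_distance`, `embed`, `chebyshev`, and `approx_knn` are hypothetical. Each sequence is mapped to the vector of its distances to the landmarks. By the triangle inequality, the Chebyshev distance between two such vectors never exceeds the true distance between the sequences, which is what makes the cheap vector comparison a usable approximation.

```python
from typing import List, Sequence


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance (assumed metric; the paper may use a different one)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def embed(seq: str, landmarks: Sequence[str]) -> List[int]:
    """Map a sequence to the vector of its distances to each landmark."""
    return [edit_distance(seq, lm) for lm in landmarks]


def chebyshev(u: Sequence[int], v: Sequence[int]) -> int:
    """Maximum coordinate difference; a lower bound on the true distance."""
    return max(abs(x - y) for x, y in zip(u, v))


def approx_knn(query: str, database: Sequence[str],
               landmarks: Sequence[str], k: int) -> List[str]:
    """Rank database sequences by distance in the embedded space only."""
    q_vec = embed(query, landmarks)
    # In practice the database vectors would be computed once and indexed.
    scored = [(chebyshev(q_vec, embed(s, landmarks)), s) for s in database]
    scored.sort(key=lambda pair: pair[0])
    return [s for _, s in scored[:k]]


if __name__ == "__main__":
    landmarks = ["ACGT", "TTTT"]          # illustrative landmarks
    database = ["ACGA", "TTTA", "GGGG"]   # illustrative toy database
    print(approx_knn("ACGT", database, landmarks, k=1))
```

In this sketch a query costs one distance computation per landmark plus cheap vector comparisons, instead of one sequence comparison per database entry. How the landmarks are chosen and how the vectors are indexed are left open here.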