A database query based solution for chemical compound and drug name recognition


Thesis Type: Master's

Institution Where the Thesis Was Conducted: Middle East Technical University, Faculty of Engineering, Department of Computer Engineering, Turkey

Thesis Approval Date: 2014

Student: ÇAĞLAR ATA

Advisor: TOLGA CAN

Abstract:

Searching for structured information in unstructured free text is one of the most difficult challenges in computer science. Relevant information must be retrieved from documents not only with high precision but also quickly. Although numerous studies on document searching have been published, only a few of them specifically target chemical compound and drug names. Chemical compound and drug names have specific morphological properties, which have to be examined before developing automatic text searching methods and should be integrated into chemical compound and drug name retrieval systems. In this thesis, we focus on the named entity recognition problem and propose a new chemical compound and drug name recognition model that uses queries on a highly domain-specific database. The PubChem Power User Gateway (PUG) system is used as the main database for this domain to demonstrate the method. The grammar and morphological properties of chemical compound and drug names are used as the basis for constructing the model. These features are examined in depth and used to optimize the queries and to increase both recall and precision in finding relevant chemical compound and drug names in documents. The proposed method also introduces a chemical compound and drug name tokenizer designed specifically for tokenizing chemical words in an article. The proposed method is applied to a significant number of documents containing chemical compound and drug names. The results of our proposed method are compared against state-of-the-art methods that target the same problem.
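To make the general idea of the abstract concrete, the following is a minimal sketch of a chemistry-aware tokenizer combined with a database lookup, and it is not the thesis's actual implementation. Everything in it is an assumption made for illustration: the regular expression, the function names `chemical_tokenize`, `pug_rest_url`, and `recognize`, and the use of the PubChem PUG REST name-lookup URL as a stand-in for the PUG system named in the abstract. The lookup itself is passed in as a function so the sketch runs without network access.

```python
import re
from urllib.parse import quote

# Hypothetical token pattern (an assumption, not the thesis's tokenizer):
# keep hyphens, commas, and primes that sit between alphanumeric runs,
# so names like "N,N-dimethylformamide" stay in one piece.
TOKEN_RE = re.compile(r"[A-Za-z0-9]+(?:[-,'][A-Za-z0-9]+)*")


def chemical_tokenize(text):
    """Split text into tokens without breaking chemical-style names."""
    return TOKEN_RE.findall(text)


def pug_rest_url(name):
    """Build a PubChem PUG REST URL that looks up compound IDs by name.

    Assumption: PUG REST is used here for illustration only; the abstract
    refers to the PubChem Power User Gateway (PUG) system.
    """
    base = "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/"
    return base + quote(name, safe="") + "/cids/JSON"


def recognize(text, is_known):
    """Return the tokens that the supplied lookup function accepts.

    `is_known` is a hypothetical callable (token -> bool); in a real
    system it would issue a database query, for example against the URL
    built by `pug_rest_url`.
    """
    return [tok for tok in chemical_tokenize(text) if is_known(tok)]


if __name__ == "__main__":
    # Offline stand-in for the database: a tiny hand-written name set.
    known = {"aspirin", "n,n-dimethylformamide", "1,2-dichloroethane"}
    sample = ("Patients received aspirin and N,N-dimethylformamide, "
              "then 1,2-dichloroethane.")
    print(recognize(sample, lambda tok: tok.lower() in known))
    print(pug_rest_url("N,N-dimethylformamide"))
```

Injecting the lookup keeps the tokenizer testable on its own. A generic whitespace-and-punctuation tokenizer would split "N,N-dimethylformamide" at the comma, so the database would never see the full name; keeping such separators inside the token is the kind of morphological property the abstract says the method relies on.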