Named Entity Recognition Experiments on Turkish Texts


Kuecuek D., YAZICI A.

8th International Conference on Flexible Query Answering Systems, Roskilde, Denmark, 26 - 28 October 2009, vol.5822, pp.524-535

  • Publication Type: Conference Paper / Full-Text Paper
  • Volume Number: 5822
  • City of Publication: Roskilde
  • Country of Publication: Denmark
  • Pages: pp.524-535
  • Keywords: information extraction, named entity recognition, Turkish
  • Affiliated with Middle East Technical University: Yes

Abstract

Named entity recognition (NER) is one of the main information extraction tasks, and research on NER from Turkish texts is known to be rare. In this study, we present a rule-based NER system for Turkish which employs a set of lexical resources and pattern bases for the extraction of named entities, including the names of people, locations, and organizations, together with time/date and money/percentage expressions. The domain of the system is news texts, and it does not utilize the important clues of capitalization and punctuation, since they may be missing in texts obtained from the Web or in the output of automatic speech recognition tools. The system is evaluated on news texts along with other genres, namely child stories and historical texts. As expected for manually engineered rule-based systems, its performance degrades on these latter genres, since they are distinct from the target domain of news texts. Furthermore, the system is evaluated on transcriptions of news videos, leading to satisfactory results, which is an important step towards the employment of NER in the automatic semantic annotation of videos in Turkish. The current study is significant as the first rule-based approach to the NER task on Turkish texts, with an evaluation on diverse text types.
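As a rough illustration of the general idea described in the abstract (lexical resources plus pattern bases, applied without relying on capitalization or punctuation), a minimal Python sketch might look as follows. It is not the authors' system: the gazetteer entries, the two regular-expression patterns, the label names, and the function `recognize` are all hypothetical and were chosen only for this example.

```python
import re

# Hypothetical, tiny lexical resources (illustrative only, not from the paper).
GAZETTEER = {
    "PERSON": {"ahmet", "ayşe"},
    "LOCATION": {"ankara", "roskilde"},
    "ORGANIZATION": {"tbmm"},
}

# Hypothetical pattern base: regular expressions applied to lowercased text.
PATTERNS = [
    ("PERCENT", re.compile(r"yüzde \d+")),              # e.g. "yüzde 20"
    ("MONEY", re.compile(r"\d+ (?:lira|dolar|euro)")),  # e.g. "100 lira"
]


def recognize(text):
    """Return (label, surface) pairs without using capitalization clues."""
    # Assumption: plain str.lower() is enough for this sketch. Real Turkish
    # text needs special handling of dotted and dotless i.
    normalized = text.lower()
    entities = []

    # Lexical lookup: match whitespace-separated tokens against the gazetteer.
    for token in normalized.split():
        for label, names in GAZETTEER.items():
            if token in names:
                entities.append((label, token))

    # Pattern lookup: match numeric expressions with the pattern base.
    for label, pattern in PATTERNS:
        for match in pattern.finditer(normalized):
            entities.append((label, match.group()))

    return entities
```

Because the input is lowercased before matching, `recognize("ahmet ankara da yüzde 20 zam istedi")` finds the person, location, and percentage expression even though the text has no capital letters or punctuation. A realistic system would also need morphological handling of Turkish suffixes, which this sketch leaves out.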