A hybrid named entity recognizer for Turkish


Kucuk D., YAZICI A.

EXPERT SYSTEMS WITH APPLICATIONS, cilt.39, sa.3, ss.2733-2742, 2012 (SCI-Expanded) identifier identifier

  • Yayın Türü: Makale / Tam Makale
  • Cilt numarası: 39 Sayı: 3
  • Basım Tarihi: 2012
  • Doi Numarası: 10.1016/j.eswa.2011.08.131
  • Dergi Adı: EXPERT SYSTEMS WITH APPLICATIONS
  • Derginin Tarandığı İndeksler: Science Citation Index Expanded (SCI-EXPANDED), Scopus
  • Sayfa Sayıları: ss.2733-2742
  • Anahtar Kelimeler: Information extraction, Hybrid named entity recognizer, Named entity recognition in Turkish, INFORMATION EXTRACTION
  • Orta Doğu Teknik Üniversitesi Adresli: Evet

Özet

Named entity recognition is an important subfield of the broader research area of information extraction from textual data. Yet, named entity recognition research conducted on Turkish texts is still rare as compared to related research carried out on other languages such as English, Spanish, Chinese, and Japanese. In this study, we present a hybrid named entity recognizer for Turkish, which is based on a manually engineered rule based recognizer that we have proposed. Since rule based systems for specific domains require their knowledge sources to be manually revised when ported to other domains, we enrich our rule based recognizer and turn it into a hybrid recognizer so that it learns from annotated data when available and improves its knowledge sources accordingly. The hybrid recognizer is originally engineered for generic news texts, but with its learning capability, it is improved to be applicable to that of financial news texts, historical texts, and child stories as well, without human intervention. Both the hybrid recognizer and its rule based predecessor are evaluated on the same corpora and the hybrid recognizer achieves better results as compared to its predecessor. The proposed hybrid named entity recognizer is significant since it is the first hybrid recognizer proposal for Turkish addressing the above porting problem considering that Turkish possesses different structural properties compared to widely studied languages such as English and there is very limited information extraction research conducted on Turkish texts. Moreover, the employment of the proposed hybrid recognizer for semantic video indexing is shown as a case study on Turkish news videos. The genuine textual and video corpora utilized throughout the paper are compiled and annotated by the authors due to the lack of publicly available annotated corpora for information extraction research on Turkish texts. (C) 2011 Elsevier Ltd. All rights reserved.