Named entity recognition in Turkish with Bayesian learning and hybrid approaches


Thesis Type: Postgraduate

Institution Of The Thesis: Orta Doğu Teknik Üniversitesi, Faculty of Engineering, Department of Computer Engineering, Turkey

Approval Date: 2011

Student: SERMET REHA YAVUZ

Supervisor: ADNAN YAZICI

Abstract:

Information Extraction (IE) is the process of extracting structured, important pieces of information from a set of unstructured natural-language text documents. The final goal of structured information extraction is to populate a database and access the data effectively. Our study focuses on named entity recognition (NER), which is an important subtask of IE. NER is the task of extracting named entities such as person, location and organization names, temporal expressions (date and time) and numerical expressions (money and percent). NER research on Turkish is scarce. There are rule-based, learning-based and hybrid systems for NER on Turkish texts. The learning approaches used for NER in Turkish include conditional random fields (CRF), rote learning, and rule extraction and generalization. In this thesis, we propose a learning-based named entity recognizer for Turkish texts which employs a modified version of Bayesian learning as the learning scheme. To the best of our knowledge, this is the first learning-based system that uses a Bayesian approach for NER in Turkish. Several features (such as token length, capitalization and lexical meaning) are used in the system to see the effects of different features on the NER process. We also propose a hybrid system in which the Bayesian learning-based system is used along with a rule-based recognition system. There are two versions of the hybrid system, which use the output of the rule-based recognizer in different phases. We observed an increase in F-measure values for both hybrid versions. With partial scoring active, the hybrid system reached an F-measure of 91.44%, compared with 87.43% for the rule-based system and 88.41% for the learning-based system. In the future, the hybrid system can be improved by combining the rule-based and learning-based components differently, by adding other learning approaches to the existing hybrid system, or by building the hybrid system with a completely new approach.
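To make the learning-based component more concrete, the following is a minimal sketch of a plain naive Bayes token classifier with Laplace smoothing. It is not the modified Bayesian scheme proposed in the thesis, whose details the abstract does not give. The feature set (capitalization, token length, presence of a digit), the class and function names, the label names and the toy training data are all assumptions made for illustration only.

```python
import math
from collections import Counter, defaultdict


def token_features(token):
    """Map a token to a few categorical features (illustrative choices only)."""
    return {
        "capitalized": token[:1].isupper(),
        "length_bucket": min(len(token), 10),
        "has_digit": any(ch.isdigit() for ch in token),
    }


class NaiveBayesTagger:
    """Plain naive Bayes over categorical token features (hypothetical helper)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                          # Laplace smoothing constant
        self.label_counts = Counter()               # label -> number of tokens
        self.feature_counts = defaultdict(Counter)  # (label, feature) -> value counts
        self.feature_values = defaultdict(set)      # feature -> values seen in training

    def fit(self, tokens, labels):
        for token, label in zip(tokens, labels):
            self.label_counts[label] += 1
            for name, value in token_features(token).items():
                self.feature_counts[(label, name)][value] += 1
                self.feature_values[name].add(value)
        return self

    def predict(self, token):
        total = sum(self.label_counts.values())
        features = token_features(token)
        best_label, best_score = None, -math.inf
        for label, count in self.label_counts.items():
            # log P(label) + sum of log P(feature value | label)
            score = math.log(count / total)
            for name, value in features.items():
                n_values = len(self.feature_values[name] | {value})
                numerator = self.feature_counts[(label, name)][value] + self.alpha
                denominator = count + self.alpha * n_values
                score += math.log(numerator / denominator)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data (invented for this sketch, not taken from the thesis corpus).
train_tokens = ["Ankara", "Bursa", "Trabzon", "gitti", "geldi", "ve", "2011"]
train_labels = ["LOCATION", "LOCATION", "LOCATION", "O", "O", "O", "DATE"]

tagger = NaiveBayesTagger().fit(train_tokens, train_labels)
print(tagger.predict("Adana"))
print(tagger.predict("1999"))
```

A hybrid variant along the lines described in the abstract could, for example, feed the rule-based recognizer's label to such a classifier as one more feature, or use it to override low-confidence predictions. Which phases the two thesis versions actually use is not stated in the abstract.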