A hybrid method for toponym recognition on informal Turkish text


Thesis Type: Master's

Institution Where the Thesis Was Conducted: Middle East Technical University, Faculty of Engineering, Department of Computer Engineering, Turkey

Thesis Approval Date: 2014

Student: MERYEM KILINÇ

Advisor: PINAR KARAGÖZ

Abstract:

As Internet access becomes easier and people grow more willing to share information online than previous generations were, the data available on such sources grows rapidly day by day. Moreover, because these sources are so popular and widely used, the information that researchers and organizations are interested in can be found somewhere in these data collections. The purpose of Information Extraction (IE) is to analyze this mass of information and extract the desired data from it. This study designs a system for a subfield of Information Extraction, namely Named Entity Recognition (NER), which many IE systems use as a basis. NER identifies the entities related to the desired information in texts and classifies them into a set of predefined categories such as person, location, and organization names, date and money expressions, etc. Because much of the desired information, such as trends, agendas, needs, and opinions of people, may vary by location, and because one location name can refer to more than one place, extracting location information is a research area in its own right. The field devoted to this purpose, Toponym Extraction, uses NER as a basic step to recognize location names. Toponym Extraction consists of two steps: Toponym Recognition and Toponym Resolution. The first step, Toponym Recognition, is the subject of this study. It aims to extract named entities that refer to locations, whereas Toponym Resolution aims to decide which geographical coordinates an entity refers to, since one location name can correspond to more than one set of geographical coordinates.
The prominence of social media such as Twitter and Facebook has drawn attention from companies and researchers interested in detecting trends; however, the informal and popular nature of these services leads to a large number of misspellings, missing punctuation, non-standard abbreviations, and irregular capitalization, which make the recognition process very hard. This creates a new challenge in the NER field and, consequently, in Toponym Recognition. The system proposed in this thesis is a hybrid NER system that uses both rule-based and machine-learning-based techniques to extract toponyms from informally written, unstructured text consisting of Turkish tweets. In this study, Conditional Random Fields (CRF) are used as the machine learning tool, and features such as POS tags and a Conjunction Window are defined to train the CRF model. The rule-based part uses regular expressions that define rules for extracting words containing "köy", "deniz", "şehir", "istan", etc. The result of the rule-based part is used as a feature in the machine learning part. All defined features are tested both interchangeably and incrementally. In addition, various learning mechanisms within CRF are compared in terms of their accuracy. Finally, the study shows the effect of the sizes of the training and test data sets on system accuracy. All of these parameters are tested, and the combination giving the best result is used in the comparison part, in which the system is compared with some previous studies.
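
To make the hybrid idea concrete, the following is a minimal Python sketch of how a regular-expression rule could be turned into a per-token feature for a sequence model such as a CRF. It is an illustration only, not the thesis implementation: the cue list contains only the four substrings quoted in the abstract, and the names `TOPONYM_CUES`, `rule_flag`, and `token_features`, as well as the neighbouring-token features, are assumptions introduced here. The actual rule set, the POS-tag features, and the Conjunction Window feature of the thesis are not reproduced.

```python
import re

# Substrings quoted in the abstract. The full rule set of the thesis is not
# given there, so this list is an assumption made for illustration.
TOPONYM_CUES = ("köy", "deniz", "şehir", "istan")

# One case-insensitive pattern that matches any cue anywhere inside a token.
CUE_PATTERN = re.compile(
    "|".join(re.escape(cue) for cue in TOPONYM_CUES),
    re.IGNORECASE,
)


def rule_flag(token):
    """Return True if the token contains one of the toponym cue substrings."""
    return CUE_PATTERN.search(token) is not None


def token_features(tokens, i):
    """Build a feature dictionary for tokens[i] (hypothetical feature set).

    The rule-based result is exposed as the 'rule_match' feature, so a
    sequence model can weigh it together with the other features.
    """
    token = tokens[i]
    features = {
        "lower": token.lower(),
        "is_capitalized": token[:1].isupper(),
        "rule_match": rule_flag(token),
    }
    # Neighbouring tokens as simple context features (an assumption; this is
    # not the Conjunction Window feature described in the thesis).
    if i > 0:
        features["prev_lower"] = tokens[i - 1].lower()
    if i < len(tokens) - 1:
        features["next_lower"] = tokens[i + 1].lower()
    return features


if __name__ == "__main__":
    tweet = ["bugün", "Eskişehir", "yağmurlu"]
    for index in range(len(tweet)):
        print(token_features(tweet, index))
```

Feature dictionaries of this shape can be passed, one list per tweet, to a CRF toolkit that accepts dictionary features. Because a substring rule like this also fires on ordinary words that happen to contain a cue, using it as one feature among several lets the learned model decide how much to trust it. A real system for Turkish would also need careful handling of the dotted and dotless "I" when normalizing case.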