Using Zipf frequencies as a representativeness measure in statistical active learning of natural language


Thesis Type: Master's

Institution: Middle East Technical University, Graduate School of Informatics, Department of Cognitive Science, Turkey

Thesis Approval Date: 2008

Student: ONUR ÇOBANOĞLU

Advisor: HÜSEYİN CEM BOZŞAHİN

Abstract:

Active learning has proven to be a successful strategy for the rapid development of corpora used in the statistical induction of natural language. The vast majority of studies in this field have concentrated on finding and testing informativeness measures for samples, whereas representativeness measures for samples have not been studied thoroughly. In this thesis, we introduce a novel representativeness measure that, being based on Zipf's law, is model-independent and is validated both theoretically and empirically. Experiments conducted on the WSJ corpus with a wide-coverage parser show that our representativeness measure leads to better performance than previously introduced representativeness measures when used with most of the known informativeness measures.