Power of frequencies: N-grams and semi-supervised morphological segmentation in Turkish

Thesis Type: Doctorate

Institution Of The Thesis: Orta Doğu Teknik Üniversitesi, Graduate School of Informatics, Cognitive Science, Turkey

Approval Date: 2013




Turkish is an agglutinating language with flexible word order. To communicate in Turkish, speakers must segment the internal structure of words, because Turkish morphosyntax is intricate and plays a central role in semantic analysis. Identifying a sub-word unit amounts to performing a morph segmentation task, which children accomplish with astonishing success. In this study, morph segmentation of Turkish words was demonstrated with a semi-supervised Hidden Markov Model, which emphasized the power of frequencies and sequences as direct (or indirect negative) evidence for language acquisition. Trained on the METU Corpus and the METU-Sabancı Turkish Treebank, the method achieved a precision of .88, a recall of .92 and an f-score of .90. Additionally, statistical approaches were offered for compound word recognition and segmentation. To corroborate the use of frequencies in cognitive studies, experimental studies and corresponding statistical models of Turkish emphatic reduplication and of the acceptability of nonce words were also presented. This study shows that, since the probability mass in child-directed speech is skewed toward possible word forms and unlikely morph sequences, this mass can be used by various models to mimic human-level linguistic capabilities. Furthermore, human beings have a statistical learning ability that is not specific to the faculty of language, as nativists claim it to be, but belongs to general cognition. This justifies the use of computational and statistical models to analyze language, and such predictive models can lead to a deeper understanding of language.
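The abstract does not say how the reported precision, recall and f-score were computed. As a minimal sketch, assuming boundary-level scoring (a common convention in morph segmentation, not confirmed by the source), the measures could be calculated as follows. The function names and the example word are hypothetical and chosen only for illustration.

```python
from typing import List, Set, Tuple


def boundaries(morphs: List[str]) -> Set[int]:
    """Return the character offsets at which a segmented word is split."""
    positions = set()
    offset = 0
    for morph in morphs[:-1]:  # no boundary after the final morph
        offset += len(morph)
        positions.add(offset)
    return positions


def f_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (balanced f-score)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def boundary_prf(
    gold: List[List[str]], predicted: List[List[str]]
) -> Tuple[float, float, float]:
    """Score predicted segmentations against gold ones, boundary by boundary.

    Assumption: gold[i] and predicted[i] segment the same word.
    """
    true_pos = n_pred = n_gold = 0
    for gold_morphs, pred_morphs in zip(gold, predicted):
        gold_b = boundaries(gold_morphs)
        pred_b = boundaries(pred_morphs)
        true_pos += len(gold_b & pred_b)
        n_pred += len(pred_b)
        n_gold += len(gold_b)
    precision = true_pos / n_pred if n_pred else 0.0
    recall = true_pos / n_gold if n_gold else 0.0
    return precision, recall, f_score(precision, recall)


# Illustrative example: "evlerde" ("in the houses") = ev + ler + de.
gold = [["ev", "ler", "de"]]
predicted = [["ev", "lerde"]]  # hypothetical output that misses one boundary
print(boundary_prf(gold, predicted))
```

Under the balanced f-score used in this sketch, a precision of .88 and a recall of .92 give a value of about .8996, which rounds to the reported .90.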