An Efficient Part-of-Speech Tagger for Arabic

Kopru S.

12th Annual Conference on Intelligent Text Processing and Computational Linguistics, Tokyo, Japan, 20 - 26 February 2011, vol.6608, pp.202-213, (Full Text Paper)

Publication Type: Conference Paper / Full Text Paper
Volume: 6608
City of Publication: Tokyo
Country of Publication: Japan
Pages: pp.202-213
Affiliated with Middle East Technical University: No

Abstract

In this paper, we present an efficient part-of-speech (POS) tagger for Arabic based on a Hidden Markov Model. We explore different enhancements to improve the baseline system. Despite the morphological complexity of Arabic, our approach is data-driven and, unlike many other Arabic POS taggers, does not use a morphological analyzer or a lexicon. This makes the approach simple, very efficient, and well suited to real-life applications, while its accuracy remains comparable to that of other Arabic POS taggers. In the experiments, we also thoroughly investigate aspects of Arabic POS tagging that were not examined in detail before, including tag sets and prefix and suffix analyses. Our part-of-speech tagger achieves an accuracy of 95.57% on a standard tagset for Arabic. A detailed error analysis is provided for a better evaluation of the system. We also applied the same approach to other languages, such as Farsi and German, to show that it is language independent. Accuracy rates for these languages are also provided.
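
For readers unfamiliar with the general technique the abstract names, the following is a minimal sketch of a bigram Hidden Markov Model tagger with Viterbi decoding. It is not the paper's implementation: the function names (train, log_prob, viterbi), the add-one smoothing, the treatment of unknown words, and the toy English corpus are all assumptions made for illustration, and the prefix and suffix analyses mentioned in the abstract are not modeled here.

```python
import math
from collections import defaultdict


def train(tagged_sentences):
    """Count tag-transition and word-emission frequencies (hypothetical helper)."""
    trans = defaultdict(lambda: defaultdict(int))  # trans[prev_tag][tag]
    emit = defaultdict(lambda: defaultdict(int))   # emit[tag][word]
    vocab = set()
    for sent in tagged_sentences:
        prev = "<s>"
        for word, tag in sent:
            trans[prev][tag] += 1
            emit[tag][word] += 1
            vocab.add(word)
            prev = tag
        trans[prev]["</s>"] += 1
    return trans, emit, vocab


def log_prob(table, given, outcome, n_outcomes):
    """Add-one smoothed log P(outcome | given); smoothing choice is an assumption."""
    row = table.get(given, {})
    total = sum(row.values())
    return math.log((row.get(outcome, 0) + 1) / (total + n_outcomes))


def viterbi(words, trans, emit, vocab):
    """Return the most probable tag sequence for words under the bigram HMM."""
    if not words:
        return []
    tags = sorted(emit)
    n_tags = len(tags) + 1    # all tags plus the end-of-sentence marker
    n_words = len(vocab) + 1  # known words plus one slot for unknown words
    # best[i][t] = (log score of the best path ending in tag t at i, previous tag)
    best = [{}]
    for t in tags:
        score = log_prob(trans, "<s>", t, n_tags) + log_prob(emit, t, words[0], n_words)
        best[0][t] = (score, None)
    for i in range(1, len(words)):
        best.append({})
        for t in tags:
            e = log_prob(emit, t, words[i], n_words)
            best[i][t] = max(
                (best[i - 1][p][0] + log_prob(trans, p, t, n_tags) + e, p)
                for p in tags
            )
    # Pick the best final tag, including the transition to the end marker.
    last = max(tags, key=lambda t: best[-1][t][0] + log_prob(trans, t, "</s>", n_tags))
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        path.append(best[i][path[-1]][1])  # follow back-pointers
    return path[::-1]


# Toy corpus, invented for illustration only.
corpus = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("a", "DET"), ("dog", "NOUN"), ("sleeps", "VERB")],
]
trans, emit, vocab = train(corpus)
print(viterbi(["the", "cat", "barks"], trans, emit, vocab))
```

Because the model is estimated purely from tagged text, nothing in the sketch depends on a morphological analyzer or lexicon, which is the property the abstract emphasizes. A real system would need a far larger corpus and a more careful unknown-word model than the single smoothed slot used here.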