Diacritics correction in Turkish with context-aware sequence to sequence modeling

Köksal, Asiye; Bozal, Özge; Özge, Umut

doi:10.55730/1300-0632.3948

Turkish Journal of Electrical Engineering and Computer Sciences, vol.30, no.6, pp.2433-2445, 2022 (SCI-Expanded, Scopus, TRDizin)

Publication Type: Article / Full Article
Volume: 30 Issue: 6
Publication Date: 2022
DOI: 10.55730/1300-0632.3948
Journal Name: Turkish Journal of Electrical Engineering and Computer Sciences
Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus, Academic Search Premier, Applied Science & Technology Source, Compendex, Computer & Applied Sciences, INSPEC, TR DİZİN (ULAKBİM)
Page Numbers: pp.2433-2445
Keywords: Natural language processing, diacritics restoration, diacritics correction, sequence to sequence learning, LSTM, RESTORATION
Open Archive Collection: AVESİS Open Access Collection
Affiliated with Middle East Technical University: Yes

Abstract

© TÜBİTAK. Digital texts in many languages contain missing or misused diacritics, which makes it hard for natural language processing applications to disambiguate word meanings. Diacritics restoration is therefore a crucial step in natural language processing for many languages. In this study, we approach the problem as a bidirectional transformation between diacritical letters and their ASCII counterparts, rather than as unidirectional diacritics restoration. We propose a context-aware, character-level sequence-to-sequence model for this transformation. The model is language independent in the sense that it requires no language-specific feature extraction beyond the use of word embeddings, and it is directly applicable to other languages. We trained the model for the Turkish diacritics correction task and evaluated it on a benchmark dataset of Turkish tweets. Our best setting for the proposed model improves on the state-of-the-art results in terms of F1 score by 4.7% on ambiguous words and by 1.24% over all cases.
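
To make the ambiguity mentioned in the abstract concrete, the following minimal Python sketch maps Turkish diacritical letters to their ASCII counterparts and enumerates the forms an ASCII word could stand for. It is not the authors' sequence-to-sequence model; the names `asciify` and `candidate_forms` are hypothetical and chosen only for illustration. It only shows why context is needed to pick among the candidates.

```python
from itertools import product

# Turkish diacritical letters and their ASCII counterparts (lower and upper case).
TO_ASCII = str.maketrans("çğıöşüÇĞİÖŞÜ", "cgiosuCGIOSU")

# Reverse direction: each ASCII letter may stand for itself or its diacritical twin.
CANDIDATES = {
    "c": "cç", "g": "gğ", "i": "iı", "o": "oö", "s": "sş", "u": "uü",
    "C": "CÇ", "G": "GĞ", "I": "Iİ", "O": "OÖ", "S": "SŞ", "U": "UÜ",
}


def asciify(text):
    """Replace every Turkish diacritical letter with its ASCII counterpart."""
    return text.translate(TO_ASCII)


def candidate_forms(word):
    """List every spelling that shares the same ASCII form as `word`.

    The number of candidates doubles with each ambiguous letter, so this
    brute-force enumeration is only practical for short words.
    """
    options = [CANDIDATES.get(ch, ch) for ch in asciify(word)]
    return ["".join(chars) for chars in product(*options)]


if __name__ == "__main__":
    print(asciify("çağrı"))
    # "oldu" (it happened) and "öldü" (he/she died) share one ASCII form.
    print(candidate_forms("oldu"))
```

Because `candidate_forms` first asciifies its input, it treats a word with misused diacritics the same way as one with missing diacritics, which loosely mirrors the bidirectional framing described above. Choosing the correct candidate is the part that requires a context-aware model.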