Natural Language Processing for the Turkish Academic Texts in the Engineering Field: Key-Term Extraction, Similarity Detection, Subject/Topic Assignment

19th IFIP WG 12.5 International Conference on Artificial Intelligence Applications and Innovations, AIAI 2023, Leon, Spain, 14-17 June 2023, vol. 676 IFIP, pp. 411-424

Publication Type: Conference Paper / Full-Text Paper
Volume Number: 676 IFIP
DOI Number: 10.1007/978-3-031-34107-6_33
City of Publication: Leon
Country of Publication: Spain
Pages: pp. 411-424
Keywords: Conceptual similarity, Feature extraction, Key term extraction, Natural language processing (NLP), Naïve Bayes classifier, Subject/topic assignment, Supervised machine learning, TÜBİTAK
Affiliated with Middle East Technical University: Yes

Abstract

Information retrieved from texts plays a crucial role in many respects. Although there have been significant efforts in natural language processing for various types of Turkish texts, none of them deals with academic texts. This study mainly aims to extract precise key terms from Turkish academic texts in the field of engineering and develops algorithms for similarity detection and automatic classification based on these key terms. In the first step, a library and customized templates that transform n-grams into structured forms are created by considering the features of engineering terminology and the grammar of the Turkish language. Then, a customized similarity detection algorithm is developed. Finally, a Naïve Bayes classifier is used to assign documents to the appropriate engineering sub-fields. Project proposals submitted to the Academic Research Funding Program Directorate (ARDEB) of the Scientific and Technological Research Council of Turkey (TÜBİTAK) are analyzed as a case study. The results indicate that the proposed similarity algorithm correctly detects almost all of the re-submitted proposals, while the accuracy of the classifier is 83.3% for the first prediction and reaches 96.4% within the first three predictions over a sample of 1255 proposals.
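The pipeline outlined in the abstract (n-gram key terms, similarity detection, Naïve Bayes classification with first-prediction and first-three-predictions accuracy) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the abstract does not describe the paper's term library, templates, or customized similarity algorithm, so plain n-grams and Jaccard overlap stand in for them, and the names `ngrams`, `jaccard`, `TermNaiveBayes`, and `top_k_accuracy` are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def jaccard(terms_a, terms_b):
    """Jaccard overlap of two key-term collections.

    Assumption: a stand-in for the paper's customized similarity algorithm,
    which the abstract does not specify.
    """
    a, b = set(terms_a), set(terms_b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


class TermNaiveBayes:
    """Multinomial Naive Bayes over key terms with add-one smoothing."""

    def fit(self, docs, labels):
        # docs: list of key-term lists; labels: one sub-field per document.
        self.class_docs = Counter(labels)
        self.total_docs = len(labels)
        self.term_counts = {c: Counter() for c in self.class_docs}
        for terms, label in zip(docs, labels):
            self.term_counts[label].update(terms)
        self.vocab = {t for counts in self.term_counts.values() for t in counts}
        return self

    def rank(self, terms, k=3):
        """Return the k most probable classes, best first."""
        scores = {}
        for c, n_docs in self.class_docs.items():
            total = sum(self.term_counts[c].values())
            score = math.log(n_docs / self.total_docs)  # log prior
            for t in terms:
                # Add-one smoothed log likelihood of term t in class c.
                score += math.log(
                    (self.term_counts[c][t] + 1) / (total + len(self.vocab))
                )
            scores[c] = score
        return sorted(scores, key=scores.get, reverse=True)[:k]


def top_k_accuracy(model, docs, labels, k):
    """Fraction of documents whose true label is among the top k predictions."""
    hits = sum(label in model.rank(terms, k) for terms, label in zip(docs, labels))
    return hits / len(labels)
```

With a trained model, `top_k_accuracy(model, docs, labels, k=1)` and `top_k_accuracy(model, docs, labels, k=3)` correspond to the two kinds of accuracy figure quoted in the abstract. The sketch does not reproduce the reported numbers, which depend on the TÜBİTAK proposal data and the authors' own term extraction.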