Natural Language Processing for the Turkish Academic Texts in the Engineering Field: Key-Term Extraction, Similarity Detection, Subject/Topic Assignment


19th IFIP WG 12.5 International Conference on Artificial Intelligence Applications and Innovations, AIAI 2023, Leon, Spain, 14 - 17 June 2023, vol.676 IFIP, pp.411-424

  • Publication Type: Conference Paper / Full Text
  • Volume: 676 IFIP
  • DOI Number: 10.1007/978-3-031-34107-6_33
  • City: Leon
  • Country: Spain
  • Page Numbers: pp.411-424
  • Keywords: Conceptual similarity, Feature extraction, Key term extraction, Natural language processing (NLP), Naïve Bayes classifier, subject/topic assignment, Supervised machine learning, TÜBİTAK
  • Middle East Technical University Affiliated: Yes


Information retrieved from texts plays a crucial role in many applications. Although there have been significant efforts in natural language processing for various types of Turkish texts, none of them deals with academic texts. This study aims primarily to extract precise key terms from Turkish academic texts in the field of engineering and to develop algorithms for similarity detection and automatic classification based on these key terms. First, a library and customized templates that transform n-grams into structured forms are created, taking into account the features of engineering terminology and the grammar of the Turkish language. Then, a customized similarity detection algorithm is developed. Finally, a Naïve Bayes classifier is used to assign documents to the appropriate engineering sub-fields. Project proposals submitted to the Academic Research Funding Program Directorate (ARDEB) of the Scientific and Technological Research Council of Turkey (TÜBİTAK) are analyzed as a case study. The results indicate that the proposed similarity algorithm correctly detects almost all re-submitted proposals, while the accuracy of the classifier is 83.3% for the first prediction and reaches 96.4% within the first three predictions, over a sample of 1255 proposals.
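
The pipeline outlined in the abstract (n-gram candidates, key-term similarity, Naïve Bayes assignment with ranked predictions) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the customized templates or the similarity algorithm, so plain word n-grams and Jaccard overlap are used here as stand-ins, and all names (`ngrams`, `jaccard`, `NaiveBayesTopK`) and the toy English key terms are hypothetical. The `top_k` method shows what "accuracy in the first three predictions" plausibly refers to: a document counts as correctly classified if its true sub-field appears among the three highest-scoring classes.

```python
import math
from collections import Counter, defaultdict


def ngrams(tokens, n_max=3):
    """Return all word n-grams of length 1..n_max as space-joined strings."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(1, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]


def jaccard(terms_a, terms_b):
    """Jaccard overlap of two key-term collections (assumed stand-in measure)."""
    a, b = set(terms_a), set(terms_b)
    return len(a & b) / len(a | b) if (a or b) else 0.0


class NaiveBayesTopK:
    """Multinomial Naive Bayes over key terms with add-one smoothing."""

    def fit(self, docs, labels):
        self.term_counts = defaultdict(Counter)  # per-class key-term counts
        self.class_counts = Counter(labels)      # documents per class
        self.vocab = set()
        for terms, label in zip(docs, labels):
            self.term_counts[label].update(terms)
            self.vocab.update(terms)
        return self

    def top_k(self, terms, k=3):
        """Return the k most probable classes, best first."""
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocab)
        scores = {}
        for label, n_docs in self.class_counts.items():
            counts = self.term_counts[label]
            denom = sum(counts.values()) + vocab_size
            score = math.log(n_docs / total_docs)  # log prior
            for term in terms:
                if term in self.vocab:  # skip terms never seen in training
                    score += math.log((counts[term] + 1) / denom)
            scores[label] = score
        return sorted(scores, key=scores.get, reverse=True)[:k]


# Toy data: hypothetical key terms, not taken from the TÜBİTAK proposals.
train_docs = [
    ["beam", "concrete"],
    ["concrete", "load"],
    ["circuit", "voltage"],
    ["voltage", "antenna"],
]
train_labels = ["civil", "civil", "electrical", "electrical"]

model = NaiveBayesTopK().fit(train_docs, train_labels)
print(model.top_k(["concrete", "beam"], k=3))
print(jaccard(train_docs[0], train_docs[1]))
```

In such a setup, a re-submitted proposal would show a key-term similarity close to 1 with an earlier proposal; the threshold at which a pair is flagged is a design choice that the abstract does not state.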