Impact of Tokenization on Language Models: An Analysis for Turkish

Toraman, Çağrı; Yilmaz, Eyup; Sahinuc, Furkan; Ozcelik, Oguzhan

doi:10.1145/3578707

Toraman Ç., Yilmaz E. H., Sahinuc F., Ozcelik O.

ACM TRANSACTIONS ON ASIAN AND LOW-RESOURCE LANGUAGE INFORMATION PROCESSING, vol.22, no.4, 2023 (SCI-Expanded, Scopus)

Publication Type: Article / Full Article
Volume: 22 Issue: 4
Publication Date: 2023
DOI: 10.1145/3578707
Journal Name: ACM TRANSACTIONS ON ASIAN AND LOW-RESOURCE LANGUAGE INFORMATION PROCESSING
Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus
Affiliated with Middle East Technical University: No

Abstract

Tokenization is an important text preprocessing step to prepare input tokens for deep language models. WordPiece and BPE are de facto methods employed by important models, such as BERT and GPT. However, the impact of tokenization can be different for morphologically rich languages, such as Turkic languages, in which many words can be generated by adding prefixes and suffixes. We compare five tokenizers at different granularity levels, that is, their outputs vary from the smallest pieces of characters to the surface form of words, including a Morphological-level tokenizer. We train these tokenizers and pretrain medium-sized language models using the RoBERTa pretraining procedure on the Turkish split of the OSCAR corpus. We then fine-tune our models on six downstream tasks. Our experiments, supported by statistical tests, reveal that the Morphological-level tokenizer performs competitively with de facto tokenizers. Furthermore, we find that increasing the vocabulary size improves the performance of Morphological- and Word-level tokenizers more than that of de facto tokenizers. The ratio of the number of vocabulary parameters to the total number of model parameters can be empirically chosen as 20% for de facto tokenizers and 40% for other tokenizers to obtain a reasonable trade-off between model size and performance.
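
The vocabulary-parameter ratio mentioned at the end of the abstract can be illustrated with a little arithmetic. The sketch below is not taken from the paper: it assumes that "vocabulary parameters" means the token-embedding matrix (vocabulary size times hidden size) and that all remaining parameters are counted once. The function names, the hidden size of 512, and the 40 million non-vocabulary parameters are hypothetical values chosen only for illustration.

```python
def vocab_param_ratio(vocab_size: int, hidden_size: int, other_params: int) -> float:
    """Fraction of all parameters held by the token-embedding matrix.

    Assumption: vocabulary parameters = vocab_size * hidden_size.
    """
    vocab_params = vocab_size * hidden_size
    return vocab_params / (vocab_params + other_params)


def vocab_size_for_ratio(target_ratio: float, hidden_size: int, other_params: int) -> int:
    """Vocabulary size whose embedding matrix is about `target_ratio` of the model."""
    if not 0.0 < target_ratio < 1.0:
        raise ValueError("target_ratio must be strictly between 0 and 1")
    # Solve V*h / (V*h + N) = r for V, giving V = r*N / ((1 - r)*h).
    return round(target_ratio * other_params / ((1.0 - target_ratio) * hidden_size))


# Hypothetical model dimensions, not the configuration used in the paper.
HIDDEN_SIZE = 512
OTHER_PARAMS = 40_000_000

for ratio in (0.20, 0.40):
    size = vocab_size_for_ratio(ratio, HIDDEN_SIZE, OTHER_PARAMS)
    achieved = vocab_param_ratio(size, HIDDEN_SIZE, OTHER_PARAMS)
    print(f"target {ratio:.0%}: vocabulary size {size}, achieved ratio {achieved:.3f}")
```

Under these assumptions, raising the target ratio from 20% to 40% calls for a noticeably larger vocabulary at a fixed hidden size, which is consistent with the abstract's observation that Morphological- and Word-level tokenizers benefit more from larger vocabularies.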