Enhancing Address Data Integrity using Transformer-Based Language Models
(Turkish title: Dönüştürücü Tabanlı Dil Modelleri Kullanarak Adres Veri Bütünlüğünün Geliştirilmesi)


Kürklü Ö. F., AKAGÜNDÜZ E.

32nd IEEE Conference on Signal Processing and Communications Applications, SIU 2024, Mersin, Türkiye, 15-18 May 2024

  • Publication Type: Conference Paper / Full-Text Paper
  • DOI: 10.1109/siu61531.2024.10601149
  • City of Publication: Mersin
  • Country of Publication: Türkiye
  • Keywords: Address Standardization, Fine-Tuning, Synthetic Data, Transformers, Turkish Address Dataset
  • Affiliated with Middle East Technical University: Yes

Abstract

Address data integrity is critical in numerous applications, yet address data is often plagued by inaccuracies and inconsistencies, particularly in non-standardized formats. This study explores a novel application of transformer-based language models, traditionally used in language translation tasks, to the standardization and correction of Turkish address data. Leveraging the capabilities of Mixtral-8x7B, a state-of-the-art large language model, this research introduces a unique, handcrafted dataset of Turkish addresses. The dataset is derived from the National Address Dataset and enriched through ChatGPT-4 to simulate human-like input errors. It was then used to fine-tune both TowerInstruct and T5 models, transforming them into tools capable of converting faulty, error-laden address lines into standardized, structured, and corrected formats.