Enhancing Address Data Integrity using Transformer-Based Language Models (Turkish title: Dönüştürücü Tabanlı Dil Modelleri Kullanarak Adres Veri Bütünlüğünün Geliştirilmesi)

Kürklü Ö. F., Akagündüz E.

32nd IEEE Conference on Signal Processing and Communications Applications, SIU 2024, Mersin, Türkiye, 15-18 May 2024 (Full-Text Paper)

Publication Type: Conference Paper / Full-Text Paper
DOI: 10.1109/siu61531.2024.10601149
City of Publication: Mersin
Country of Publication: Türkiye
Keywords: Address Standardization, Fine-Tuning, Synthetic Data, Transformers, Turkish Address Dataset
Affiliated with Middle East Technical University: Yes

Abstract

Address data integrity is critical in numerous applications, yet address data is often plagued by inaccuracies and inconsistencies, particularly in non-standardized formats. This study explores a novel application of transformer-based language models, traditionally used for language translation tasks, to the standardization and correction of Turkish address data. Leveraging the capabilities of Mixtral-8x7B, a state-of-the-art large language model, this research introduces a unique, handcrafted dataset of Turkish addresses. The dataset is derived from the National Address Dataset and enriched through ChatGPT-4 to simulate human-like input errors. It was then used to fine-tune both TowerInstruct and T5 models, transforming them into tools capable of converting faulty, error-laden address lines into standardized, structured, and corrected formats.
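The idea of pairing a corrupted address with its clean form can be illustrated with a minimal sketch. It is not the authors' pipeline: the abstract states that the errors were generated with ChatGPT-4, whereas the sketch below swaps in simple rule-based noise (character swaps, dropped characters, abbreviations) as a stand-in. The abbreviation table, the function names, and the sample address are all hypothetical and chosen only for illustration.

```python
import random

# Hypothetical abbreviation table; not taken from the paper.
ABBREVIATIONS = {"Mahallesi": "Mah.", "Caddesi": "Cad.", "Sokak": "Sk.", "Numara": "No"}


def swap_adjacent(text, rng):
    """Swap two neighbouring characters to mimic a typing slip."""
    if len(text) < 2:
        return text
    i = rng.randrange(len(text) - 1)
    return text[:i] + text[i + 1] + text[i] + text[i + 2:]


def drop_char(text, rng):
    """Delete one character to mimic a missed keystroke."""
    if len(text) < 2:
        return text
    i = rng.randrange(len(text))
    return text[:i] + text[i + 1:]


def abbreviate(text, rng):
    """Replace the first occurrence of one known word with its short form."""
    present = [word for word in ABBREVIATIONS if word in text]
    if not present:
        return text
    word = rng.choice(present)
    return text.replace(word, ABBREVIATIONS[word], 1)


def corrupt(clean, seed=0, n_errors=2):
    """Apply n_errors randomly chosen noise operations; seeded for repeatability."""
    rng = random.Random(seed)
    noisy = clean
    for _ in range(n_errors):
        operation = rng.choice([swap_adjacent, drop_char, abbreviate])
        noisy = operation(noisy, rng)
    return noisy


def make_pair(clean, seed=0):
    """Build one (noisy input, clean target) training example."""
    return {"input": corrupt(clean, seed), "target": clean}


if __name__ == "__main__":
    # Made-up address used purely as sample input.
    sample = "Atatürk Mahallesi Cumhuriyet Caddesi Numara 12 Çankaya Ankara"
    print(make_pair(sample, seed=1))
```

Pairs of this shape (noisy input, clean target) are the usual form of training data for a sequence-to-sequence model such as T5. Rule-based noise is cheaper and reproducible, but it is likely to cover a narrower range of errors than LLM-generated corruption.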