Enhancing Address Data Integrity using Transformer-Based Language Models (Turkish title: Dönüştürücü Tabanlı Dil Modelleri Kullanarak Adres Veri Bütünlüğünün Geliştirilmesi)


Kürklü Ö. F., AKAGÜNDÜZ E.

32nd IEEE Conference on Signal Processing and Communications Applications, SIU 2024, Mersin, Turkey, 15 - 18 May 2024

  • Publication Type: Conference Paper / Full Text
  • Doi Number: 10.1109/siu61531.2024.10601149
  • City: Mersin
  • Country: Turkey
  • Keywords: Address Standardization, Fine-Tuning, Synthetic Data, Transformers, Turkish Address Dataset
  • Middle East Technical University Affiliated: Yes

Abstract

Address data integrity is critical in numerous applications, yet address data is often plagued by inaccuracies and inconsistencies, particularly when it is stored in non-standardized formats. This study explores a novel application of transformer-based language models, traditionally utilized in language translation tasks, to the standardization and correction of Turkish address data. Leveraging the capabilities of Mixtral-8x7B, a state-of-the-art large language model, this research introduces a unique, handcrafted dataset of Turkish addresses. The dataset is derived from the National Address Dataset and enriched through ChatGPT-4 to simulate human-like input errors. It was later used to fine-tune both TowerInstruct and T5 models, transforming them into tools capable of converting faulty, error-laden address lines into standardized, structured, and corrected formats.
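
As a rough illustration of the data-preparation idea in the abstract, the sketch below builds (noisy input, clean target) pairs of the kind a sequence-to-sequence model could be fine-tuned on. It is only an assumption-laden stand-in: the abstract says the errors were simulated with ChatGPT-4, whereas this sketch uses simple random character edits, and the function names `corrupt_address` and `make_pairs`, the `error_rate` parameter, and the sample address are all hypothetical and not taken from the paper.

```python
import random


def corrupt_address(address: str, rng: random.Random, error_rate: float = 0.1) -> str:
    """Inject simple character-level typos into an address string.

    Hypothetical stand-in for the LLM-generated, human-like errors
    described in the abstract; not the authors' method.
    """
    chars = list(address)
    out = []
    i = 0
    while i < len(chars):
        c = chars[i]
        # Only letters trigger an edit; digits and punctuation never do.
        if c.isalpha() and rng.random() < error_rate:
            op = rng.choice(("drop", "swap", "dup"))
            if op == "drop":
                # Omit this letter entirely.
                i += 1
                continue
            if op == "swap" and i + 1 < len(chars):
                # Transpose this letter with the character that follows it.
                out.append(chars[i + 1])
                out.append(c)
                i += 2
                continue
            # Duplicate the letter (also the fallback for a swap on the last character).
            out.append(c)
            out.append(c)
            i += 1
            continue
        out.append(c)
        i += 1
    return "".join(out)


def make_pairs(clean_addresses, seed: int = 0, error_rate: float = 0.1):
    """Return a list of {"input": noisy, "target": clean} training records."""
    rng = random.Random(seed)
    return [
        {"input": corrupt_address(address, rng, error_rate), "target": address}
        for address in clean_addresses
    ]


if __name__ == "__main__":
    # Made-up example address, used only to show the record shape.
    sample = ["Atatürk Mahallesi, Cumhuriyet Caddesi No: 12, Çankaya, Ankara"]
    for record in make_pairs(sample, seed=42, error_rate=0.15):
        print(record["input"], "->", record["target"])
```

Records in this shape could then be passed to a standard text-to-text fine-tuning pipeline, with the noisy line as the source sequence and the standardized address as the target.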