32nd IEEE Conference on Signal Processing and Communications Applications, SIU 2024, Mersin, Türkiye, 15-18 May 2024
Address data integrity is critical in numerous applications, yet address data is often plagued with inaccuracies and inconsistencies, particularly in non-standardized formats. This study explores a novel application of transformer-based language models, traditionally used in language translation tasks, to the standardization and correction of Turkish address data. Leveraging the capabilities of Mixtral-8x7B, a state-of-the-art large language model, this research introduces a unique, handcrafted dataset of Turkish addresses. The dataset was derived from the National Address Dataset and enriched through ChatGPT-4 to simulate human-like input errors. It was then used to fine-tune both TowerInstruct and T5 models, transforming them into tools capable of converting faulty, error-laden address lines into standardized, structured, and corrected formats.
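To make the data preparation idea concrete, the following is a minimal sketch of how noisy/clean training pairs for a sequence-to-sequence model might be assembled. It is not the authors' pipeline: the study used ChatGPT-4 to generate human-like errors, whereas this sketch substitutes simple rule-based noise. The abbreviation table, the function names `corrupt` and `make_pairs`, the probabilities, and the sample address are all hypothetical and chosen only for illustration.

```python
import random

# Hypothetical abbreviation table for illustration; not taken from the study.
ABBREVIATIONS = {
    "Mahallesi": "Mah.",
    "Caddesi": "Cad.",
    "Sokak": "Sk.",
    "Numara": "No",
}


def corrupt(address: str, rng: random.Random) -> str:
    """Return a noisy copy of a clean address line (rule-based stand-in)."""
    words = []
    for word in address.split():
        if word in ABBREVIATIONS and rng.random() < 0.5:
            # Abbreviate known address terms about half the time.
            word = ABBREVIATIONS[word]
        elif len(word) > 3 and rng.random() < 0.3:
            # Drop one character to imitate a typing error.
            i = rng.randrange(len(word))
            word = word[:i] + word[i + 1:]
        # Occasionally lowercase the word to imitate inconsistent casing.
        words.append(word.lower() if rng.random() < 0.2 else word)
    return " ".join(words)


def make_pairs(clean_addresses, seed=0):
    """Build (noisy input, clean target) pairs for seq2seq fine-tuning."""
    rng = random.Random(seed)
    return [{"input": corrupt(a, rng), "target": a} for a in clean_addresses]


# Made-up sample address, used only to exercise the functions.
SAMPLE = ["Atatürk Mahallesi Cumhuriyet Caddesi Numara 12 Mersin"]
pairs = make_pairs(SAMPLE)
```

Each resulting pair keeps the clean address as the target and the corrupted line as the input, which is the general shape a model such as T5 would be fine-tuned on. A fixed seed keeps the generated noise reproducible across runs.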