SALDIRAY: Scalable and Adaptive Language Diagnostics for Remediation of Anomalies and Typos

Külah, Emre; Çetinkaya, Yusuf; Alemdar, Hande

doi:10.1109/access.2026.3693355

Külah E., Çetinkaya Y. M., Alemdar H.

IEEE Access, vol. 14, pp. 1-20, 2026 (SCI-Expanded, Scopus)

Publication Type: Article / Full Article
Volume: 14
Publication Date: 2026
DOI: 10.1109/access.2026.3693355
Journal Name: IEEE Access
Journal Indexes: Scopus, Science Citation Index Expanded (SCI-EXPANDED), Compendex, INSPEC, Directory of Open Access Journals
Pages: pp. 1-20
Affiliated with Middle East Technical University: Yes

Abstract

Natural language processing (NLP) systems working with real-world user data often encounter noisy, informal text, including typos, spelling variations, slang, and homophone substitutions. This noise can cause a severe drop in the performance of downstream models. This paper presents SALDIRAY (Scalable and Adaptive Language Diagnostics for Remediation of Anomalies and Typos), an integrated, domain-independent framework for text normalization that aims to boost the resilience of NLP systems across various model architectures and application types. It converts short, poorly formed text into standardized language through a synthetic noise modeling and standardization approach, instantiated in practice as a noise generation and normalization pipeline. This approach is enriched by incorporating curated slang and homophone lists to ensure linguistic realism. The process creates large amounts of paired noisy–clean data, which are used to fine-tune robust normalization models. Our detailed experiments consistently show that SALDIRAY improves the accuracy and stability of systems across a wide range of tasks, including traditional machine learning–based classification, sentiment analysis, and instruction-tuned large language model tasks. The framework requires no task-specific supervision, operates as a modular preprocessing layer, and can be integrated into existing NLP workflows. The experiments confirm that the proposed framework offers a scalable and adaptive method for improving robustness to noisy text, particularly under controlled corruption settings, while showing promising transfer to real-world noisy inputs.
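
The abstract describes generating paired noisy-clean data by injecting synthetic noise into clean text. The Python sketch below illustrates that general idea only; it is not the authors' implementation. The function names (`typo`, `corrupt`, `make_pairs`), the probabilities, the three character-level edits, and the tiny `SLANG` and `HOMOPHONES` tables are all hypothetical placeholders, since the record does not specify the paper's actual noise operations or curated lists.

```python
import random

# Hypothetical, tiny lookup tables for illustration only; the paper's
# curated slang and homophone lists are not reproduced here.
SLANG = {"you": "u", "please": "pls", "thanks": "thx"}
HOMOPHONES = {"there": "their", "to": "too", "your": "you're"}


def typo(word, rng):
    """Apply one random character-level edit: swap, delete, or duplicate."""
    if len(word) < 2:
        return word
    i = rng.randrange(len(word) - 1)
    op = rng.choice(("swap", "delete", "duplicate"))
    if op == "swap":
        # Transpose the characters at positions i and i + 1.
        return word[:i] + word[i + 1] + word[i] + word[i + 2:]
    if op == "delete":
        # Drop the character at position i.
        return word[:i] + word[i + 1:]
    # Repeat the character at position i.
    return word[:i] + word[i] + word[i:]


def corrupt(sentence, rng, p_lexical=0.5, p_typo=0.2):
    """Return a noisy copy of a whitespace-tokenized sentence."""
    out = []
    for word in sentence.split():
        key = word.lower()
        if key in SLANG and rng.random() < p_lexical:
            word = SLANG[key]          # slang substitution
        elif key in HOMOPHONES and rng.random() < p_lexical:
            word = HOMOPHONES[key]     # homophone substitution
        elif rng.random() < p_typo:
            word = typo(word, rng)     # character-level typo
        out.append(word)
    return " ".join(out)


def make_pairs(clean_sentences, n_variants=3, seed=0, **kwargs):
    """Build (noisy, clean) pairs, n_variants noisy copies per clean sentence."""
    rng = random.Random(seed)
    return [
        (corrupt(s, rng, **kwargs), s)
        for s in clean_sentences
        for _ in range(n_variants)
    ]
```

Pairs produced this way could serve as (input, target) examples for fine-tuning a sequence-to-sequence normalizer; the fixed seed keeps the generated corpus reproducible across runs.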