SALDIRAY: Scalable and Adaptive Language Diagnostics for Remediation of Anomalies and Typos


Külah E., Çetinkaya Y. M., Alemdar H.

IEEE ACCESS, vol.14, pp.1-20, 2026 (SCI-Expanded, Scopus)

  • Publication Type: Article / Full Article
  • Volume: 14
  • Publication Date: 2026
  • DOI Number: 10.1109/access.2026.3693355
  • Journal Name: IEEE ACCESS
  • Journal Indexes: Scopus, Science Citation Index Expanded (SCI-EXPANDED), Compendex, INSPEC, Directory of Open Access Journals
  • Page Numbers: pp.1-20
  • Middle East Technical University Affiliated: Yes

Abstract

Natural language processing (NLP) systems working with real-world user data often encounter noisy, informal text, including typos, spelling variations, slang, and homophone substitutions. This noise can cause a severe drop in the performance of downstream models. This paper presents SALDIRAY (Scalable and Adaptive Language Diagnostics for Remediation of Anomalies and Typos), an integrated, domain-independent framework for text normalization that aims to boost the resilience of NLP systems across various model architectures and application types. It converts short, poorly formed text into standardized language through a synthetic noise modeling and standardization approach, instantiated in practice as a noise generation and normalization pipeline. This approach is enriched by incorporating curated slang and homophone lists to ensure linguistic realism. The process creates large amounts of paired noisy–clean data, which are used to fine-tune robust normalization models. Our detailed experiments consistently show that SALDIRAY improves the accuracy and stability of systems across a wide range of tasks, including traditional machine learning–based classification, sentiment analysis, and instruction-tuned large language model tasks. The framework requires no task-specific supervision, operates as a modular preprocessing layer, and can be integrated into existing NLP workflows. The experiments confirm that the proposed framework offers a scalable and adaptive method for improving robustness to noisy text, particularly under controlled corruption settings, while showing promising transfer to real-world noisy inputs.
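
The noise generation step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`typo`, `add_noise`, `make_pairs`), the corruption probability `p`, and the tiny `SLANG` and `HOMOPHONES` dictionaries are all hypothetical stand-ins chosen for illustration, and adjacent-character swapping is only one plausible typo model. The sketch shows how clean sentences might be turned into paired noisy–clean examples of the kind used to fine-tune a normalization model.

```python
import random

# Hypothetical toy lexicons; the paper's curated lists are not reproduced here.
SLANG = {"you": "u", "are": "r", "great": "gr8", "tonight": "2nite"}
HOMOPHONES = {"their": "there", "to": "too", "your": "you're"}


def typo(word, rng):
    """Swap two adjacent characters to imitate a keyboard slip."""
    if len(word) < 2:
        return word
    i = rng.randrange(len(word) - 1)
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]


def add_noise(sentence, rng, p=0.5):
    """Corrupt each token with probability p (slang, homophone, or typo)."""
    out = []
    for tok in sentence.split():
        if rng.random() >= p:
            out.append(tok)              # leave the token clean
        elif tok in SLANG:
            out.append(SLANG[tok])       # slang substitution
        elif tok in HOMOPHONES:
            out.append(HOMOPHONES[tok])  # homophone substitution
        else:
            out.append(typo(tok, rng))   # character-level typo
    return " ".join(out)


def make_pairs(clean_sentences, n_variants=2, seed=0):
    """Return (noisy, clean) pairs; several noisy variants per clean sentence."""
    rng = random.Random(seed)
    return [(add_noise(s, rng), s)
            for s in clean_sentences
            for _ in range(n_variants)]
```

Under these assumptions, `make_pairs(["see you tonight"], n_variants=3)` yields three pairs that share the same clean target, which is how a small clean corpus could be expanded into a larger training set. A seeded `random.Random` instance keeps the generated data reproducible.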