Multilingual Domain Adaptation for Speech Recognition Using LLMs


Ulu E. N., Derya E., Tumer D., Demirel B., Karamanlioglu A.

28th International Conference on Text, Speech and Dialogue (TSD 2025), Erlangen, Germany, 25-28 August 2025, vol. 16029, pp. 381-393 (Full Text)

  • Publication Type: Conference Paper / Full Text
  • Volume: 16029
  • DOI Number: 10.1007/978-3-032-02548-7_32
  • City: Erlangen
  • Country: Germany
  • Page Numbers: pp.381-393
  • Middle East Technical University Affiliated: Yes

Abstract

We present a practical pipeline for multilingual domain adaptation in automatic speech recognition (ASR) that combines the Whisper model with large language models (LLMs). Using Aya-23-8B, Common Voice transcripts in 22 languages are automatically classified into the Law and Healthcare domains, producing high-quality domain labels at a fraction of the manual cost. These labels drive parameter-efficient (LoRA) fine-tuning of Whisper and deliver consistent relative Word Error Rate (WER) reductions of up to 14.3% for languages that contribute at least 800 in-domain utterances. A data-volume analysis reveals a clear breakpoint: gains become reliably large once that 800-utterance threshold is crossed, while monolingual tuning still rescues performance in truly low-resource settings. The workflow therefore shifts the key success factor from expensive hand labelling to scalable data acquisition, and can be replicated in new domains with minimal human intervention.
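The headline metric, relative WER reduction, and the 800-utterance breakpoint can be sketched as follows. This is an illustrative computation only: the language names, utterance counts, and WER values below are hypothetical placeholders, not results from the paper.

```python
# Illustrative sketch: all counts and WER values here are hypothetical,
# not figures reported in the paper.

def relative_wer_reduction(wer_base: float, wer_finetuned: float) -> float:
    """Relative WER reduction: (base - finetuned) / base."""
    return (wer_base - wer_finetuned) / wer_base

# Hypothetical per-language in-domain utterance counts and WERs.
languages = {
    "lang_a": {"utterances": 1200, "wer_base": 0.28, "wer_ft": 0.24},
    "lang_b": {"utterances": 300,  "wer_base": 0.35, "wer_ft": 0.34},
}

THRESHOLD = 800  # data-volume breakpoint described in the abstract

for name, stats in languages.items():
    gain = relative_wer_reduction(stats["wer_base"], stats["wer_ft"])
    regime = "above" if stats["utterances"] >= THRESHOLD else "below"
    print(f"{name}: {gain:.1%} relative WER reduction ({regime} threshold)")
```

With these placeholder numbers, `lang_a` (above the threshold) shows a 14.3% relative reduction, while `lang_b` (below it) shows only 2.9%, mirroring the breakpoint pattern the abstract describes.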