Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), Turin, Italy, 20-25 May 2024, pp. 5977-5985
Image-to-text generation is the automatic production of descriptive text from images; this paper focuses on
generating reports from X-ray images. However, traditional approaches often exhibit a semantic gap between
visual and textual information. In this paper, we propose a multi-task learning framework to leverage both visual and
non-imaging data for generating radiology reports. Along with chest X-ray images, ten additional features comprising
numeric, binary, categorical, and text data were incorporated to create a unified representation. The model was
trained to generate text, predict the degree of patient severity, and identify medical findings. Multi-task learning,
especially when text generation was prioritised, improved performance over single-task baselines across language
generation metrics. The framework also mitigated overfitting in the auxiliary tasks compared to single-task models.
Qualitative analysis shows more coherent narratives and more accurate identification of findings, though some
repetition and disjointed phrasing remain. This study demonstrates the benefits of multi-modal, multi-task learning for
image-to-text generation applications.
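
To make the described setup concrete, the following is a minimal sketch (not the authors' implementation) of a multi-modal, multi-task model of the kind outlined above: an encoder for chest X-ray image features and an encoder for the ten non-imaging features are fused into a unified representation that feeds three heads, report generation, severity prediction, and finding identification, with a weighted loss that prioritises text generation. All module names, dimensions, and loss weights are illustrative assumptions.

import torch
import torch.nn as nn

class MultiModalMultiTaskModel(nn.Module):
    def __init__(self, vocab_size=10000, img_dim=2048, feat_dim=10,
                 hidden=512, num_findings=14, num_severity=4):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, hidden)    # projects pre-extracted image features
        self.feat_proj = nn.Linear(feat_dim, hidden)  # projects the 10 non-imaging features
        self.fuse = nn.Linear(2 * hidden, hidden)     # unified multi-modal representation
        # Report generator: a single-layer GRU decoder stands in for the text head.
        self.embed = nn.Embedding(vocab_size, hidden)
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.lm_head = nn.Linear(hidden, vocab_size)
        # Auxiliary heads.
        self.severity_head = nn.Linear(hidden, num_severity)  # degree of patient severity
        self.findings_head = nn.Linear(hidden, num_findings)  # multi-label medical findings

    def forward(self, img_feats, tab_feats, report_tokens):
        fused = torch.tanh(self.fuse(torch.cat(
            [self.img_proj(img_feats), self.feat_proj(tab_feats)], dim=-1)))
        h0 = fused.unsqueeze(0)  # initialise the decoder with the fused state
        dec_out, _ = self.decoder(self.embed(report_tokens), h0)
        return self.lm_head(dec_out), self.severity_head(fused), self.findings_head(fused)

def multitask_loss(lm_logits, sev_logits, find_logits,
                   report_targets, sev_targets, find_targets,
                   w_text=1.0, w_sev=0.3, w_find=0.3):
    # Weighted sum of the three task losses; w_text > w_sev, w_find reflects the
    # text-generation prioritisation mentioned above (the weights are assumed).
    lm = nn.functional.cross_entropy(lm_logits.flatten(0, 1), report_targets.flatten())
    sev = nn.functional.cross_entropy(sev_logits, sev_targets)
    fnd = nn.functional.binary_cross_entropy_with_logits(find_logits, find_targets)
    return w_text * lm + w_sev * sev + w_find * fnd

Under these assumptions, sharing the fused representation across the three heads is what lets the auxiliary severity and finding signals regularise the report generator, while the larger text-generation weight keeps the language-generation objective dominant.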