Enhancing Image-to-Text Generation in Radiology Reports through Cross-modal Multi-Task Learning



Aksoy N., Sharoff S., Ravikumar N.

Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), Turin, Italy, 20-25 May 2024, pp. 5977-5985

  • Publication Type: Conference Paper / Full Text
  • City: Turin
  • Country: Italy
  • Page Numbers: pp.5977-5985
  • Middle East Technical University Affiliated: No

Abstract

Image-to-text generation is the automatic production of descriptive text from images; this paper focuses on generating radiology reports from X-ray images. Traditional approaches, however, often exhibit a semantic gap between visual and textual information. We propose a multi-task learning framework that leverages both visual and non-imaging data to generate radiology reports. Alongside chest X-ray images, 10 additional features comprising numeric, binary, categorical, and text data were incorporated to create a unified representation. The model was trained jointly to generate text, predict the degree of patient severity, and identify medical findings. Multi-task learning, especially when text generation was prioritised, improved performance over single-task baselines across language generation metrics. The framework also mitigated overfitting in the auxiliary tasks compared to single-task models. Qualitative analysis shows more coherent narratives and more accurate identification of findings, though some repetition and disjointed phrasing remain. This study demonstrates the benefits of multi-modal, multi-task learning for image-to-text generation applications.
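The "text generation prioritisation" described in the abstract typically amounts to weighting the generation loss more heavily than the auxiliary-task losses when forming the joint training objective. A minimal sketch of that idea is below; the task names, loss values, and weights are hypothetical illustrations, not values taken from the paper:

```python
def multitask_loss(task_losses, task_weights):
    """Combine per-task losses into a single training objective.

    Prioritising text generation, as the abstract describes, amounts to
    assigning the generation loss a larger weight than the auxiliary
    tasks (severity prediction and findings identification).
    """
    return sum(task_weights[task] * loss for task, loss in task_losses.items())

# Hypothetical per-batch losses for the three tasks named in the abstract.
losses = {"report_generation": 2.40, "severity": 0.80, "findings": 0.50}

# Hypothetical weights: generation is prioritised over the auxiliary tasks.
weights = {"report_generation": 1.0, "severity": 0.3, "findings": 0.3}

total = multitask_loss(losses, weights)  # 1.0*2.40 + 0.3*0.80 + 0.3*0.50
```

In practice the weighted sum would be backpropagated through a shared encoder with task-specific heads, so the auxiliary tasks regularise the shared representation without dominating the generation objective.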