Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), Turin, Italy, 20-25 May 2024, pp. 5977-5985
Image-to-text generation is the automatic production of descriptive text from images; this paper focuses on
generating reports from X-ray images. However, traditional approaches often exhibit a semantic gap between
visual and textual information. In this paper, we propose a multi-task learning framework to leverage both visual and
non-imaging data for generating radiology reports. Along with chest X-ray images, ten additional features comprising
numeric, binary, categorical, and text data were incorporated to create a unified representation. The model was
trained to generate text, predict the degree of patient severity, and identify medical findings. Multi-task learning,
especially when text generation was prioritised, improved performance over single-task baselines across language
generation metrics. The framework also mitigated overfitting in the auxiliary tasks compared to single-task models.
Qualitative analysis shows more coherent narratives and more accurate identification of findings, though some
repetition and disjointed phrasing remain. This study demonstrates the benefits of multi-modal, multi-task learning for
image-to-text generation applications.
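
To make the described setup concrete, the following is a minimal sketch (not the authors' implementation) of a multi-modal, multi-task model of the kind outlined above: an encoder for chest X-ray image features and an encoder for the ten non-imaging features are fused into a unified representation that feeds three heads, report generation, severity prediction, and finding identification, with a weighted loss that prioritises text generation. All module names, dimensions, and loss weights are illustrative assumptions.

import torch
import torch.nn as nn

class MultiModalMultiTaskModel(nn.Module):
    def __init__(self, vocab_size=10000, img_dim=2048, feat_dim=10,
                 hidden=512, num_findings=14, num_severity=4):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, hidden)    # projects pre-extracted image features
        self.feat_proj = nn.Linear(feat_dim, hidden)  # projects the 10 non-imaging features
        self.fuse = nn.Linear(2 * hidden, hidden)     # unified multi-modal representation
        # Report generator: a single-layer GRU decoder stands in for the text head.
        self.embed = nn.Embedding(vocab_size, hidden)
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.lm_head = nn.Linear(hidden, vocab_size)
        # Auxiliary heads.
        self.severity_head = nn.Linear(hidden, num_severity)  # degree of patient severity
        self.findings_head = nn.Linear(hidden, num_findings)  # multi-label medical findings

    def forward(self, img_feats, tab_feats, report_tokens):
        fused = torch.tanh(self.fuse(torch.cat(
            [self.img_proj(img_feats), self.feat_proj(tab_feats)], dim=-1)))
        h0 = fused.unsqueeze(0)  # initialise the decoder with the fused state
        dec_out, _ = self.decoder(self.embed(report_tokens), h0)
        return self.lm_head(dec_out), self.severity_head(fused), self.findings_head(fused)

def multitask_loss(lm_logits, sev_logits, find_logits,
                   report_targets, sev_targets, find_targets,
                   w_text=1.0, w_sev=0.3, w_find=0.3):
    # Weighted sum of the three task losses; w_text > w_sev, w_find reflects the
    # text-generation prioritisation mentioned above (the weights are assumed).
    lm = nn.functional.cross_entropy(lm_logits.flatten(0, 1), report_targets.flatten())
    sev = nn.functional.cross_entropy(sev_logits, sev_targets)
    fnd = nn.functional.binary_cross_entropy_with_logits(find_logits, find_targets)
    return w_text * lm + w_sev * sev + w_find * fnd

Under these assumptions, sharing the fused representation across the three heads is what lets the auxiliary severity and finding signals regularise the report generator, while the larger text-generation weight keeps the language-generation objective dominant.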