LLM-Generated Rewrite and Context Modulation for Enhanced Vision Language Models in Digital Pathology


Bahadir C. D., Akar G., Sabuncu M. R.

2025 IEEE/CVF Winter Conference on Applications of Computer Vision, WACV 2025, Arizona, United States, 28 February - 4 March 2025, pp. 327-336, (Full Text Paper)

  • Publication Type: Conference Paper / Full Text Paper
  • DOI Number: 10.1109/wacv61041.2025.00042
  • Publication City: Arizona
  • Publication Country: United States
  • Page Numbers: pp. 327-336
  • Keywords: digital pathology, large language models, vision language models
  • Affiliated with Orta Doğu Teknik Üniversitesi: Yes

Abstract

Recent advancements in vision-language models (VLMs) have found important applications in medical imaging, particularly in digital pathology. VLMs demand large-scale datasets of image-caption pairs, which are often hard to obtain in medical domains. State-of-the-art VLMs in digital pathology have been pre-trained on datasets that are significantly smaller than their computer vision counterparts. Furthermore, the caption of a pathology slide often refers to a small subset of features in the image, an important point that is ignored in existing VLM pre-training schemes. Another important issue that is under-appreciated is that the performance of state-of-the-art VLMs on zero-shot classification tasks can be sensitive to the choice of prompts. In this paper, we first employ language rewrites using a large language model (LLM) to enrich a public pathology image-caption dataset, which we make publicly available. Our extensive experiments demonstrate that training with language rewrites boosts the performance of a state-of-the-art digital pathology VLM on downstream tasks such as zero-shot classification, text-to-image retrieval, and image-to-text retrieval. We further leverage LLMs to demonstrate the sensitivity of zero-shot classification results to the choice of prompts and propose a scalable approach to characterize this sensitivity when comparing models. Finally, we present a novel context modulation layer that adjusts the image embeddings to better align with the paired text, and we use context-specific language rewrites to train this layer. Our results show that the proposed context modulation framework can yield further substantial performance gains.
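
Although the abstract does not spell out the training pipeline, the language-rewrite idea can be pictured as text-side data augmentation for contrastive (CLIP-style) training: each image keeps its original caption plus several LLM paraphrases, and one is sampled per step. The sketch below is a minimal, hypothetical illustration; the record fields `caption` and `rewrites` are assumptions, not the released dataset's schema.

```python
import random

def sample_caption(example: dict) -> str:
    """Pick the training caption for one image: either the original caption
    or one of its LLM-generated rewrites, chosen uniformly at random.
    Treating rewrites as text-side augmentation is the general idea here;
    the field names are assumptions, not the paper's actual schema.
    """
    candidates = [example["caption"]] + example.get("rewrites", [])
    return random.choice(candidates)

# Hypothetical record:
# example = {
#     "caption": "adenocarcinoma, colon, H&E",
#     "rewrites": [
#         "H&E-stained colon tissue showing adenocarcinoma",
#         "colonic adenocarcinoma on hematoxylin and eosin stain",
#     ],
# }
```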
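The prompt-sensitivity analysis can likewise be sketched: embed several LLM-generated paraphrases of each class prompt and report the distribution of zero-shot accuracies across paraphrase sets, rather than a single number tied to one hand-written prompt. The function names and the assumption of L2-normalized embeddings below are ours, not the paper's.

```python
import numpy as np

def zero_shot_accuracy(image_embs, labels, class_embs):
    """CLIP-style zero-shot accuracy: cosine similarity between each image
    embedding and one text embedding per class, then argmax over classes.
    Embeddings are assumed L2-normalized, so dot product equals cosine."""
    sims = image_embs @ class_embs.T                     # (N, C) similarities
    return float((sims.argmax(axis=1) == labels).mean())

def prompt_sensitivity(image_embs, labels, variants):
    """Accuracy spread over K LLM-generated prompt paraphrases per class.
    `variants[c]` is a (K, D) array of prompt embeddings for class c;
    the k-th paraphrase of every class forms one candidate prompt set."""
    K = variants[0].shape[0]
    accs = [
        zero_shot_accuracy(
            image_embs, labels, np.stack([v[k] for v in variants])
        )
        for k in range(K)
    ]
    return float(np.mean(accs)), float(np.std(accs))  # report the spread
```

Comparing models by the mean and standard deviation over prompt sets, instead of a single prompt, is one scalable way to expose the sensitivity the abstract describes.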
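Finally, one plausible reading of the context modulation layer is a FiLM-style conditioning of the image embedding on a context vector (derived, for instance, from the tissue type or task context). The PyTorch sketch below is an illustrative guess under that assumption, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class ContextModulation(nn.Module):
    """FiLM-style modulation of an image embedding by a context vector.
    Illustrative guess at the idea of a context modulation layer; the
    paper's actual layer may differ."""

    def __init__(self, embed_dim: int, context_dim: int):
        super().__init__()
        # Predict a per-dimension scale and shift from the context.
        self.to_scale = nn.Linear(context_dim, embed_dim)
        self.to_shift = nn.Linear(context_dim, embed_dim)

    def forward(self, image_emb: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
        scale = 1.0 + self.to_scale(context)   # residual scaling around identity
        shift = self.to_shift(context)
        modulated = scale * image_emb + shift
        # Re-normalize so the embedding stays unit-length, as in CLIP-style models.
        return modulated / modulated.norm(dim=-1, keepdim=True)
```

Training such a layer with context-specific language rewrites as the paired text targets would then pull the modulated image embedding toward the caption features relevant to that context.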