Caption generation on scenes with seen and unseen object categories


Demirel B., CİNBİŞ R. G.

Image and Vision Computing, vol. 124, 2022 (SCI-Expanded)

  • Publication Type: Article / Full Article
  • Volume: 124
  • Publication Date: 2022
  • DOI: 10.1016/j.imavis.2022.104515
  • Journal Name: Image and Vision Computing
  • Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus, Academic Search Premier, Applied Science & Technology Source, Biotechnology Research Abstracts, Computer & Applied Sciences, INSPEC
  • Keywords: Zero-shot learning, Zero-shot image captioning
  • Affiliated with Orta Doğu Teknik Üniversitesi (Middle East Technical University): Yes

Abstract

Image caption generation is one of the most challenging problems at the intersection of vision and language domains. In this work, we propose a realistic captioning task where the input scenes may incorporate visual objects with no corresponding visual or textual training examples. For this problem, we propose a detection-driven approach that consists of a single-stage generalized zero-shot detection model to recognize and localize instances of both seen and unseen classes, and a template-based captioning model that transforms detections into sentences. To improve the generalized zero-shot detection model, which provides essential information for captioning, we define effective class representations in terms of class-to-class semantic similarities, and leverage their special structure to construct an effective unseen/seen class confidence score calibration mechanism. We also propose a novel evaluation metric that provides additional insights for the captioning outputs by separately measuring the visual and non-visual contents of generated sentences. Our experiments highlight the importance of studying captioning in the proposed zero-shot setting, and verify the effectiveness of the proposed detection-driven zero-shot captioning approach.