Learning Multi-Modal Nonlinear Embeddings: Performance Bounds and an Algorithm

Kaya, Semih; VURAL, ELİF

doi:10.1109/tip.2021.3071688

Kaya S., VURAL E.

IEEE TRANSACTIONS ON IMAGE PROCESSING, vol.30, pp.4384-4394, 2021 (SCI-Expanded)

Publication Type: Article / Full Article
Volume: 30
Publication Date: 2021
DOI: 10.1109/tip.2021.3071688
Journal Name: IEEE TRANSACTIONS ON IMAGE PROCESSING
Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus, Academic Search Premier, PASCAL, Aerospace Database, Applied Science & Technology Source, Business Source Elite, Business Source Premier, Communication Abstracts, Compendex, Computer & Applied Sciences, EMBASE, INSPEC, MEDLINE, Metadex, zbMATH, Civil Engineering Abstracts
Pages: pp.4384-4394
Keywords: Training, Kernel, Interpolation, Data models, Geometry, Learning systems, Deep learning, Multi-modal learning, multi-view learning, cross-modal retrieval, nonlinear embeddings, supervised embeddings, RBF interpolators
Middle East Technical University Affiliated: Yes

Abstract

While many approaches exist in the literature to learn low-dimensional representations for data collections in multiple modalities, the generalizability of multi-modal nonlinear embeddings to previously unseen data is a rather overlooked subject. In this work, we first present a theoretical analysis of learning multi-modal nonlinear embeddings in a supervised setting. Our performance bounds indicate that for successful generalization in multi-modal classification and retrieval problems, the regularity of the interpolation functions extending the embedding to the whole data space is as important as the between-class separation and cross-modal alignment criteria. We then propose a multi-modal nonlinear representation learning algorithm that is motivated by these theoretical findings, where the embeddings of the training samples are optimized jointly with the Lipschitz regularity of the interpolators. Experimental comparison to recent multi-modal and single-modal learning algorithms suggests that the proposed method yields promising performance in multi-modal image classification and cross-modal image-text retrieval applications.
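
To make the role of interpolator regularity concrete, the following is a minimal sketch, not the authors' algorithm. It assumes a Gaussian RBF interpolator for a single modality, fits it to given training embeddings, and computes a simple upper bound on its Lipschitz constant. The function names, the kernel width `sigma`, the `ridge` term, and the random data are all illustrative assumptions. The paper optimizes the embeddings jointly with the regularity of the interpolators, which this sketch does not attempt.

```python
import numpy as np

def gaussian_kernel(A, B, sigma):
    """Gaussian RBF values exp(-||a - b||^2 / sigma^2) for all pairs of rows."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / sigma**2)

def fit_rbf_interpolator(X_train, Y_train, sigma, ridge=1e-8):
    """Solve (K + ridge * I) C = Y for the coefficient matrix C.

    The small ridge term is an assumption added for numerical stability.
    """
    K = gaussian_kernel(X_train, X_train, sigma)
    return np.linalg.solve(K + ridge * np.eye(len(X_train)), Y_train)

def embed(X_new, X_train, C, sigma):
    """Extend the embedding to new points: f(x) = sum_i phi(||x - x_i||) c_i."""
    return gaussian_kernel(X_new, X_train, sigma) @ C

def lipschitz_upper_bound(C, sigma):
    """Upper bound on the Lipschitz constant of f.

    phi(r) = exp(-r^2 / sigma^2) has max |phi'(r)| = sqrt(2) * exp(-1/2) / sigma,
    so ||f(x) - f(y)|| <= that value * sum_i ||c_i|| * ||x - y||.
    """
    phi_lipschitz = np.sqrt(2.0) * np.exp(-0.5) / sigma
    return phi_lipschitz * np.linalg.norm(C, axis=1).sum()

# Illustrative usage on random data (hypothetical sizes, not from the paper).
rng = np.random.default_rng(0)
X_train = rng.normal(size=(20, 5))   # training samples of one modality
Y_train = rng.normal(size=(20, 2))   # their low-dimensional embeddings
C = fit_rbf_interpolator(X_train, Y_train, sigma=1.0)
Y_new = embed(rng.normal(size=(3, 5)), X_train, C, sigma=1.0)
L_bound = lipschitz_upper_bound(C, sigma=1.0)
```

A smaller bound means nearby inputs cannot be mapped far apart, which is the kind of regularity the abstract ties to generalization on unseen data. In this sketch the bound grows with the coefficient norms and shrinks as `sigma` increases.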