Multi-modal egocentric activity recognition using multi-kernel learning

Arabaci, Mehmet; Ozkan, Fatih; SÜRER, ELİF; Jancovic, Peter; TEMİZEL, ALPTEKİN

doi:10.1007/s11042-020-08789-7

Arabaci M. A., Ozkan F., SÜRER E., Jancovic P., TEMİZEL A.

Multimedia Tools and Applications, vol.80, no.11, pp.16299-16328, 2021 (SCI-Expanded)

Publication Type: Article / Full Article
Volume: 80 Issue: 11
Publication Date: 2021
DOI Number: 10.1007/s11042-020-08789-7
Journal Name: Multimedia Tools and Applications
Indexes Covering the Journal: Science Citation Index Expanded (SCI-EXPANDED), Scopus, FRANCIS, ABI/INFORM, Applied Science & Technology Source, Compendex, Computer & Applied Sciences, INSPEC, zbMATH
Pages: pp.16299-16328
Keywords: Egocentric, First-person vision, Activity recognition, Multi-kernel learning, Multi-modality, SPEAKER, FUSION, KERNEL, DEPTH
Affiliated with Middle East Technical University: Yes

Abstract

© 2020, Springer Science+Business Media, LLC, part of Springer Nature. Existing methods for egocentric activity recognition are mostly based on extracting motion characteristics from videos. On the other hand, the ubiquity of wearable sensors allows information to be acquired from different sources. Although the increase in sensor diversity brings out the need for adaptive fusion, most studies use pre-determined weights for each source. In addition, only a limited number of studies make use of optical, audio and wearable sensors together. In this work, we propose a new framework that adaptively weights the visual, audio and sensor features in relation to their discriminative abilities. For that purpose, multi-kernel learning (MKL) is used to fuse multi-modal features, with the feature and kernel selection/weighting and recognition tasks performed concurrently. Audio-visual information is used in association with the data acquired from wearable sensors, since the two hold information on different aspects of activities and help build better models. The proposed framework can be used with different modalities to improve the recognition accuracy and can easily be extended with additional sensors. The results show that using multi-modal features with MKL outperforms the existing methods.
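
The idea of weighting each modality's kernel by how discriminative it is can be illustrated with a small sketch. This is not the method used in the paper: the abstract does not specify the MKL solver, and an MKL solver learns the kernel weights together with the classifier. Here a simpler stand-in heuristic, kernel-target alignment, scores each modality's kernel against the labels, and the normalized scores serve as fusion weights. The function names, the RBF kernel choice, the `gamma` value and the toy feature values are all assumptions made up for illustration.

```python
import math


def rbf_kernel(X, gamma=1.0):
    """Gram matrix K[i][j] = exp(-gamma * ||x_i - x_j||^2) for one modality."""
    return [
        [math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(xi, xj))) for xj in X]
        for xi in X
    ]


def alignment(K, y):
    """Kernel-target alignment <K, yy^T>_F / (||K||_F * ||yy^T||_F).

    Assumes binary labels in {-1, +1}, so that ||yy^T||_F equals len(y).
    """
    n = len(y)
    num = sum(K[i][j] * y[i] * y[j] for i in range(n) for j in range(n))
    norm_k = math.sqrt(sum(K[i][j] ** 2 for i in range(n) for j in range(n)))
    return num / (norm_k * n)


def fuse_kernels(kernels, y):
    """Weight each kernel by its (non-negative) alignment and sum them."""
    scores = [max(alignment(K, y), 0.0) for K in kernels]
    total = sum(scores)
    if total > 0:
        weights = [s / total for s in scores]
    else:
        # No kernel aligns with the labels: fall back to uniform weights.
        weights = [1.0 / len(kernels)] * len(kernels)
    n = len(y)
    fused = [
        [sum(w * K[i][j] for w, K in zip(weights, kernels)) for j in range(n)]
        for i in range(n)
    ]
    return weights, fused


# Hypothetical toy data: four samples, two activities, one feature per modality.
labels = [-1, -1, 1, 1]
visual = [[0.0], [0.1], [1.0], [1.1]]   # separates the two classes well
audio = [[0.0], [0.6], [0.5], [1.0]]    # separates them only partly
sensor = [[0.0], [1.0], [0.05], [1.0]]  # carries almost no class information

kernels = [rbf_kernel(m) for m in (visual, audio, sensor)]
weights, fused = fuse_kernels(kernels, labels)
print(dict(zip(("visual", "audio", "sensor"), weights)))
```

The fused matrix is a non-negative combination of valid kernels, so it is itself a valid kernel and could be passed to any classifier that accepts a precomputed Gram matrix. Adding another sensor under this sketch only means appending one more kernel to the list.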