Neural Computing and Applications, 2024 (SCI-Expanded)
The widespread adoption of wearable devices equipped with advanced sensor technologies has fueled the rapid growth of egocentric video capture, known as First Person Vision (FPV). Unlike traditional third-person videos, FPV exhibits distinct characteristics such as significant ego-motion and frequent scene changes, rendering conventional vision-based methods ineffective. This paper introduces a novel audio-visual decision fusion framework for egocentric activity recognition (EAR) that addresses these challenges. The proposed framework employs a two-stage decision fusion pipeline with explicit weight learning, integrating audio and visual cues to enhance overall recognition performance. Additionally, a new publicly available dataset, the Egocentric Outdoor Activity Dataset, comprising 1392 video clips spanning 30 diverse outdoor activities, is introduced to facilitate comparative evaluations of EAR algorithms and spur further research in the field. Experimental results demonstrate that integrating audio and visual information significantly improves activity recognition performance, outperforming both single-modality approaches and equally weighted decisions from multiple modalities.
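The abstract does not detail how the explicit fusion weights are parameterized or learned, so the following is only a minimal illustrative sketch of learned-weight late (decision-level) fusion over per-modality class probabilities, written in PyTorch. The module name, the softmax-normalized scalar weights, and the 30-class example are assumptions for illustration, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class WeightedDecisionFusion(nn.Module):
    """Late fusion of per-modality class-probability vectors using
    explicitly learned, softmax-normalized fusion weights (hypothetical sketch)."""

    def __init__(self, num_modalities: int = 2):
        super().__init__()
        # One learnable scalar weight per modality (e.g., audio, visual).
        self.weight_logits = nn.Parameter(torch.zeros(num_modalities))

    def forward(self, modality_scores: list[torch.Tensor]) -> torch.Tensor:
        # modality_scores: list of (batch, num_classes) probability tensors.
        w = torch.softmax(self.weight_logits, dim=0)     # weights sum to 1
        stacked = torch.stack(modality_scores, dim=0)    # (M, batch, classes)
        fused = (w.view(-1, 1, 1) * stacked).sum(dim=0)  # weighted average
        return fused                                     # (batch, num_classes)

# Usage example: fuse softmax outputs of separate audio and visual classifiers.
audio_probs = torch.softmax(torch.randn(4, 30), dim=1)   # 30 activity classes (assumed)
visual_probs = torch.softmax(torch.randn(4, 30), dim=1)
fusion = WeightedDecisionFusion(num_modalities=2)
predictions = fusion([audio_probs, visual_probs]).argmax(dim=1)
```

In this sketch the fusion weights would be trained jointly with (or after) the per-modality classifiers by minimizing a standard cross-entropy loss on the fused probabilities; the paper's two-stage pipeline may learn its weights differently.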