MULTIMEDIA SYSTEMS, vol. 24, no. 1, pp. 55-72, 2018 (SCI-Expanded)
In this paper, we propose a multi-modal event recognition framework based on the integration of feature fusion, deep learning, scene classification, and decision fusion. Frames, shots, and scenes are identified through a video decomposition process. Events are modeled using the features of, and the relations between, these physical video parts. Event modeling is achieved through visual concept learning, scene segmentation, and association rule mining. Visual concept learning is employed to bridge the semantic gap between the visual content and the textual descriptors of the events. Association rules are discovered by a specialized association rule mining algorithm in which the proposed strategy integrates temporality into the rule discovery process. In addition to frames, shots, and scenes, the concept of a scene segment is introduced to define and extract the elements of association rules. Diverse feature sources, such as audio, motion, keypoint descriptors, temporal occurrence characteristics, and the fully connected layer outputs of a CNN model, are combined through feature fusion. The proposed decision fusion approach employs logistic regression to express the relation between the dependent variable (event type) and the independent variables (classifier outputs) in terms of decision weights. Multi-modal fusion-based scene classifiers are employed in event recognition. Rule-based event modeling and multi-modal fusion are shown to be promising approaches for event recognition. The decision fusion results are promising, and the proposed algorithm remains open to the fusion of new sources and to the integration of new event types. The accuracy of the proposed methodology is evaluated on the CCV and Hollywood2 datasets, and the results are compared with benchmark implementations from the literature.
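To make the temporally constrained rule mining concrete, the following is a minimal sketch, not the paper's specialized algorithm: it assumes each scene segment yields an ordered sequence of detected visual concepts (the concept names are hypothetical), and it counts a rule a -> b only when concept a is observed before concept b within the same segment.

```python
from collections import Counter

def temporal_rules(segments, min_support=0.3, min_conf=0.6):
    """Toy temporally constrained rule miner: a rule (a -> b) counts
    only when concept a appears before concept b inside the same
    scene segment. An illustrative stand-in, not the paper's algorithm."""
    n = len(segments)
    pair_counts, item_counts = Counter(), Counter()
    for seq in segments:
        seen, seg_pairs = set(), set()
        for concept in seq:
            for earlier in seen:
                if earlier != concept:
                    seg_pairs.add((earlier, concept))
            seen.add(concept)
        pair_counts.update(seg_pairs)   # each pair counted at most once per segment
        item_counts.update(set(seq))
    rules = []
    for (a, b), count in pair_counts.items():
        support, confidence = count / n, count / item_counts[a]
        if support >= min_support and confidence >= min_conf:
            rules.append((a, b, support, confidence))
    return rules

# Hypothetical per-segment concept sequences.
segments = [["crowd", "ball", "cheer"], ["crowd", "cheer"], ["ball", "whistle"]]
print(temporal_rules(segments))  # e.g. [('crowd', 'cheer', 0.67, 1.0)]
```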
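The feature fusion step can likewise be sketched in a few lines, assuming fixed-length descriptors per sample; the modality names and dimensions below are illustrative, not taken from the paper. Each modality vector is normalized and concatenated into a single descriptor for the scene classifiers.

```python
import numpy as np

def fuse_features(audio, motion, keypoints, temporal, cnn_fc):
    """Early (feature-level) fusion: L2-normalize each modality,
    then concatenate into one descriptor. Inputs are 1-D arrays;
    all dimensions are placeholders."""
    parts = [audio, motion, keypoints, temporal, cnn_fc]
    normed = [p / (np.linalg.norm(p) + 1e-12) for p in parts]  # guard against zero vectors
    return np.concatenate(normed)

# Hypothetical dimensions: audio stats, motion histogram, bag of
# keypoint descriptors, temporal-occurrence stats, CNN fc-layer output.
fused = fuse_features(np.random.rand(64), np.random.rand(32),
                      np.random.rand(500), np.random.rand(8),
                      np.random.rand(4096))
print(fused.shape)  # (4700,)
```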
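Finally, the decision fusion idea, learning per-classifier decision weights with logistic regression over the base classifiers' outputs, can be illustrated as follows. This is a hedged sketch using scikit-learn on synthetic placeholder data, not the paper's exact formulation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Decision fusion: each row holds the scores produced by the individual
# scene/event classifiers for one sample; the label is the event outcome.
# Shapes and data are synthetic placeholders.
rng = np.random.default_rng(0)
n_samples, n_classifiers = 200, 5
X = rng.random((n_samples, n_classifiers))   # base classifier output scores
y = rng.integers(0, 2, size=n_samples)       # event present / absent

fuser = LogisticRegression()
fuser.fit(X, y)

# The learned coefficients act as decision weights: how strongly each
# base classifier contributes to the fused event decision.
print(fuser.coef_)                 # decision weights per classifier
print(fuser.predict_proba(X[:3]))  # fused event probabilities
```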