In this work, we propose a framework for decision-level multimodal data fusion built on a multilayer hierarchical ensemble learning architecture. The architecture provides a generative-discriminative model for probability density estimation and reduces the entropy of the data as it propagates through the vector spaces. We implement the architecture for the human motion detection problem, formulating motion analysis as a multi-class classification problem on audio-visual data. The vector space transformations are analyzed by investigating the probability density and entropy transitions of the data across the levels. The architecture provides an efficient sensor fusion framework for robotics research, object classification, and target detection and tracking applications.
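As a minimal sketch of the decision-level fusion idea described above, the snippet below combines per-modality class posteriors (e.g., from an audio classifier and a visual classifier) by weighted averaging and takes the argmax as the fused decision. The function name, the uniform default weights, and the example posterior values are illustrative assumptions, not the paper's specific ensemble rule.

```python
# Hypothetical decision-level fusion: each modality's classifier outputs a
# posterior distribution over the same classes; we fuse by weighted averaging.
import numpy as np

def fuse_decisions(posteriors, weights=None):
    """Fuse per-classifier posterior vectors (each row sums to 1).

    posteriors: iterable of length-n_classes probability vectors
    weights:    optional per-classifier ensemble weights (default: uniform)
    Returns the fused posterior and the argmax class index.
    """
    P = np.asarray(posteriors, dtype=float)   # shape (n_classifiers, n_classes)
    if weights is None:
        weights = np.full(len(P), 1.0 / len(P))  # uniform ensemble weights
    fused = np.asarray(weights, dtype=float) @ P  # weighted average of posteriors
    return fused, int(np.argmax(fused))

# Illustrative values: audio evidence favors class 1, video favors class 0.
audio_posterior = [0.2, 0.7, 0.1]
video_posterior = [0.5, 0.3, 0.2]
fused, label = fuse_decisions([audio_posterior, video_posterior])
# fused is the averaged distribution; label is the fused class decision
```

More elaborate combiners (product rules, stacked meta-classifiers, per-level re-weighting) fit the same interface: each layer consumes the posteriors of the layer below and emits a new distribution over classes.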