This study presents a new approach to the trade-off between accuracy and energy efficiency in Wireless Multimedia Sensor Networks. Although a number of previous studies have examined specific topics in Wireless Multimedia Sensor Networks in detail, to the best of our knowledge, none presents a fuzzy multi-modal data fusion system that is lightweight and achieves a high accuracy ratio. In particular, multi-modal data fusion for surveillance applications makes working within a multi-level hierarchical framework unavoidable. In this study, we primarily focus on accuracy and efficiency by utilizing such a framework. To evaluate the performance of the proposed framework, a set of experiments is conducted and the results are presented.