With improvements in the area of Internet of Things (IoT), surveillance systems have recently become more accessible. At the same time, optimizing the energy requirements of smart sensors, especially for data transmission, has always been very important and the energy efficiency of IoT systems has been the subject of numerous studies. For environmental monitoring scenarios, it is possible to extract more accurate information using smart multimedia sensors. However, multimedia data transmission is an expensive operation. In this study, a novel hierarchical approach is presented for the detection of forest fires. The proposed framework introduces a new approach in which multimedia and scalar sensors are used hierarchically to minimize the transmission of visual data. A lightweight deep learning model is also developed for devices at the edge of the network to improve detection accuracy and reduce the traffic between the edge devices and the sink. The framework is evaluated using a real testbed, network simulations, and 10-fold cross-validation in terms of energy efficiency and detection accuracy. Based on the results of our experiments, the validation accuracy of the proposed system is 98.28%, and the energy saving is 29.94%. The proposed deep learning model's validation accuracy is very close to the accuracy of the best performing architectures when the existing studies and lightweight architectures are considered. In terms of suitability for edge computing, the proposed approach is superior to the existing ones with reduced computational requirements and model size.