© 2022 Elsevier Ltd

Although occupancy information is critical to the energy consumption of existing buildings, it remains a major source of uncertainty. Reliable and accurate occupant modeling with minimal uncertainty requires precise occupant information. This paper proposes a computer vision-based approach that uses deep learning architectures to estimate the number of people in large, crowded spaces using multiple cameras. Various vision techniques (head detection, background elimination, head tracking) are implemented in three methods: (i) a method that instantaneously counts people in a scene, (ii) a method that incrementally counts people entering or exiting a room, and (iii) a combination of the first two methods. These methods were applied in a classroom with heavy occlusions and showed high predictive capacity when compared with ground-truth measurements. Future work on video-analytic approaches can address lowering the computational cost of analysis, capturing occupancy data in complex room geometries, and preserving privacy.
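
To illustrate how an instantaneous scene count and an incremental entry/exit count might be combined, the following is a minimal sketch only. The abstract does not specify the fusion rule, so the weighted blend, the class name `OccupancyEstimator`, and the `blend` parameter are all hypothetical assumptions rather than the method used in the paper. The head detection and tracking stages are assumed to exist upstream and to supply plain integer counts.

```python
from dataclasses import dataclass


@dataclass
class OccupancyEstimator:
    """Hypothetical fusion of two counting strategies (not the paper's method)."""

    count: int = 0       # running occupancy estimate
    blend: float = 0.5   # assumed weight given to the instantaneous scene count

    def on_door_event(self, entered: int, exited: int) -> int:
        # Incremental method: update the running total from people
        # tracked crossing the doorway; never let it fall below zero.
        self.count = max(0, self.count + entered - exited)
        return self.count

    def on_scene_count(self, detected_heads: int) -> int:
        # Combined method (assumption): pull the running total toward the
        # number of heads detected in the current frame, which limits the
        # drift that accumulates from missed entry/exit events.
        fused = self.blend * detected_heads + (1.0 - self.blend) * self.count
        self.count = int(fused + 0.5)  # round half up for non-negative values
        return self.count
```

In this sketch, a larger `blend` trusts the per-frame head count more, which may suit scenes with little occlusion, while a smaller value leans on the doorway count when occlusion makes per-frame detection unreliable.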