This paper is aimed at creating extremely small and fast convolutional neural networks (CNN) for the problem of facial expression recognition (FER) from frontal face images. To this end, we employed the popular knowledge distillation (KD) method and identified two major shortcomings with its use: 1) a fine-grained grid search is needed for tuning the temperature hyperparameter and 2) to find the optimal size-accuracy balance, one needs to search for the final network size (or the compression rate). On the other hand, KD is proved to be useful for model compression for the FER problem, and we discovered that its effects get more and more significant with decreasing model size. In addition, we hypothesized that translation invariance achieved using max-pooling layers would not be useful for the FER problem as the expressions are sensitive to small, pixel wise changes around the eye and the mouth. However, we have found an intriguing improvement in generalization when max-pooling is used. We conducted experiments on two widely used FER datasets, CK+ and Oulu-CASIA. Our smallest model (MicroExpNet), obtained using knowledge distillation, is less than 1MB in size and works at 1851 frames per second on an Intel i7 CPU. Despite being less accurate than the state-of-the-art, MicroExpNet still provides significant insights for designing a microarchitecture for the FER problem.