Most state-of-the-art approaches for Facial Action Unit (AU) detection rely on evaluating static frames, encoding a snapshot of heightened facial activity. In real-world interactions, however, facial expressions are more subtle and evolve over time, requiring AU detection models to learn temporal as well as spatial information. In this work, we focus on both spatial and spatio-temporal features, encoding the temporal evolution of facial AU activation. We propose the Action Unit Lifecycle-Aware Capsule Network (AULA-Caps), which performs AU detection using both frame-level and sequence-level features. At the frame level, the capsule layers of AULA-Caps learn spatial feature primitives to determine AU activations; at the sequence level, they learn temporal dependencies between contiguous frames by focusing on relevant spatio-temporal segments in the sequence. The learnt feature capsules are routed together such that the model learns to selectively focus on spatial or spatio-temporal information depending on the AU lifecycle. The proposed model is evaluated on two popular benchmarks, the BP4D and GFT datasets, achieving state-of-the-art results on both.
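For readers unfamiliar with capsule routing, the sketch below illustrates generic dynamic routing-by-agreement between capsule layers (in the style of Sabour et al.), the mechanism by which feature capsules such as those in AULA-Caps can be "routed together." It is a minimal, framework-free illustration in NumPy, not the paper's exact routing procedure; the shapes, iteration count, and function names are assumptions for exposition.

```python
import numpy as np

def squash(s, eps=1e-8):
    # Non-linearity for capsules: keeps the vector's direction,
    # maps its length into [0, 1) so length can encode activation probability.
    norm_sq = np.sum(s ** 2, axis=-1, keepdims=True)
    return (norm_sq / (1.0 + norm_sq)) * s / np.sqrt(norm_sq + eps)

def dynamic_routing(u_hat, iterations=3):
    """Route prediction vectors u_hat [num_in, num_out, dim] to output capsules.

    Hypothetical helper: a generic routing-by-agreement loop, not AULA-Caps'
    specific routing between spatial and spatio-temporal capsules.
    """
    num_in, num_out, _ = u_hat.shape
    b = np.zeros((num_in, num_out))                # routing logits, start uniform
    for _ in range(iterations):
        # Coupling coefficients: softmax of logits over the output capsules.
        c = np.exp(b - b.max(axis=1, keepdims=True))
        c = c / c.sum(axis=1, keepdims=True)
        s = np.einsum('io,iod->od', c, u_hat)      # weighted sum per output capsule
        v = squash(s)                              # output capsule activations
        b = b + np.einsum('iod,od->io', u_hat, v)  # increase logits where prediction
                                                   # and output agree (dot product)
    return v
```

Input capsules whose predictions agree end up dominating the corresponding output capsule, which is how a routing scheme can learn to favour spatial versus spatio-temporal evidence per output.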