Language acquisition unfolds in inherently multimodal contexts, where communication is expressed and perceived through diverse channels embedded in social interaction. For hearing children, this involves integrating speech with gesture; for deaf children, language develops through fully visual modalities. These observations call for a shift from speech-centric models to a holistic framework that gives equal weight to all modalities, in spoken and signed languages alike. Such a framework must account not only for the multimodal scaffolding of input and interaction but also for individual and contextual diversity, including the cultural and cognitive variability children bring to language learning. Responding to commentaries on our target article, this paper refines and expands the multimodal language framework, emphasizing its capacity to integrate the interactive richness of input with the heterogeneous contexts and individual variation that shape language acquisition.