In this paper, a multimodal video indexing and retrieval system, MMVIRS, is presented. MMVIRS models the auditory, visual, and textual sources of video collections from a semantic perspective. In addition to multimodality, our model is built on semantic hierarchies that enable access to video content at different semantic levels. MMVIRS has been implemented with data annotation, querying, and browsing components. In the annotation component, metadata and video semantics are extracted hierarchically. In the querying component, semantic, spatial, regional, temporal, and spatio-temporal queries are processed over video collections using the proposed model. In the browsing component, video collections are navigated using category information together with the visual and auditory hierarchies.
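To make the idea of hierarchy-based semantic access concrete, the following is a minimal sketch (not the paper's implementation; all names and the example labels are hypothetical) of annotating video segments with concepts arranged in a semantic hierarchy and retrieving segments by concept at any level:

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical sketch: each video segment carries a concept label from a
# semantic hierarchy (e.g. a "sports" scene containing a "goal" shot)
# together with its time interval in seconds.

@dataclass
class SemanticNode:
    concept: str                                  # concept label at this level
    start: float                                  # segment start time (s)
    end: float                                    # segment end time (s)
    children: List["SemanticNode"] = field(default_factory=list)

def find_segments(node: SemanticNode, concept: str) -> List[SemanticNode]:
    """Collect all segments annotated with `concept` at any hierarchy level."""
    hits = [node] if node.concept == concept else []
    for child in node.children:
        hits.extend(find_segments(child, concept))
    return hits

# Example: one video annotated as a "sports" scene with two child shots.
video = SemanticNode("sports", 0.0, 120.0, [
    SemanticNode("goal", 30.0, 45.0),
    SemanticNode("interview", 90.0, 120.0),
])

segments = find_segments(video, "goal")
print([(s.start, s.end) for s in segments])       # -> [(30.0, 45.0)]
```

A semantic query here descends the hierarchy, while a temporal query would instead filter the same nodes by their time intervals.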