Most mobile applications require efficient and precise computation of the device pose, and almost every mobile device is already equipped with inertial sensors alongside a camera. This makes sensor fusion an attractive approach for efficient pose tracking. However, state-of-the-art fusion algorithms share a major shortcoming: the uncertainty introduced to the system during the prediction stage of the fusion filters is not well defined. As a result, covariances are determined heuristically, which in turn requires data-dependent tuning to achieve high performance, or even convergence, of these filters. In this paper, we propose an inertially-aided visual odometry system that requires neither heuristics nor parameter tuning; the required uncertainties of all estimated variables are computed under a minimal number of assumptions. Moreover, the proposed system simultaneously estimates the metric scale of the pose computed from a monocular image stream. Experimental results indicate that the proposed scale estimation outperforms state-of-the-art methods, while the pose estimation yields acceptable results in real time on resource-constrained systems.