Software reliability is an important factor in the success of a software project. Accurate modelling of software reliability enables estimation of the number of remaining defects, the timing of deployment and the future effort required. These estimates contribute to successful planning of project schedules and resources. A number of software reliability prediction models have been proposed, each with different assumptions regarding software defect introduction and discovery. The performance of these models varies with the properties of the software project to which they are applied. Model averaging and stacking techniques offer flexible approaches for combining the predictions of different models based on observed data. In this study, we use model stacking and averaging to combine the predictions of four well-known Non-Homogeneous Poisson Process (NHPP) software reliability models. These models make different assumptions about the failure rate, residual defects and the overall reliability of the software being investigated. We evaluate these techniques in simulated experiments and then apply them to defect data collected from four software projects with different characteristics. Our results show that stacking and averaging are robust, performing consistently well on both simulated and actual defect data, whereas the performance of individual prediction models varies between projects.