JOURNAL OF COMPUTATIONAL SCIENCE, vol.94, 2026 (SCI-Expanded, Scopus)
The particle filter is a sequential Monte Carlo estimation method used in tracking applications in which the system or measurement model is highly nonlinear. Estimation quality improves as the number of particles increases, but so does the computational cost. Graphics processing units (GPUs) offer a promising solution for the particle filter because their architectures provide many cores. To implement the particle filter on the GPU, we use CUDA as the parallel computing platform. The GPU architecture must be carefully considered when determining the parameters of CUDA kernels; in particular, configuring the block size appropriately is essential for the efficient utilization of the streaming multiprocessors (SMXs). In this study, we investigate the impact of block size on SMX efficiency, particularly on GPUs where the number of SMXs is not a power of two. We propose three distinct scenarios based on different block size configurations and discuss in detail their characteristics and the resulting speedups. We conduct experiments on two GPU boards, the NVIDIA Tesla K20 and the NVIDIA Tesla K40. In addition, we demonstrate a multi-GPU approach for the particle filter using these boards and discuss the associated challenges and resulting speedups in detail.
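Why block size interacts with an SMX count that is not a power of two can be illustrated with a small calculation. The sketch below is not the method of the paper. It assumes a deliberately simplified model in which each SMX runs one block at a time and blocks are handed out in rounds, so it ignores multiple resident blocks per SMX, register and shared-memory limits, and scheduling details. The function name `sm_balance` and the particle count are hypothetical. The SMX counts (13 for the Tesla K20, 15 for the Tesla K40) come from NVIDIA's published specifications, not from the abstract.

```python
import math


def sm_balance(num_particles, block_size, num_sms):
    """Return (num_blocks, balance) under a simplified round-based model.

    Assumption: one thread per particle, one block per SMX per round.
    balance = 1.0 means every SMX is busy in every round; lower values
    mean some SMXs sit idle in the last round.
    """
    num_blocks = math.ceil(num_particles / block_size)
    rounds = math.ceil(num_blocks / num_sms)
    return num_blocks, num_blocks / (rounds * num_sms)


if __name__ == "__main__":
    num_particles = 4096  # hypothetical particle count
    # 13 and 15 are the K20 and K40 SMX counts; 16 is a power-of-two
    # reference value for comparison, not a real board in the study.
    for num_sms in (13, 15, 16):
        for block_size in (64, 128, 256, 512):
            blocks, balance = sm_balance(num_particles, block_size, num_sms)
            print(f"SMXs={num_sms:2d} block={block_size:3d} "
                  f"blocks={blocks:3d} balance={balance:.3f}")
```

With a power-of-two particle count and a power-of-two block size, the number of blocks is also a power of two, so it divides evenly across 16 multiprocessors but not across 13 or 15. Under this model, that mismatch is what leaves some SMXs idle in the final round.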