11th International Workshop on Arithmetic of Finite Fields, WAIFI 2026, Santander, İspanya, 3 - 05 Haziran 2026, cilt.16611 LNCS, ss.300-327, (Tam Metin Bildiri)
Many cryptographic schemes, such as fully homomorphic encryption (FHE), zero-knowledge proofs, and post-quantum cryptography, rely on fast multiplication of large polynomials. Efficient algorithms for computing the discrete Fourier transform, including the Fast Fourier Transform (FFT), Negacyclic FFT (FFNT), and Number Theoretic Transform (NTT), can be employed to accelerate these multiplications. In this paper, we introduce an improved version of the FFNT that eliminates pre- and post-processing steps, and we show that it provides competitive performance among FFT- and FFNT-based methods for polynomial multiplication on GPUs, including implementations based on the highly optimized CUDA library cuFFT. We then compare the performance of the proposed FFNT with that of the NTT when both are implemented in CUDA for GPU platforms. Our results show that FFNT can outperform NTT in both speed and energy efficiency on GPUs with sufficiently strong floating-point hardware for some FHE settings. In particular, for polynomial multiplication, the FFNT kernel achieves up to 60% higher energy efficiency than the NTT kernel on such devices. Furthermore, in a practical setting, our GPU-based implementation of torus FHE (TFHE) using FFNT for NAND gate bootstrapping can be up to 46% faster than its NTT-based counterpart when ample floating-point resources are available. These results suggest that the choice between NTT- and FFNT-based polynomial multiplication should be guided by the underlying GPU architecture and by the precision and efficiency requirements of the target FHE workload.