Fast and Energy-Efficient Polynomial Multiplication Using FFT, FFNT, and NTT on GPUs for Fully Homomorphic Encryption

Özcan A. Ş., TEZCAN C., Savaş E.

11th International Workshop on Arithmetic of Finite Fields, WAIFI 2026, Santander, İspanya, 3 - 05 Haziran 2026, cilt.16611 LNCS, ss.300-327, (Tam Metin Bildiri)

Yayın Türü: Bildiri / Tam Metin Bildiri
Cilt numarası: 16611 LNCS
Doi Numarası: 10.1007/978-3-032-27574-5_19
Basıldığı Şehir: Santander
Basıldığı Ülke: İspanya
Sayfa Sayıları: ss.300-327
Anahtar Kelimeler: Fast Fourier Transform, Fully Homomorphic Encryption, Graphics Processing Units, Negacyclic Fast Fourier Transform, Number Theoretic Transform, Polynomial Multiplication
Orta Doğu Teknik Üniversitesi Adresli: Evet

Özet

Many cryptographic schemes, such as fully homomorphic encryption (FHE), zero-knowledge proofs, and post-quantum cryptography, rely on fast multiplication of large polynomials. Efficient algorithms for computing the discrete Fourier transform, including the Fast Fourier Transform (FFT), Negacyclic FFT (FFNT), and Number Theoretic Transform (NTT), can be employed to accelerate these multiplications. In this paper, we introduce an improved version of the FFNT that eliminates pre- and post-processing steps, and we show that it provides competitive performance among FFT- and FFNT-based methods for polynomial multiplication on GPUs, including implementations based on the highly optimized CUDA library cuFFT. We then compare the performance of the proposed FFNT with that of the NTT when both are implemented in CUDA for GPU platforms. Our results show that FFNT can outperform NTT in both speed and energy efficiency on GPUs with sufficiently strong floating-point hardware for some FHE settings. In particular, for polynomial multiplication, the FFNT kernel achieves up to 60% higher energy efficiency than the NTT kernel on such devices. Furthermore, in a practical setting, our GPU-based implementation of torus FHE (TFHE) using FFNT for NAND gate bootstrapping can be up to 46% faster than its NTT-based counterpart when ample floating-point resources are available. These results suggest that the choice between NTT- and FFNT-based polynomial multiplication should be guided by the underlying GPU architecture and by the precision and efficiency requirements of the target FHE workload.