First Fully Pipelined High Throughput FPGA Implementation and GPU Optimization of Wider Variant of AES


Malal A., TEZCAN C.

Journal of Cryptographic Engineering, vol.16, no.1, 2026 (SCI-Expanded, Scopus)

  • Publication Type: Article / Full Article
  • Volume: 16 Issue: 1
  • Publication Date: 2026
  • DOI Number: 10.1007/s13389-025-00388-2
  • Journal Name: Journal of Cryptographic Engineering
  • Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus, Compendex, INSPEC
  • Keywords: FPGA Implementation, GPU Optimization, High Throughput, Parallel Processing, Wider-AES
  • Middle East Technical University Affiliated: Yes

Abstract

In response to the recent NIST call for a wider variant of the AES algorithm, we developed a fully pipelined, high-throughput FPGA implementation of the 256-bit block size AES, referred to as WAES-256. This design targets both 7th generation and UltraScale+ FPGAs, focusing on maximizing throughput and efficient hardware utilization. Our work supports AES-128, AES-256, and WAES-256, employing composite field arithmetic in the S-box to reduce critical path delay. All AES layers are fully pipelined, enabling multiple levels of parallelism with minimal architectural changes. Our AES-128 implementations achieved the best throughput-per-slice (TPS) ratios reported in the literature for fair comparisons on the same FPGA platforms. For WAES-256, our designs reached 75.73 Gbps on Spartan-7, 72.32 Gbps on Artix-7, 199.46 Gbps on Zynq UltraScale+, and 206.11 Gbps on Kintex UltraScale+. Additionally, our multi-core parallel WAES-256 designs achieved 426.66 Gbps with x2 cores and 742.63 Gbps with x4 cores on the Kintex UltraScale+ platform. These results highlight the efficiency and scalability of our architectures, which offer high-throughput performance without relying on BRAM (Block Random Access Memory), making them well-suited for next-generation cryptographic applications. Moreover, we optimized WAES-256 on GPUs and achieved performance comparable to the best AES-256 results. For instance, we achieved 3053.5 Gbps WAES-256 encryption in counter mode of operation on an RTX 4090. Our results show that using FPGAs or GPUs as co-processors for WAES-256 renders encryption essentially free, and that the transition from AES-256 to WAES-256 results in no observable slowdown.
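The throughput figures above can be put in context with a little arithmetic. The following is a minimal sketch, not taken from the paper: it assumes that a fully pipelined core accepts one 256-bit block per clock cycle, so that throughput equals block size times clock frequency. The abstract does not report clock frequencies, so the values this script derives are back-of-the-envelope estimates under that assumption, and the function names are illustrative only.

```python
# Back-of-the-envelope arithmetic on the throughput figures quoted in the abstract.
# ASSUMPTION (not stated in the abstract): each fully pipelined core accepts one
# 256-bit block per clock cycle, so throughput = block_bits * clock_frequency.

BLOCK_BITS = 256  # WAES-256 block size, as stated in the abstract

# Single-core WAES-256 throughputs in Gbps, as reported in the abstract.
SINGLE_CORE_GBPS = {
    "Spartan-7": 75.73,
    "Artix-7": 72.32,
    "Zynq UltraScale+": 199.46,
    "Kintex UltraScale+": 206.11,
}

# Multi-core WAES-256 throughputs in Gbps on Kintex UltraScale+ (core count -> Gbps).
MULTI_CORE_GBPS = {2: 426.66, 4: 742.63}


def implied_clock_mhz(throughput_gbps, block_bits=BLOCK_BITS):
    """Clock frequency (MHz) implied by a throughput under the one-block-per-cycle assumption."""
    return throughput_gbps * 1e9 / block_bits / 1e6


def multicore_speedup(cores):
    """Ratio of the reported multi-core throughput to the single-core Kintex UltraScale+ figure."""
    return MULTI_CORE_GBPS[cores] / SINGLE_CORE_GBPS["Kintex UltraScale+"]


if __name__ == "__main__":
    for platform, gbps in SINGLE_CORE_GBPS.items():
        print(f"{platform}: {gbps} Gbps -> ~{implied_clock_mhz(gbps):.1f} MHz implied")
    for cores in MULTI_CORE_GBPS:
        print(f"x{cores} cores: {multicore_speedup(cores):.2f}x the single-core throughput")
```

The speedup ratio only compares reported numbers; the multi-core designs may run at a different clock frequency than the single-core design, so the ratio should not be read as a per-core efficiency measurement.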