Design of Efficient AI Accelerator Building Blocks in Quantum-Dot Cellular Automata (QCA)

Mamdouh, Ahmed; Mjema, Mbonea; Yemiscioglu, Gurtac; Kondo, Satoshi; Muhtaroglu, Ali

doi:10.1109/jetcas.2022.3202043

IEEE JOURNAL ON EMERGING AND SELECTED TOPICS IN CIRCUITS AND SYSTEMS, vol.12, no.3, pp.703-712, 2022 (SCI-Expanded, Scopus)

Publication Type: Article / Full Article
Volume: 12 Issue: 3
Publication Date: 2022
DOI: 10.1109/jetcas.2022.3202043
Journal Name: IEEE JOURNAL ON EMERGING AND SELECTED TOPICS IN CIRCUITS AND SYSTEMS
Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus, Compendex, INSPEC
Page Numbers: pp.703-712
Keywords: Quantum cellular automata, quantum-dots, nanotechnology, integrated circuits, AI accelerators, SRAM, MOLECULAR-QCA
Affiliated with Middle East Technical University: Yes

Abstract

Digital circuit design technologies based on Quantum-Dot Cellular Automata (QCA) have many advantages over CMOS, such as higher intrinsic switching speed (up to the terahertz range), lower power consumption, smaller circuit footprint, and higher throughput owing to the compatibility of the inherent signal propagation scheme with pipelining. Hence, QCA is a perfect candidate to provide a circuit design framework for applications such as Artificial Intelligence (AI) accelerators, where real-time, energy-efficient performance needs to be delivered at low cost. A novel QCA design approach based on an optimal mix of Majority and NAND-NOR-INVERTER (NNI) gates with the USE (Universal, Scalable, Efficient) clocking scheme has been investigated in this work for latency and energy consumption improvements to fundamental building blocks in AI accelerators, including multipliers, adders, accumulators, and SRAMs. The common 4 x 4 Vedic multiplier has been redesigned using the proposed approach and simulated to yield a 62.8% reduction in cell count, an 82.2% reduction in area, and a 71.2% reduction in latency. For the proposed 8-bit PIPO register, an 83% reduction in cell count, a 94.5% reduction in area, and a 94.6% reduction in latency were simulated. The proposed SRAM cell design is estimated to have improvement figures similar to those achieved by its sub-blocks, such as the D-Latch, which has been simulated to exhibit a 44.4% reduction in cell count, a 50% reduction in both area and latency, and a 73% reduction in energy dissipation. The contributions from this work can be directly applied to low-cost, high-throughput, energy-efficient AI accelerators that can potentially deliver orders of magnitude better energy-delay characteristics than their CMOS counterparts, and significantly better energy-delay characteristics than state-of-the-art QCA implementations.
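
To make the building blocks named in the abstract concrete, the following is a minimal functional sketch of how a 3-input majority gate can express AND, OR, and a full adder, and how those can be composed into a 4 x 4 Vedic (Urdhva Tiryagbhyam) multiplier. It is an illustration under stated assumptions, not the paper's design: it models Boolean behavior only, so NNI gates, USE clocking, cell layout, area, latency, and energy are not represented, and all function names (`maj`, `full_adder`, `vedic2`, `vedic4`, `multiply4`) are hypothetical.

```python
# Functional (logic-level) sketch only; names and structure are assumptions,
# not taken from the paper. Bits are ints 0/1; bit lists are little-endian.

def maj(a, b, c):
    """3-input majority vote, the basic QCA logic primitive."""
    return (a & b) | (b & c) | (a & c)

def inv(a):
    """Inverter on a single bit."""
    return a ^ 1

def and2(a, b):
    """AND as a majority gate with one input fixed to 0."""
    return maj(a, b, 0)

def or2(a, b):
    """OR as a majority gate with one input fixed to 1."""
    return maj(a, b, 1)

def full_adder(a, b, cin):
    """Full adder from three majority gates and two inverters."""
    cout = maj(a, b, cin)
    s = maj(inv(cout), cin, maj(a, b, inv(cin)))
    return s, cout

def ripple_add(x_bits, y_bits):
    """Add two equal-length bit lists; result has one extra carry bit."""
    out, carry = [], 0
    for a, b in zip(x_bits, y_bits):
        s, carry = full_adder(a, b, carry)
        out.append(s)
    return out + [carry]

def to_bits(n, width):
    return [(n >> i) & 1 for i in range(width)]

def from_bits(bits):
    return sum(b << i for i, b in enumerate(bits))

def vedic2(a, b):
    """2 x 2 Vedic multiplier: vertical and crosswise partial products."""
    p0 = and2(a[0], b[0])
    s1, c1 = full_adder(and2(a[1], b[0]), and2(a[0], b[1]), 0)
    s2, c2 = full_adder(and2(a[1], b[1]), c1, 0)
    return [p0, s1, s2, c2]

def shift(bits, k, width):
    """Shift left by k positions and pad or truncate to a fixed width."""
    return ([0] * k + bits + [0] * width)[:width]

def vedic4(a, b):
    """4 x 4 Vedic multiplier composed of four 2 x 2 blocks plus adders."""
    a_lo, a_hi, b_lo, b_hi = a[:2], a[2:], b[:2], b[2:]
    terms = [
        shift(vedic2(a_lo, b_lo), 0, 8),
        shift(vedic2(a_hi, b_lo), 2, 8),
        shift(vedic2(a_lo, b_hi), 2, 8),
        shift(vedic2(a_hi, b_hi), 4, 8),
    ]
    acc = terms[0]
    for t in terms[1:]:
        # A 4-bit by 4-bit product fits in 8 bits, so the carry-out is dropped.
        acc = ripple_add(acc, t)[:8]
    return acc

def multiply4(x, y):
    """Multiply two integers in the range 0..15 via the gate-level model."""
    return from_bits(vedic4(to_bits(x, 4), to_bits(y, 4)))
```

For example, `multiply4(13, 11)` evaluates the product through majority-gate logic alone. In an actual QCA design, each `maj` call would correspond to a physical gate placed in a clock zone, which is where the cell count, area, and latency figures quoted above come from.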