Protecting the Future of Information: LOCO Coding With Error Detection for DNA Data Storage

İRİMAĞZI, CANBERK; Uslan, Yusuf; HAREEDY, AHMED

doi:10.1109/tmbmc.2024.3400794

Protecting the Future of Information: LOCO Coding With Error Detection for DNA Data Storage

Atıf İçin Kopyala

İRİMAĞZI C., Uslan Y., HAREEDY A.

IEEE Transactions on Molecular, Biological, and Multi-Scale Communications, cilt.10, sa.2, ss.317-333, 2024 (ESCI)

Yayın Türü: Makale / Tam Makale
Cilt numarası: 10 Sayı: 2
Basım Tarihi: 2024
Doi Numarası: 10.1109/tmbmc.2024.3400794
Dergi Adı: IEEE Transactions on Molecular, Biological, and Multi-Scale Communications
Derginin Tarandığı İndeksler: Emerging Sources Citation Index (ESCI), Scopus, Compendex, INSPEC
Sayfa Sayıları: ss.317-333
Anahtar Kelimeler: balancing, Constrained codes, DNA data storage, error-detection, homopolymer run, LOCO codes, low-complexity algorithms, reconfigurable coding
Orta Doğu Teknik Üniversitesi Adresli: Evet

Özet

From the information-theoretic perspective, DNA strands serve as a storage medium for 4-ary data over the alphabet A,T,G,C. DNA data storage promises formidable information density, long-term durability, and ease of replicability. However, information in this intriguing storage technology might be corrupted because of error-prone data sequences as well as insertion, deletion, and substitution errors. Experiments have revealed that DNA sequences with long homopolymers and/or with low GC-content are notably more subject to errors upon storage. In order to address this biochemical challenge, constrained codes are proposed for usage in DNA data storage systems, and they are studied in the literature accordingly. This paper investigates the utilization of the recently-introduced method for designing lexicographically-ordered constrained (LOCO) codes in DNA data storage to improve performance. LOCO codes offer capacity-achievability, low complexity, and ease of reconfigurability. This paper introduces novel constrained codes, namely DNA LOCO (D-LOCO) codes, over the alphabet A,T,G,C with limited runs of identical symbols. Due to their ordered structure, these codes come with an encoding-decoding rule we derive, which provides simple and affordable encoding-decoding algorithms. In terms of storage overhead, the proposed encoding-decoding algorithms outperform those in the existing literature. Our algorithms are based on small-size adders, and therefore they are readily reconfigurable. D-LOCO codes are intrinsically balanced, which allows us to achieve balanced AT-and GC-content over the entire DNA strand with minimal rate penalty. Moreover, we propose four schemes to bridge consecutive codewords, three of which guarantee single substitution error detection per codeword. We examine the probability of undetecting errors over a presumed symmetric DNA storage channel subject to substitution errors only. We also show that D-LOCO codes are capacity-achieving and that they offer remarkably high rates even at moderate lengths.