Is DNA the Data Storage of the Future?


Knowledge is power, and in the 21st century those who control the data control the world. We create 463 exabytes of data in a single day. You know those book-sized 10 terabyte hard drives? A single exabyte would fill 100,000 of them, so one day's data would fill a library of 46.3 million book-sized hard drives (check out this handy infographic on Raconteur). As mind-boggling as that number is, we are making more data than we can reasonably store. This presents a long-term storage and archiving problem, and biotech may have the solution.

DNA is Data

In each of our cells lie the instructions to make a whole human being. Those instructions are coded into our DNA molecules, long strings of chemical nucleotide bases represented by the letters A, T, G and C. When DNA sequencing allowed us to read these letters in the 1970s, a Japanese group suggested that an advanced civilization could have left a message for humanity in the genome of a common virus such as the phi X174 bacteriophage. No such message was found when the virus was sequenced, but the idea of using DNA to encode messages stuck. Could DNA be the data storage of the future?

Computers and organic cells have a lot in common. In a computer, information is encoded in strings of bits, 1s and 0s that are read to execute programs. In a cell, information is stored in the four nucleobase letters, which are read to produce proteins. Computer data is measured in bytes: there are eight bits in a byte, 1,000 bytes in a kilobyte, and so on. Remember how an exabyte fills a library's worth of book-sized hard drives? Now imagine that each letter of DNA represents two bits of information, where A = 00, T = 01, C = 10, and G = 11. Stored this way, an exabyte of data could fit in just a cubic millimeter of DNA.
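
The two-bits-per-letter idea is simple enough to sketch in a few lines of Python. This is only an illustration of the mapping given above (A = 00, T = 01, C = 10, G = 11); the function names `encode` and `decode` are made up for this example and do not come from any real DNA storage tool.

```python
# Mapping taken from the text: A = 00, T = 01, C = 10, G = 11.
BITS_TO_BASE = {"00": "A", "01": "T", "10": "C", "11": "G"}
BASE_TO_BITS = {base: bits for bits, base in BITS_TO_BASE.items()}


def encode(data: bytes) -> str:
    """Turn bytes into a string of DNA letters, two bits per letter."""
    bits = "".join(f"{byte:08b}" for byte in data)
    return "".join(BITS_TO_BASE[bits[i:i + 2]] for i in range(0, len(bits), 2))


def decode(dna: str) -> bytes:
    """Turn a string of DNA letters back into the original bytes."""
    bits = "".join(BASE_TO_BITS[base] for base in dna)
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


strand = encode(b"hello")
print(strand)          # four DNA letters per byte
print(decode(strand))  # round-trips back to b"hello"
```

Each byte becomes four letters, so the five-character word "hello" turns into a strand of 20 bases.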

Converting nucleotides to bits

Professor George Church at Harvard took the DNA data storage idea forward. In 2012, his team converted a 52,000-word book into strings of DNA. They proved the principle that DNA could store data, but they also discovered that their method limited the amount of information the DNA could store. In principle each nucleotide can hold two bits, but because some sequences are hard to synthesize and read, and because DNA can break and degrade, the practical limit is about 1.8 bits per nucleotide. Church’s group achieved less than half of this capacity with their early method.

In 2017, Dr Yaniv Erlich and Dr Dina Zielinski of the New York Genome Center made a breakthrough. Recognizing the limitations of DNA synthesis, they converted six files into strings of binary code and developed an algorithm called DNA Fountain to prepare the information for DNA coding. DNA Fountain randomly packaged the strings into “droplets”, DNA strings 200 nucleotides long. That length can be synthesized reliably, whereas longer strands tend to accumulate errors. The DNA strings were also flanked with tags to help reassemble the fragments. The digital DNA strands, 72,000 in total, were then sent to be synthesized.
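
To give a feel for how "droplets" can work, here is a toy fountain code in Python. It is a sketch of the general technique, not Erlich and Zielinski's implementation: the function names, the segment size and the simple 1-to-3 degree choice are all assumptions made for illustration, and the steps that turn droplets into nucleotides and add tags are left out. Each droplet carries a seed plus the XOR of a few randomly chosen data segments, and the decoder uses the seed to work out which segments were mixed.

```python
import random


def split_segments(data: bytes, size: int) -> list[bytes]:
    """Cut data into equal-sized segments, zero-padding the last one."""
    padded = data + bytes(-len(data) % size)
    return [padded[i:i + size] for i in range(0, len(padded), size)]


def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def pick_indices(seed: int, n: int) -> set[int]:
    """Re-derive from the seed alone which segments a droplet combines."""
    rng = random.Random(seed)
    degree = rng.randint(1, min(3, n))  # toy choice, an assumption
    return set(rng.sample(range(n), degree))


def make_droplet(seed: int, segments: list[bytes]) -> tuple[int, bytes]:
    """A droplet is its seed plus the XOR of the segments the seed selects."""
    payload = bytes(len(segments[0]))
    for i in pick_indices(seed, len(segments)):
        payload = xor(payload, segments[i])
    return seed, payload


def decode(droplets: list[tuple[int, bytes]], n: int):
    """Peel droplets: any droplet with one unknown segment reveals it."""
    pending = [(pick_indices(seed, n), payload) for seed, payload in droplets]
    solved: dict[int, bytes] = {}
    progress = True
    while progress and len(solved) < n:
        progress = False
        remaining = []
        for indices, payload in pending:
            # Remove every already-known segment from this droplet.
            for i in [i for i in indices if i in solved]:
                payload = xor(payload, solved[i])
            indices = {i for i in indices if i not in solved}
            if len(indices) == 1:
                solved[next(iter(indices))] = payload
                progress = True
            elif indices:
                remaining.append((indices, payload))
        pending = remaining
    # None means there were not enough useful droplets to finish.
    return [solved[i] for i in range(n)] if len(solved) == n else None


data = b"DNA is data!"
segments = split_segments(data, 4)
droplets = [make_droplet(seed, segments) for seed in range(200)]
survivors = droplets[::2]  # pretend half of the droplets were lost
recovered = decode(survivors, len(segments))
```

The point of the design is that no single droplet is essential: as long as enough of them survive synthesis and sequencing, the original segments can be rebuilt.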

Twist Bioscience, a leading large-scale DNA synthesis company, synthesized the DNA and sent the fragments back two weeks later. Erlich and Zielinski had the DNA sequenced, and a computer program converted the sequences back into binary, using the tags as a guide for reassembly. The result was perfect. Erlich estimated that their approach encoded 1.6 bits of information per nucleotide.

The Industry Approach

Storing data in DNA requires a lot of DNA, the synthesis of which is traditionally neither easy nor cheap. Twist Bioscience developed a scaled-up approach to DNA synthesis that is better suited to meeting demand for DNA data storage. Microsoft and Twist partnered to set a record for DNA data storage of 200 megabytes in 2016. More recently, Microsoft and the University of Washington demonstrated a completely automated system to store and retrieve DNA data (in this case the word ‘hello’), bringing the technology a step closer to its application in data centers.

These approaches use DNA bases to store information in strings, like strings of bits in a computer. However, they are still prohibitively expensive at current DNA synthesis costs. DNA is not infallible either. If a base is missed, either during assembly or when reading the DNA strand, the data can become corrupted. If the technology is to be developed for reading and writing information as easily as computers do, then these issues will need to be addressed. Fortunately, because DNA is a natural data storage system for our genetic blueprint, nature has evolved a range of protective measures to keep our DNA in order, and these have inspired a new approach to DNA data storage.
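
To see why a single missing base does so much damage, here is a small Python sketch. It assumes the simple two-bits-per-letter mapping described earlier (A = 00, T = 01, C = 10, G = 11), and the `decode` helper is hypothetical. The sketch compares a misread base with a lost base in the same position.

```python
BASE_TO_BITS = {"A": "00", "T": "01", "C": "10", "G": "11"}


def decode(dna: str) -> bytes:
    """Read DNA letters as bits and return whole bytes only."""
    bits = "".join(BASE_TO_BITS[base] for base in dna)
    usable = len(bits) - len(bits) % 8  # drop a trailing incomplete byte
    return bytes(int(bits[i:i + 8], 2) for i in range(0, usable, 8))


intact = "TCCATCTTTCGATCGATCGG"           # "hello" under the mapping above
misread = intact[:5] + "A" + intact[6:]  # one base read as the wrong letter
deleted = intact[:5] + intact[6:]        # the same base lost altogether

print(decode(intact))   # the original message
print(decode(misread))  # only one character is wrong
print(decode(deleted))  # everything after the gap is shifted and garbled
```

A misread base damages a single byte, but a lost base shifts every letter that follows it, which is why real schemes add tags and error correction rather than relying on one long, unbroken strand.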

Earlier this year, DNA data storage company Catalog smashed Microsoft’s record in DNA data storage by coding all of Wikipedia in English into DNA. That’s 16 gigabytes of data. They did this by taking a completely different look at how DNA could store data. Rather than mapping each DNA letter to two bits of data, Catalog uses groups of several DNA letters, which it calls “identifiers”, to represent different combinations of bits. These are stored stably and can be rearranged, acting like movable type rather than simply encoding long data strings. Researchers can assemble these identifiers in the order they need to encode data, allowing them to write DNA data at a rate of 4 megabits per second.

We won’t have DNA-based computers just yet. The key to improving the technology is bringing down the cost of DNA synthesis and reading through automation. Writing and reading DNA is currently a slow process, but it may yet be useful for archiving data and making long-lasting backups. DNA is structurally suited to storing information for extended periods of time, given its half-life of 521 years. Perhaps we can build a DNA time capsule containing all of humanity’s current knowledge, including cat pictures, and blast it into space or bury it on Mars for future generations to find.
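
As a rough guide to what a 521-year half-life means for an archive, the sketch below treats degradation as simple exponential decay. That is an assumption made for illustration, since real degradation depends on storage conditions, and the function name `fraction_intact` is made up for this example.

```python
HALF_LIFE_YEARS = 521  # figure quoted in the text


def fraction_intact(years: float, half_life: float = HALF_LIFE_YEARS) -> float:
    """Fraction expected to remain after `years`, assuming exponential decay."""
    return 0.5 ** (years / half_life)


for years in (521, 1042, 5210):
    print(f"{years:>5} years: {fraction_intact(years):.4f} intact")
```

After one half-life, half remains; after two, a quarter. Any real time capsule would therefore need many redundant copies, or very kind storage conditions, to stay readable.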
