Could you imagine storing all of your digital photos, audio, documents, and other files as DNA? Scientists are already demonstrating how writing image and text files in DNA could in principle revolutionize how we store and archive data. The technology still has a long way to go before it might become a reality for everyday use, and finding a practicable way of identifying and retrieving files from potentially massive archived stores is a key challenge.

Scientists at the Massachusetts Institute of Technology (MIT), the Broad Institute of MIT and Harvard, and the Koch Institute for Integrative Cancer Research at MIT have now developed a new technique for labeling and retrieving DNA data files from large pools. The new approach doesn’t rely on conventional PCR to find and amplify the files. Rather, the technique involves encapsulating DNA files into DNA-barcoded silica particles.

“We need new solutions for storing these massive amounts of data that the world is accumulating, especially the archival data,” suggested Mark Bathe, PhD, an MIT professor of biological engineering, and an associate member of the Broad Institute of MIT and Harvard. “DNA is a thousandfold denser than even flash memory, and another property that’s interesting is that once you make the DNA polymer, it doesn’t consume any energy. You can write the DNA and then store it forever.”

Bathe is senior author of the team’s published paper in Nature Materials, titled, “Random access DNA memory using Boolean search in an archival file storage system.” Lead authors of the paper are MIT senior postdoc James Banal, PhD, former MIT research associate Tyson Shepherd, PhD, and MIT graduate student Joseph Berleant.

On Earth right now, there are about 10 trillion gigabytes of digital data, and every day, humans produce emails, photos, tweets, and other digital files that add up to another 2.5 million gigabytes of data. Much of this data is stored in enormous facilities known as exabyte data centers (an exabyte is 1 billion gigabytes), which can be the size of several football fields and cost around $1 billion to build and maintain.

Many scientists believe that an alternative solution lies in the DNA molecule that contains our genetic information. After all, DNA has evolved to store massive quantities of information at very high density. A coffee mug full of DNA could theoretically store all of the world’s data, Bathe suggested.

Scientists have already demonstrated that they can encode images and pages of text as DNA. However, an easy way to pick out the desired file from a mixture of many pieces of DNA will also be needed. Bathe and his colleagues have now demonstrated one way to do that, by encapsulating each data file into a 6-μm particle of silica, which is labeled with short DNA sequences that reveal the contents. Using this approach, the researchers demonstrated that they could accurately pull out individual images stored as DNA sequences from a set of 20 images. Given the number of possible labels that could be used, this approach could scale up to 10^20 files.

Digital storage systems encode text, photos, or any other kind of information as a series of 0s and 1s. This same information can be encoded in DNA using the four nucleotides that make up the genetic code: A, T, G, and C. For example, G and C could be used to represent 0 while A and T represent 1. In fact, the authors pointed out, “While DNA is the polymer selected by evolution for the storage and transmission of genetic information in biology, it can also be used for the storage of arbitrary digital information at densities far exceeding conventional data storage technologies such as flash and tape memory, at scales well beyond the capacity of the largest existing data centers.”
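The example mapping above can be sketched in a few lines of Python. This is only an illustration of the one-bit-per-base scheme mentioned in the text (G or C for 0, A or T for 1), not the encoding the MIT team used; the function names and the random choice between the two bases for each bit are assumptions made for the sketch.

```python
import random

# Assumed mapping, taken from the article's example: G or C encodes 0, A or T encodes 1.
ZERO_BASES = "GC"
ONE_BASES = "AT"


def text_to_bits(text):
    """Convert text to a string of '0'/'1' characters, 8 bits per UTF-8 byte."""
    return "".join(f"{byte:08b}" for byte in text.encode("utf-8"))


def bits_to_text(bits):
    """Inverse of text_to_bits."""
    data = bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))
    return data.decode("utf-8")


def encode_bits(bits, rng=None):
    """Write each bit as one nucleotide, picking either allowed base at random."""
    rng = rng or random.Random(0)  # fixed seed so the sketch is repeatable
    return "".join(
        rng.choice(ONE_BASES if bit == "1" else ZERO_BASES) for bit in bits
    )


def decode_bases(dna):
    """Read each nucleotide back as one bit."""
    return "".join("1" if base in ONE_BASES else "0" for base in dna)


dna = encode_bits(text_to_bits("cat"))
print(dna, "->", bits_to_text(decode_bases(dna)))
```

Because two bases stand for each bit value, this scheme stores one bit per nucleotide. Schemes that give each of the four bases its own two-bit value can store up to two bits per nucleotide.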

DNA has several other features that make it desirable as a storage medium. It is extremely stable, and it is fairly easy (although currently expensive) to synthesize and sequence. Also, because of its high density—each nucleotide, equivalent to up to two bits, is about 1 cubic nanometer—an exabyte of data stored as DNA could fit in the palm of a hand.
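The density figures quoted above can be checked with a back-of-the-envelope calculation. The sketch below assumes the ideal case of two bits per nucleotide and 1 cubic nanometer per nucleotide, and it ignores redundancy, barcodes, and any packaging such as silica capsules, so it gives a lower bound on volume rather than a practical estimate. The constant and function names are illustrative.

```python
BITS_PER_NUCLEOTIDE = 2      # upper bound quoted in the article
NM3_PER_NUCLEOTIDE = 1.0     # approximate volume of one nucleotide
NM3_PER_MM3 = 1e18           # 1 mm = 1e6 nm, so 1 mm^3 = 1e18 nm^3
EXABYTE_BYTES = 10**18       # 1 billion gigabytes


def dna_volume_mm3(n_bytes):
    """Idealized volume, in cubic millimeters, of DNA holding n_bytes of data."""
    nucleotides = n_bytes * 8 / BITS_PER_NUCLEOTIDE
    return nucleotides * NM3_PER_NUCLEOTIDE / NM3_PER_MM3


print(dna_volume_mm3(EXABYTE_BYTES))  # 4.0 cubic millimeters
```

Under these assumptions an exabyte occupies only a few cubic millimeters of raw DNA, which leaves a wide margin for the overhead a real storage system would add.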

Scientists have already demonstrated the viability of using DNA as a general information storage medium for the storage and retrieval of books, images, computer programs, audio clips, works of art, and Shakespeare’s sonnets, using different encoding schemes, the team reported. In fact, the size of data is limited primarily by the cost of DNA synthesis, and this is one key obstacle to this kind of data storage. Currently, it would cost $1 trillion to write one petabyte of data (1 million gigabytes), which means, Bathe estimates, that to become competitive with magnetic tape, which is often used to store archival data, the cost of DNA synthesis would need to drop by about six orders of magnitude. He anticipates that this will happen within a decade or two, similar to how the cost of storing information on flash drives has dropped dramatically over the past couple of decades. “Recent progress in nucleic acid synthesis and sequencing technologies continues to reduce the cost of writing and reading DNA, foreshadowing future commercially competitive DNA-based information storage,” the authors further pointed out.

Aside from the cost, the other major bottleneck in using DNA to store data is the difficulty in picking out the file you want from all the others. “Assuming that the technologies for writing DNA get to a point where it’s cost effective to write an exabyte or zettabyte of data in DNA, then what?” Bathe said. “You’re going to have a pile of DNA, which is a gazillion files, images or movies and other stuff, and you need to find the one picture or movie you’re looking for. It’s like trying to find a needle in a haystack.”

DNA files are conventionally retrieved using PCR. Each DNA data file includes a sequence that binds to a particular PCR primer. To pull out a specific file, that primer is added to the sample to find and then amplify the desired sequence. However, one drawback to this approach is that there can be crosstalk between the primer and off-target DNA sequences, leading unwanted files to be pulled out. Also, the PCR retrieval process requires enzymes and ends up consuming most of the DNA that was in the pool. “You’re kind of burning the haystack to find the needle, because all the other DNA is not getting amplified and you’re basically throwing it away,” Bathe said.

As an alternative approach, the MIT team developed a retrieval technique that involves encapsulating each DNA file into a small silica particle. Each capsule is labeled with single-stranded DNA “barcodes” that correspond to the contents of the file. “As an alternative to PCR-based approaches, here we introduce a direct random access memory approach that retrieves specific files, or arbitrary subsets of files, directly using physical sorting, without a need for amplification, and without any potential for barcode–memory crosstalk, while also preserving non-selected files intact by recycling them into the original memory pool,” the investigators explained.

To demonstrate their approach in a cost-effective manner, the researchers encoded 20 different images into pieces of DNA about 3,000 nucleotides long, which is equivalent to about 100 bytes. (They also showed that the capsules could fit DNA files up to a gigabyte in size.) The files were each tagged with barcodes corresponding to labels such as “cat” or “airplane.” When the researchers wanted to pull out a specific image, they removed a sample of the DNA and added primers that corresponded to the labels they were looking for—for example, “cat,” “orange,” and “wild” for an image of a tiger, or “cat,” “orange,” and “domestic” for a housecat.

The primers were labeled with fluorescent or magnetic particles, making it easy to pull out and identify any matches from the sample. This allowed the desired file to be removed while leaving the rest of the DNA intact to be put back into storage. The retrieval process also allows Boolean logic statements such as “president AND 18th century” to generate George Washington as a result, similar to what is retrieved with a Google image search. “Downstream file selection may then be optical, physical or biochemical, with sequencing-based read-out following de-encapsulation of the memory DNA from the silica capsule,” the team continued.
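The logic of a label-based AND search can be sketched in software terms. The following is a conceptual analogy only, not the team's laboratory protocol: each capsule is modeled as a set of labels, a query selects the capsules that carry every requested label, and everything else is returned untouched to the pool. The file names and the select function are hypothetical; the labels come from the article's examples.

```python
# Hypothetical pool: file name -> set of barcode labels on its capsule.
POOL = {
    "tiger.jpg": {"cat", "orange", "wild"},
    "housecat.jpg": {"cat", "orange", "domestic"},
    "airplane.jpg": {"airplane"},
}


def select(pool, required_labels):
    """Split the pool into capsules matching ALL required labels and the rest.

    Mirrors the idea that non-selected files stay intact and go back to storage.
    """
    required = set(required_labels)
    selected = {name: labels for name, labels in pool.items() if required <= labels}
    remaining = {name: labels for name, labels in pool.items() if name not in selected}
    return selected, remaining


hits, rest = select(POOL, ["cat", "orange", "wild"])
print(sorted(hits))  # ['tiger.jpg']
print(sorted(rest))  # ['airplane.jpg', 'housecat.jpg']
```

Dropping a label from the query widens the result: searching for "cat" AND "orange" alone would match both the tiger and the housecat in this toy pool.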

For their barcodes, the researchers used single-stranded DNA sequences from a library of 100,000 sequences, each about 25 nucleotides long, developed by Stephen Elledge, PhD, a professor of genetics and medicine at Harvard Medical School. If you put two of these labels on each file, you can uniquely label 10^10 (10 billion) different files, and with four labels on each, you can uniquely label 10^20 files.
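The scaling figures follow from simple counting. The short sketch below assumes that each label position on a file can independently hold any of the 100,000 library sequences, which is the counting that reproduces the numbers quoted in the text; the function name is illustrative.

```python
LIBRARY_SIZE = 100_000  # barcode sequences in the library


def label_space(labels_per_file, library_size=LIBRARY_SIZE):
    """Number of distinct label assignments with independent label positions."""
    return library_size ** labels_per_file


print(label_space(2) == 10**10)  # True: 10 billion files with two labels
print(label_space(4) == 10**20)  # True: 10^20 files with four labels
```

If the order of labels on a capsule carries no meaning and labels cannot repeat, the count is smaller (about 5 billion for two labels), so the quoted figures are best read as an upper bound on addressable files.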

“At the current state of our proof-of-concept, we’re at the 1 kilobyte per second search rate,” Banal acknowledged. “Our file system’s search rate is determined by the data size per capsule, which is currently limited by the prohibitive cost to write even 100 megabytes worth of data on DNA, and the number of sorters we can use in parallel. If DNA synthesis becomes cheap enough, we would be able to maximize the data size we can store per file with our approach.” Bathe envisions that this kind of DNA encapsulation could be useful for storing “cold” data, which is data that is kept in an archive and not accessed very often.

In their paper, the team noted that file protection by silica encapsulation offers “… millennium-scale storage of immutable data, such as astronomical image databases, high-energy physics datasets, or high-resolution deep ocean floor mapping.” They also pointed out that because the system is not limited to synthetic DNA, it could be used for compact and energy-efficient long-term storage of bacterial, human, and other genomes for archival sample preservation and retrieval. Bathe commented, “While it may be a while before DNA is viable as a data storage medium, there already exists a pressing need today for low-cost, massive storage solutions for preexisting DNA and RNA samples from COVID-19 testing, human genomic sequencing, and other areas of genomics.”

The authors acknowledged that technical limitations with the system will still need to be overcome. Nevertheless, they concluded, “Our file system overcomes several challenges associated with preexisting PCR-based file systems, including obviating the need for numerous heating and cooling cycles and enzymatic synthesis, and eliminating nonspecific crosstalk between file sequences and barcodes, while enabling arbitrary Boolean logical search queries.”

Bathe’s lab is spinning out a startup, Cache DNA, that is now developing technology for long-term storage of DNA, both for DNA data storage in the long term and for clinical and other preexisting DNA samples in the near term.