1. Data Storage
You’ve probably seen the plot of Moore’s Law compared to sequencing throughput. In short, the cost of DNA sequencing has plummeted much faster than the cost of disk storage and CPU. A run on the Illumina HiSeq2000 provides enough capacity for about 48 human exomes. Even if you don’t keep the images, each exome requires about 10 gigabytes of disk space to store the bases, qualities, and alignments in compressed (BAM) format. That works out to roughly 480 gigabytes per run, so at three runs a month, each instrument generates about 1.4 terabytes of data files every month. It adds up quickly.
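The arithmetic behind that estimate is easy to redo for your own setup. Here is a minimal back-of-the-envelope sketch in Python; the three default values are the approximate figures quoted above, while the function names, the `instruments` parameter, and the 1 TB = 1000 GB convention are assumptions made purely for illustration.

```python
# Back-of-the-envelope storage estimate using the approximate figures
# quoted in the text. Function names and the 1 TB = 1000 GB convention
# are illustrative assumptions, not part of any real tool.

EXOMES_PER_RUN = 48   # approximate HiSeq2000 capacity per run
GB_PER_EXOME = 10     # bases, qualities, and alignments in BAM, no images
RUNS_PER_MONTH = 3


def monthly_storage_gb(exomes_per_run=EXOMES_PER_RUN,
                       gb_per_exome=GB_PER_EXOME,
                       runs_per_month=RUNS_PER_MONTH):
    """Gigabytes of data files one instrument produces per month."""
    return exomes_per_run * gb_per_exome * runs_per_month


def yearly_storage_tb(instruments=1, gb_per_tb=1000):
    """Terabytes produced per year by a (hypothetical) number of instruments."""
    return monthly_storage_gb() * 12 * instruments / gb_per_tb


if __name__ == "__main__":
    gb = monthly_storage_gb()
    print(f"Per instrument per month: {gb} GB (~{gb / 1000:.2f} TB)")
    print(f"Per instrument per year:  {yearly_storage_tb():.2f} TB")
```

Note that this counts only the primary data files. Space for downstream analysis would come on top of it, and the source gives no figure for that, so it is left out here.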
Analysis of sequencing data (variant calling, annotation, expression analysis, genetic analysis) also requires disk space. Most non-BGI research budgets are finite, so investigators must choose among (1) deleting data, (2) spending money, or (3) holding up data production/analysis. None of those sounds very appealing, does it?