After five years, more than 350,000 hours of genome sequencing, and over £200 million of investment, UK Biobank is releasing by far the world’s largest single set of human sequencing data, completing the most ambitious project of its kind ever undertaken. The new data, whole genome sequences of its half a million participants, are expected to drive the discovery of new diagnostics, treatments, and cures. Uniquely, the data are available to approved researchers worldwide via a protected database containing only de-identified data.
This advance lies not only in the abundance of genomic data, but also in its use in combination with the existing data UK Biobank has collected over the past 15 years on lifestyle, whole body imaging scans, health information, and proteins found in the blood. Results from the Pharma Proteomics Project were published last month in Nature in the paper “Plasma proteomic associations with genetics and health in the UK Biobank.”
Looking forward, these data could be used to further advance efforts such as pursuing more targeted drug discovery and development, discovering thousands of disease-causing noncoding genetic variants, accelerating precision medicine, and understanding the biological underpinnings of disease.
“This is a veritable treasure trove for approved scientists undertaking health research, and I expect it to have transformative results for diagnoses, treatments, and cures around the globe,” said Sir Rory Collins, FRS FMedSci, principal investigator at UK Biobank.
Roughly 20 years ago, UK Biobank recruited half a million volunteers to create the world’s most comprehensive source of health data. The new addition of sequencing data comes after a series of great leaps made using the vast UK Biobank biomedical database. These leaps include finding genes associated with protection against obesity and type 2 diabetes; identifying individuals at very high genetic risk for diseases such as heart disease, breast cancer, and prostate cancer; and finding a link between activity and Parkinson’s disease that allows smartwatch data to predict the disease up to seven years before diagnosis, potentially leading to early intervention. The new sequencing data will dramatically enhance the existing data’s potential.
The sequencing project was funded by Wellcome, UKRI, and four biopharmaceutical companies: Amgen, AstraZeneca, GSK, and Johnson & Johnson. In return for their significant investment, the industry members of the consortium receive nine months’ exclusive access to the data from UK Biobank. The DNA sequencing was completed by Amgen’s subsidiary, deCODE Genetics, and the Wellcome Sanger Institute, using Illumina NovaSeq technology, with deCODE providing additional informatics processing support.
The four pharmaceutical companies plan to publicly share their summary statistical analyses arising from the consortium collaboration, including genome-wide association results, providing the research community with highly valuable insights without the costly and time-consuming burden of analyzing raw data.
The data, along with the rest of UK Biobank’s de-identified data, are now globally accessible to approved researchers on the UK Biobank Research Analysis Platform, which is hosted on Amazon Web Services (AWS) in the London region and enabled by DNAnexus. Following completion of the sequencing, the industry consortium led efforts to process and joint-call the genomes using the DRAGEN pipeline on AWS infrastructure, enabling this vast volume of data to be transformed into a single combined genetic dataset by Illumina. These outputs further enhance the potential of the data to identify less frequent genetic variants and make them more comparable with other large-scale population health studies.