It might seem far-reaching to say that big data analysis can fundamentally impact patient outcomes around cancer and other illnesses, and that it has the power to ultimately transform health services and indeed society at large, but that’s the precise goal behind the 100,000 Genomes Project from Genomics England.
For background, Genomics England is a wholly-owned company of the Department of Health, set up to deliver the 100,000 Genomes Project. This exciting endeavor will sequence and collect 100,000 whole genomes from 70,000 NHS patients and their families (with their full consent), focusing on patients with rare diseases as well as those with common cancers.
The program is designed to create a lasting legacy for patients as well as the NHS and the broader UK economy, while encouraging innovation in the UK’s bioscience sector. The genetic sequences will be anonymized and shared with approved academic researchers to help develop new treatments and diagnostic testing methods targeted at the genetic characteristics of individual patients.
Dell provides the platform for large-scale analytics in a hybrid cloud model for Genomics England, which leverages our VCE vScale, with EMC Isilon and EMC XtremIO solutions. The Project has been using EMC storage for its genomic sequence library, and it will now use an Isilon data lake to securely store data during the sequencing process. Backup services are provided by EMC Data Domain and EMC NetWorker.
The Genomics England IT environment uses both on-prem servers and IaaS provided by cloud service providers on G-Cloud. According to an article from Government Computing, “one of Genomics England’s key legacies is expected to be an ecosystem of cloud service providers providing low cost, elastic compute on demand through G-Cloud, bringing the benefits of scale to smaller research groups.”
There are two main considerations from an IT perspective around genome and DNA sequencing projects such as those being done by Genomics England and others: data management and speed. Vast amounts of research data have to be stored and retrieved, and this large-scale biologic data has to be processed quickly in order to gain meaningful insights.
Scale is another key factor. Sequencing and storing genomic information digitally is a data-intensive endeavor, to say the least. Sequencing a single genome creates hundreds of gigabytes of data, and the Project has sequenced over 13,000 genomes to date. It is expected to generate ten times more data over the next two years. The data lake being used by Genomics England allows 17 petabytes of data to be stored and made available for multi-protocol analytics (including Hadoop).
For perspective, 1 PB is a quadrillion bytes; think of that as 20 million four-drawer filing cabinets filled with text. Or consider that the Milky Way has roughly two hundred billion stars: if you count each star as a single byte, it would take 5,000 Milky Way galaxies to reach 1 PB of data. It’s staggering.
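For readers who like to check the math, the star comparison can be sketched in a few lines of Python. This is only a back-of-the-envelope illustration: it assumes the decimal definition of a petabyte (10^15 bytes) and the rough star count quoted above, and the function name `galaxies_for` is made up for this example.

```python
# Decimal (SI) petabyte: 10**15 bytes, i.e. one quadrillion bytes.
BYTES_PER_PB = 10**15

# Rough star count for the Milky Way, as quoted in the text (an estimate).
STARS_PER_GALAXY = 200 * 10**9


def galaxies_for(petabytes: int) -> int:
    """Milky Way-sized galaxies needed if each star counts as one byte."""
    return petabytes * BYTES_PER_PB // STARS_PER_GALAXY


print(galaxies_for(1))   # 5000
print(galaxies_for(17))  # 85000
```

By the same yardstick, the 17 PB data lake mentioned earlier would correspond to roughly 85,000 galaxies' worth of stars.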
The potential to help eradicate disease and identify exciting new treatments is truly awe-inspiring. And the immense scale of the data involved (5,000 galaxies!) puts reaching for the stars in a new context.