The human body is an amazing machine, and yet it is built from roughly 20,000 genes. Think about that: everything that makes us who we are is defined in about 20,000 genes – unbelievable, really. The importance of our genome cannot be overstated, so it is no surprise that scientists have expended a massive effort to sequence our genes and understand the basic building blocks of human life.
A first draft of the human genome was completed in 2000 and announced by President Clinton with great fanfare and excitement. Yet when we think back to the technology used, it seems quaint, even archaic. At the time, the Pentium was Intel’s flagship CPU, and it was paired with hard drives storing a whopping 50GB! The world of computing has evolved since then with new technologies such as multi-core CPUs, massively parallel processing architectures, high-capacity data analytics and significant increases in I/O bandwidth.

As computing has advanced, the field of genomics has followed. In 2000, we were thrilled to sequence 6 people; today, we have over 250,000 sequences which, if stored in one place, would require 25PB of storage – a manageable capacity by today’s standards. However, the amount and frequency of sequencing will continue to accelerate. Michael Schatz, an associate professor at the Simons Center for Quantitative Biology at Cold Spring Harbor Laboratory, estimates that by 2025, 1 billion people will have their genome sequenced, resulting in exabytes of storage! Now that is big data, and it will only get bigger.
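For a rough sense of the scale, here is a back-of-the-envelope calculation; the figure of roughly 100GB of raw sequence data per genome is an illustrative assumption, not a number from the projects mentioned above:

```python
# Back-of-the-envelope storage estimate.
# Assumption (illustrative only): ~100 GB of raw sequence data per genome.
GB_PER_GENOME = 100

genomes_today = 250_000          # sequences available today
genomes_2025 = 1_000_000_000     # Schatz's 2025 estimate

storage_today_pb = genomes_today * GB_PER_GENOME / 1_000_000     # GB -> PB
storage_2025_eb = genomes_2025 * GB_PER_GENOME / 1_000_000_000   # GB -> EB

print(f"Today: ~{storage_today_pb:.0f} PB")  # ~25 PB
print(f"2025:  ~{storage_2025_eb:.0f} EB")   # ~100 EB, i.e. exabytes
```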
The challenge of managing genetic data extends beyond just storing the information. The greatest benefits come from analyzing the genes to understand trends across populations and improve medical diagnoses and treatments. Imagine the insights we could garner by comparing the genomes of hundreds or even thousands of individuals suffering from a common ailment. The new knowledge could help establish genetic tendencies and lead to more effective use of current treatments and the creation of new ones. But analyzing all of that data in a consistent, timely and reliable manner will be a challenge.
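To make the idea concrete, here is a minimal sketch of what such a population comparison might look like; the variant identifiers, the sample groups and the 20% frequency threshold are all hypothetical, and a real study would rely on far larger cohorts and proper statistical testing:

```python
from collections import Counter

# Hypothetical input: each individual is represented as a set of variant IDs
# observed in their genome (e.g. "chr7:117559590:G>A").
cases = [
    {"chr7:117559590:G>A", "chr1:155235000:T>C"},
    {"chr7:117559590:G>A"},
    {"chr1:155235000:T>C"},
]
controls = [
    {"chr1:155235000:T>C"},
    set(),
    {"chr7:117559590:G>A"},
]

def variant_frequencies(group):
    """Fraction of individuals in the group carrying each variant."""
    counts = Counter(v for person in group for v in person)
    return {v: n / len(group) for v, n in counts.items()}

case_freq = variant_frequencies(cases)
control_freq = variant_frequencies(controls)

# Variants noticeably more common in the affected group are candidates for
# follow-up (a real analysis would use a proper test, e.g. chi-square).
for variant, freq in sorted(case_freq.items(), key=lambda kv: -kv[1]):
    if freq - control_freq.get(variant, 0.0) > 0.2:
        print(f"{variant}: cases {freq:.0%} vs controls "
              f"{control_freq.get(variant, 0.0):.0%}")
```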
Looking forward, genomic experts will need systems that are massively scalable and can rapidly run analyses on huge data sets. In practice, these problems are similar (although potentially larger in scope) to those faced by companies analyzing large volumes of machine or sensor information, such as that generated by robo-trains. In either case, a dedicated big data solution like EMC’s Data Lake offering can provide a key building block for creating meaningful health or business insights.
So how does a Data Lake solution help? The critical challenge for these huge-scale analyses is having access to a dynamic storage, computing and analytics environment that can scale and provide the performance needed to generate actionable insights. Open source technologies like Hadoop will be a key part of the solution, as will advanced analytics and storage technologies such as those offered by EMC’s Data Lake solutions. Combining these solutions with powerful genomics research creates an opportunity to revolutionize medicine as we know it, and I am excited to see what the future will bring.