The decoupling of compute and storage has been one of the big takeaways and themes for Hadoop in 2015. BlueData has written some blog posts about the topic this year, and many of our new customers have cited this as a key initiative in their organization. And, as indicated in this tweet from Gartner’s Merv Adrian earlier this year, it’s been a major topic of discussion at industry events:
Last week I presented a webinar session with Chris Harrold, CTO for EMC’s Big Data solutions, where we discussed shared infrastructure for Big Data and the opportunity to separate Hadoop compute from storage. Several hundred people signed up for the webinar, and there was great interaction in the Q&A chat panel throughout the session. That turnout and engagement provide further validation of the interest in this topic – a clear indication that the market is looking for fresh ideas to cross the chasm with Big Data infrastructure.
Here’s a recap of some of the topics we discussed in the webinar (you can also view the on-demand replay here).
Traditional Big Data assumptions = #1 reason for complexity, cost, and stalled projects.
The herd mentality of believing that the only path to Big Data (and in particular Hadoop) is the way it was deployed at early adopters like Yahoo, Facebook, or LinkedIn has left scorched earth for many an enterprise.
The Big Data ecosystem has made Hadoop synonymous with:
- Dedicated physical servers (“just get a bunch of commodity servers, load them up with Hadoop, and you can be like a Yahoo or a Facebook”);
- Hadoop compute and storage on the same physical machine (the buzzword is “data locality” – “you gotta have it otherwise it’s not Hadoop”);
- Hadoop has to be on direct attached storage (DAS) (“local computation and storage” and “HDFS requires local disks” are traditional Hadoop assumptions).
If you’ve been living by these principles, following them as “the only way” to do Hadoop, and are getting ready to throw in the towel on Hadoop … it’s time to STOP and challenge these fundamental assumptions.
Yes, there is a better way.
Hadoop can run on containers or VMs. The new reality is that you can use virtual machines or containers as your Hadoop nodes rather than physical servers. This saves you the time of racking, stacking, and networking physical servers. You don’t need to wait for a new server to be ordered and provisioned, or fight deployment issues caused by whatever was left on a repurposed server before it was handed over to you.
With software-defined infrastructure like virtual machines or containers, you get a pristine and clean environment that enables predictable deployments – while also delivering greater speed and cost savings. During the webinar, Chris highlighted a virtualized Hadoop deployment at Adobe. He explained how they were able to quickly increase the number of Hadoop worker nodes from 32 to 64 to 128 in a matter of days, with significantly better performance than physical servers at a fraction of the cost.
Most data centers are fully virtualized. Why wouldn’t you virtualize Hadoop? As a matter of fact, all of the “Quick Start” options from Hadoop vendors run on a VM or (more recently) on containers (whether local or in the cloud). Companies like Netflix have built an awesome service based on virtualized Hadoop clusters that run in a public cloud. The requirement for on-premises Hadoop to run on a physical server is outdated.
The concept of data locality is overblown. It’s time to finally debunk this myth. Data locality is a silent killer that impedes Hadoop adoption in the enterprise. Copying terabytes of existing enterprise data onto physical servers with local disks, and then having to balance/re-balance the data every time a server fails, is operationally complex and expensive. In fact, it only gets worse as you scale your clusters up. Internet giants like Yahoo used this approach circa 2005 because those were the days of slow 1Gbps networks.
Today, networks are much faster and 10Gbps networks are commonplace. Studies from U.C. Berkeley AMPLab and newer Hadoop reference architectures have shown that you can get better I/O performance with compute/storage separation. And your organization will benefit from a simpler operational model, where you can scale and manage your compute and storage systems independently. Ironically, the dirty little secret is that even with compute/storage co-location, you are not guaranteed data locality in many common Hadoop scenarios. Ask the Big Data team at Facebook and they will tell you that only 30% of their Hadoop tasks run on servers where the data is local.
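If you want to see how often data locality actually holds on your own cluster, the standard MapReduce job counters already record it. Here is a minimal sketch in Java (the class and helper method names are mine, not from the webinar); it assumes you already have a handle to an org.apache.hadoop.mapreduce.Job that has finished running.

```java
import org.apache.hadoop.mapreduce.Counters;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.JobCounter;

public class LocalityReport {
    // Call this after job.waitForCompletion(true) to see how many map tasks
    // actually ran on a node that held their input data.
    public static void print(Job job) throws Exception {
        Counters counters = job.getCounters();
        long launched  = counters.findCounter(JobCounter.TOTAL_LAUNCHED_MAPS).getValue();
        long dataLocal = counters.findCounter(JobCounter.DATA_LOCAL_MAPS).getValue();
        long rackLocal = counters.findCounter(JobCounter.RACK_LOCAL_MAPS).getValue();

        System.out.printf("Launched maps:   %d%n", launched);
        System.out.printf("Data-local maps: %d (%.0f%%)%n",
                dataLocal, launched == 0 ? 0.0 : 100.0 * dataLocal / launched);
        System.out.printf("Rack-local maps: %d%n", rackLocal);
    }
}
```

On a busy, multi-tenant cluster you may be surprised how far that data-local percentage drops – which is exactly the point the Facebook numbers illustrate.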
HDFS does not require local disks. This is another one of those traditional Hadoop tenets that is no longer valid: local direct attached storage (DAS) is not required for Hadoop. The Hadoop Distributed File System (HDFS) is as much a distributed file system protocol as it is an implementation. Running HDFS on local disks is one such implementation approach, and DAS made sense for internet companies like Yahoo and Facebook – since their primary initial use case was collecting clickstream/log data.
However, most enterprises today have terabytes of Big Data from multiple sources (audio, video, text, etc.) that already reside in shared storage systems such as EMC Isilon. The data protection that enterprise-grade shared storage provides is a key consideration for these enterprises. And the need to move and duplicate this data for Hadoop deployments (with the 3x replication required for traditional DAS-based HDFS) can be a significant stumbling block.
BlueData and EMC Isilon enable an HDFS interface that can accelerate time-to-insight by leveraging your data in place and bringing it to the Hadoop compute processes – rather than waiting weeks or months to copy the data onto local disks. If you’re interested in more on this topic, you can refer to the session on HDFS virtualization that my colleague Tom Phelan (Co-Founder and Chief Architect at BlueData) presented at Strata + Hadoop World in New York this fall.
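To make the “protocol, not local disks” point concrete, here is a minimal sketch of a standard Hadoop HDFS client pointed at a remote endpoint over the network. The hostname and path are hypothetical placeholders; the same client code works whether the namespace is served by a traditional NameNode on DAS or by a storage system such as Isilon that speaks the HDFS protocol.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class RemoteHdfsList {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Point the client at a remote HDFS-compatible endpoint over the network;
        // nothing here assumes local disks on the compute node (hostname is hypothetical).
        conf.set("fs.defaultFS", "hdfs://storage.example.com:8020");

        try (FileSystem fs = FileSystem.get(conf)) {
            for (FileStatus status : fs.listStatus(new Path("/data"))) {
                System.out.println(status.getPath() + "\t" + status.getLen() + " bytes");
            }
        }
    }
}
```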
What about performance with Hadoop compute/storage separation?
Any time a new infrastructure approach is introduced for applications (and many of you have seen this movie before, when virtualization was introduced in the early 2000s), the number one question asked is “What about performance?” To address this question during our recent webinar session, Chris and I shared some detailed performance data from customer deployments as well as from independent studies.
In particular, I want to highlight some performance results from a BlueData customer that compared physical Hadoop performance against a virtualized Hadoop cluster (running on the BlueData software platform) using the industry-standard Intel HiBench benchmark suite (a minimal sketch for reproducing one of these workloads follows the list):
- Enhanced DFSIO: Generates a large number of reads and writes (read-specific or write-specific)
- TeraSort: Sort dataset generated by TeraGen (balanced read-write)
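For reference, TeraGen and TeraSort ship with the standard Hadoop examples, so you can reproduce the balanced read-write workload on your own cluster. The sketch below drives them through ToolRunner; the row count and paths are illustrative, and it assumes the hadoop-mapreduce-examples classes are on your classpath (the full HiBench suite adds its own wrappers and metrics on top of these jobs).

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.examples.terasort.TeraGen;
import org.apache.hadoop.examples.terasort.TeraSort;
import org.apache.hadoop.util.ToolRunner;

public class TeraSortRun {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Generate 10 million rows of synthetic data, then sort them
        // (row count and HDFS paths are illustrative only).
        int genStatus = ToolRunner.run(conf, new TeraGen(),
                new String[] {"10000000", "/benchmarks/teragen"});
        int sortStatus = ToolRunner.run(conf, new TeraSort(),
                new String[] {"/benchmarks/teragen", "/benchmarks/terasort"});

        System.out.println("TeraGen exit code:  " + genStatus);
        System.out.println("TeraSort exit code: " + sortStatus);
    }
}
```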
Testing by this customer (a U.S. Federal Agency lab) revealed that, across the HiBench micro-workloads investigated, the BlueData EPIC software platform enabled performance in a virtualized environment that was comparable to or better than bare-metal. As indicated in the chart above, I/O-intensive Hadoop jobs such as Enhanced DFSIO ran 50-60% faster, and balanced read-write operations ran almost 200% faster.
While the numbers may vary for other customers based on hardware specifications (e.g. CPU, memory, disk types), the use of BlueData’s IOBoost (application-aware caching) technology – combined with the use of multiple virtual clusters per physical server – contributed to the significant performance advantage over physical Hadoop.
This same U.S. Federal Agency lab ran the same HiBench benchmark to compare the performance of a DAS-based HDFS system against enterprise-grade shared NFS (with EMC Isilon), using the BlueData DataTap technology. DataTap brings data from any file system (e.g. DAS-based HDFS, NFS, object storage) to Hadoop compute (e.g. MapReduce) by virtualizing the HDFS protocol.
Across the board, the tests showed that enterprise-grade NFS delivered superior performance compared to the DAS-based HDFS system. This performance comparison validates that the network is not the bottleneck (a 10Gbps network was used) and that the 3x replication in DAS-based HDFS adds overhead.
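Stepping back from the benchmark numbers, the reason this kind of storage substitution is even possible comes down to a standard Hadoop mechanism, not anything BlueData-specific: a MapReduce job addresses its input and output by URI, and the scheme on the Path decides which FileSystem implementation handles the I/O. A minimal sketch is below; the mount point and hostname are hypothetical.

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class PluggableStorageJob {
    // Wire a job's input/output to different storage back-ends just by changing
    // the URI scheme; the mapper and reducer code does not change at all.
    public static void configurePaths(Job job) throws Exception {
        // Read from an NFS mount exposed through the local file system...
        FileInputFormat.addInputPath(job, new Path("file:///mnt/nfs/clickstream"));
        // ...or swap in an HDFS namespace without touching the job logic:
        // FileInputFormat.addInputPath(job, new Path("hdfs://namenode.example.com:8020/clickstream"));

        FileOutputFormat.setOutputPath(job, new Path("file:///mnt/nfs/output"));
    }
}
```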
A new approach will add value at every stage of the Big Data journey.
Big Data is a journey and many of you continue to persevere through it, while many others are just getting started.
Irrespective of where you are in this journey, the new approach to Hadoop that we described in the webinar session (e.g. leveraging containers and virtualization, separating compute and storage, using shared instead of local storage) will dramatically accelerate outcomes and deliver significant additional Big Data value.
For those of you just getting started with Big Data and in the prototyping phase, shared infrastructure built on the foundation of compute/storage separation will enable different teams to evaluate different Big Data ecosystem products and build use case prototypes – all while sharing a centralized data set versus copying and moving data around. And by running Hadoop on Docker containers, your data scientists and developers can spin up instant virtual clusters on-demand with self-service.
If you have a specific use case in production, a shared infrastructure model will allow you to simplify your dev/test environment and eliminate the need to duplicate data from production. You can simplify management, reduce costs, and improve utilization as you begin to scale your deployment.
And finally, if you need a true multi-tenant environment for your enterprise-scale Hadoop deployment, there is simply no alternative to using some form of virtualization and shared storage to deliver the agility and efficiency of a Big Data-as-a-Service experience on-premises.
Key takeaways and next steps for your Big Data journey.
In closing, Chris and I shared some final thoughts and takeaways at the end of the webinar session:
- Big Data is a journey: future-proof your infrastructure
- Compute and storage separation enables greater agility for all Big Data stakeholders
- Don’t make your infrastructure decisions based on the “data locality” myth
We expect these trends to continue into next year and beyond: we’ll see more Big Data deployments leveraging shared storage, more deployments using containers and virtual machines, and more enterprises decoupling Hadoop compute from storage. As you plan your Big Data strategy for 2016, I’d encourage you to challenge the traditional assumptions and embrace a new way to do Hadoop.
Finally, I want to personally thank everyone who made time to attend our recent webinar session. You can view the on-demand replay. Feel free to ping me (@AnantCman) or Chris Harrold (@CHarrold303) on Twitter if you have any additional questions.