In traditional Hadoop environments, the entire data set must be ingested (and three or more copies of each block made) before any analysis can begin. Once analysis is complete, the results must then be exported. What’s the significance of this? COST. Ingest, export, and maintaining multiple copies of data are all tedious and time-consuming processes. With EMC Isilon HDFS, analysis of the entire data set can begin immediately without replicating it, and the results are immediately available to NFS and SMB clients as well.
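To make that "ingest first" step concrete, here is a minimal sketch using the standard Hadoop FileSystem Java API. The paths are hypothetical, and a real deployment would more likely use a tool such as DistCp or Flume, but the point is the same: nothing can be analyzed until this copy (and its replicas) exists.

```java
// A minimal sketch of the traditional "ingest first" step, using the stock
// Hadoop FileSystem API. The paths below are hypothetical.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class IngestBeforeAnalysis {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem hdfs = FileSystem.get(conf);

        // Copy the source data into HDFS. Every block written here is
        // replicated (3x by default) before any job can analyze it,
        // and results would later have to be copied back out.
        hdfs.copyFromLocalFile(new Path("/mnt/nfs/source/dataset.csv"),
                               new Path("/user/analytics/dataset.csv"));
    }
}
```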
If you don’t already own Isilon for your Hadoop environment, it is worth exploring the many benefits Isilon brings over HDFS running on compute hosts. If you are already an Isilon customer, no data movement is required at all: Isilon offers in-place analytics on your existing data, eliminating the need to build a specialty Hadoop storage infrastructure.
Ryan Peterson, Director of Solutions Architecture at Isilon, likes to say that Isilon dedupes Hadoop, since Isilon satisfies Hadoop’s need to see multiple copies of the same data without actually making those copies. In fact, the latest release of Isilon’s OneFS, 7.1, adds a new feature called SmartDedupe that can reduce storage consumption by approximately another 30%. Ryan now refers to this as Hadoop Dedupe Dedupe: the first ‘Dedupe’ removes the 3x replication, and the second ‘Dedupe’ reduces storage by roughly 30%. Clever!
I sat down with Ryan Peterson to have him walk us through Hadoop Dedupe Dedupe:
In a traditional Hadoop deployment, data loss resulting from hardware failure is handled by replicating each block of data a minimum of three times (3x by default), resulting in at least four data copies: the existing primary copy plus three Hadoop storage copies.
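For reference, that 3x figure is simply HDFS’s default block replication factor. A hedged sketch of how you might inspect it with the Hadoop Java API (the file path is hypothetical):

```java
// A sketch of where the "3x" comes from: dfs.replication is the standard
// HDFS setting for block replication. The file path below is hypothetical.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationCheck {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Stock HDFS ships with dfs.replication = 3
        int configured = conf.getInt("dfs.replication", 3);
        System.out.println("Configured replication factor: " + configured);

        // Each stored file also reports its actual replication
        FileSystem fs = FileSystem.get(conf);
        FileStatus status = fs.getFileStatus(new Path("/user/analytics/dataset.csv"));
        System.out.println("Replication for dataset.csv: " + status.getReplication());
    }
}
```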
Isilon for Hadoop turns this paradigm upside down. If the existing primary data is NOT already on Isilon, only 2.2 copies of the data are required to protect against data loss due to hardware failure: the first copy is the existing primary data off Isilon, and the second copy is on Isilon. Isilon’s N+M RAID-like distributed parity scheme turns that second copy into roughly 1.2 copies while still providing high availability and resiliency against node and disk failures.
If the primary data is already on Isilon, there’s no need for a separate Hadoop storage infrastructure in the first place, and only 1.2 data copies are made instead of 4. With Isilon’s de-duplication feature, the storage requirement goes down further by approximately 30%.
So if a customer has 300TB of raw data, they will need 900TB of new storage to run a traditional Hadoop cluster. If that data is already on Isilon, however, they need no new storage at all: the 300TB consumes roughly 360TB under N+M protection (1.2 copies), de-duplication trims that by about 30% to roughly 252TB, and Hadoop runs directly on that data.
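Here’s a back-of-the-envelope sketch of that arithmetic, using only the figures quoted above (3x replication, ~1.2 copies under N+M protection, ~30% dedupe); the 300TB data set is just the example from our conversation:

```java
// Back-of-the-envelope math using only the figures quoted in this post:
// 3x HDFS replication, ~1.2 copies under Isilon N+M protection, ~30% dedupe.
public class StorageEstimate {
    public static void main(String[] args) {
        double rawTB = 300.0;

        // Traditional Hadoop: a new 3x-replicated copy of the data set
        double traditionalNewTB = rawTB * 3.0;                 // 900 TB of new storage

        // Primary data not on Isilon: primary copy + ~1.2 copies on Isilon
        double offIsilonCopies = 1.0 + 1.2;                    // ~2.2 copies total

        // Primary data already on Isilon: ~1.2 copies, reduced ~30% by dedupe
        double onIsilonTB = rawTB * 1.2 * (1.0 - 0.30);        // ~252 TB consumed

        System.out.printf("Traditional HDFS: %.0f TB of new storage%n", traditionalNewTB);
        System.out.printf("Primary off Isilon: %.1f total data copies%n", offIsilonCopies);
        System.out.printf("Primary on Isilon: ~%.0f TB consumed, no new storage%n", onIsilonTB);
    }
}
```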
Wait a minute, is this Hadoop Dedupe Dedupe Dedupe?