While my last few blogs encouraged taking advantage of new technologies and not being constrained by “the way we’ve always done things”, for this blog I’ll emphasize the wisdom that can come from looking backwards.
Divide and conquer is an old strategy that has been applied effectively to many situations and problems. In politics, splitting the opposition by offering deals that appeal to individual groups or factions can enable the implementation of policies that would otherwise have united the opposition and prevented progress. Militarily, victory can be achieved by avoiding an enemy’s combined strength, engaging it repeatedly in smaller battles and whittling down its fighting capacity.
It is not often that politics, warfare, computing, and storage all intersect, but in this case, leveraging an age-old strategy can help us gain insights into today’s seemingly intractable problems.
The Hadoop infrastructure for analytics is built on this simple premise. In a Hadoop deployment, multiple commodity nodes are clustered together in a shared-nothing environment (i.e., each node can directly access only its own storage, and storage on other nodes is reachable only via the network). A large data set is written to the Hadoop environment, typically copied from an online transaction processing system. Within the Hadoop environment, an analytics task is processed as a series of independent “map” jobs, each of which processes a small chunk of the data that is local to its node, with many such “map” jobs running concurrently (the “divide” part of divide and conquer). The final results are then combined in a “reduce” phase, which merges all the smaller results to produce the final output (e.g., merging independently sorted sublists to produce a fully sorted list).
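To make the map/reduce pattern concrete, here is a minimal sketch along the lines of the standard Apache Hadoop WordCount example (nothing ViPR-specific): each map task tokenizes its local chunk of input and emits (word, 1) pairs, and the reduce phase totals the counts for each word.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // "Divide": each map task processes one chunk of the input, ideally the
  // chunk stored on its own node, emitting (word, 1) pairs; many map tasks
  // run concurrently across the cluster.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // "Conquer": the reduce phase combines the partial counts for each word
  // into a single total.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```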
The Hadoop style of processing is an elegant solution to the problem of making sense of today’s reams of data and translating them into useful information. However, a typical implementation treats the Hadoop system as yet another storage system: a cluster of nodes storing protected copies of the data. The Hadoop environment is optimized for batch processing rather than for normal data access, and it scales most effectively when each processing unit measures in the tens to hundreds of megabytes or larger, often requiring many small data items to be combined for Hadoop processing rather than allowing the data’s natural size to be used.
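As a rough sketch of that packing step, the snippet below uses Hadoop’s SequenceFile to bundle many small records into a single larger file that a map task can process efficiently. The RecordSource interface and its fetch() method are hypothetical stand-ins for whatever system actually holds the small records; they are not part of any product API.

```java
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class PackRecords {

  // Hypothetical interface standing in for whatever system holds the small records.
  public interface RecordSource {
    byte[] fetch(String id) throws Exception;
  }

  // Pack many small records (e.g., individual purchase records of a few KB)
  // into one SequenceFile so each map task receives a chunk of useful size.
  public static void pack(Configuration conf, Path out, List<String> recordIds,
                          RecordSource source) throws Exception {
    try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
        SequenceFile.Writer.file(out),
        SequenceFile.Writer.keyClass(Text.class),
        SequenceFile.Writer.valueClass(BytesWritable.class))) {
      for (String id : recordIds) {
        byte[] payload = source.fetch(id);  // fetch the small record (hypothetical)
        writer.append(new Text(id), new BytesWritable(payload));
      }
    }
  }
}
```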
That description of a Hadoop cluster should sound familiar: commodity nodes, each with its own disk farm, connected by a high-speed network, are exactly the components of a modern object storage system. However, object storage systems are optimized for both batch and interactive data access, are available to ingest and store both active, online data and older archive data, and are engineered to avoid bottlenecks when storing objects of any size, even objects as small as individual purchase records of perhaps a few kilobytes.
The recent ViPR 1.1 release leverages the commonality between the world of Hadoop and the world of object storage to deliver a solution that combines the best of both. The object storage platform delivers a highly scalable storage environment that provides high-speed access to all data, regardless of its natural size, and without the need to copy the data from the object store to a secondary storage platform. A Hadoop file system (HDFS) implementation is layered on top of the object store, providing the mechanisms the Hadoop framework needs to identify where each data object is stored and to run the “map” jobs as locally to the object as possible. There is no additional storage infrastructure to manage, and the data can be viewed either through an HDFS lens or through an object lens, depending on the needs of the moment.
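For readers curious how data locality surfaces to Hadoop in general: any HDFS-compatible file system registered for a path’s URI scheme reports, via the standard getFileBlockLocations() call, which hosts hold each portion of a file, and the MapReduce scheduler uses that information to place map tasks near the data. The sketch below shows this generic client-side view; the “viprfs://” scheme mentioned in the comment is illustrative, not a statement of the product’s documented configuration.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LocalityProbe {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // The path's URI scheme (e.g., an illustrative "viprfs://bucket/object")
    // selects which FileSystem implementation Hadoop loads; an object store's
    // HDFS layer registers itself under its own scheme in the cluster config.
    Path path = new Path(args[0]);
    FileSystem fs = path.getFileSystem(conf);

    FileStatus status = fs.getFileStatus(path);

    // getFileBlockLocations() is the standard hook the MapReduce scheduler
    // uses to discover which hosts hold each chunk, so map tasks can be
    // placed as close to the data as possible.
    for (BlockLocation loc : fs.getFileBlockLocations(status, 0, status.getLen())) {
      System.out.printf("offset=%d length=%d hosts=%s%n",
          loc.getOffset(), loc.getLength(), String.join(",", loc.getHosts()));
    }
  }
}
```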
As we continue to rethink storage, there will be more such opportunities to combine old ideas with new technologies to produce real value for today’s customers. Hadoop is a novel application of the tried-and-true “divide and conquer” strategy, and, when combined with the new storage paradigm of a scale-out object storage system, it produces an analytical framework that avoids the overhead of a dedicated analytics cluster and the unnecessary cost of transferring and reformatting data.