Running Hadoop on bare metal may fit some use cases, but many organizations have data workloads that demand more storage than compute resources. To get the most efficient utilization for these Hadoop workloads, it makes sense to separate the compute and storage resources, and this is a configuration users are asking for, e.g., EMC Isilon storage for Hadoop. In response to diverse Big Data needs such as mixed Hadoop workloads, hybrid cloud models, and heterogeneous data layers, Rackspace recently delivered new Big Data hosting options that give users more choices: run Hadoop on Rackspace managed dedicated servers, spin up Hadoop on the public cloud, or configure your own private cloud.
I spoke with Sean Anderson, Product Marketing Manager for Data Solutions at Rackspace, about one new offering in particular, the ‘Managed Big Data Platform’, whereby customers design the optimal configuration for their data and leave the management details to Rackspace.
1. The Big Data lifecycle goes through various stages, with each stage imposing different requirements. From what you are seeing with your customers, can you explain this lifecycle and the value Rackspace brings in supporting it?
There is a pattern we have seen with customers evaluating Hadoop. The first stage is technology validation, where they start with a sandbox environment, usually on a local computer, driven by a business case. If the appropriate MapReduce, Pig, or Hive jobs, for example, can provide the insights needed to meet the business goal, the technology validation is complete and they can move to the next phase: the Proof of Concept (or Proof of Value) stage. At this stage, most customers choose a cloud or hosted service to minimize risk and cost. Once the POC is validated, the Hadoop distribution is selected, and the architecture is defined, customers implement the solution on a hosted service or on premise (private cloud).
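To make the technology validation stage concrete, here is a minimal sketch of the kind of MapReduce job a team might run against a local sandbox. It is the classic word count; the class name and command-line paths are illustrative, and it assumes the standard Hadoop 2.x MapReduce API.

```java
// Minimal word-count job of the kind used to validate a Hadoop sandbox.
// Class names and paths are illustrative; assumes Hadoop 2.x MapReduce APIs.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Emits (word, 1) for every token in the input split.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Sums the per-word counts produced by the mappers.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

A job like this would be submitted to the sandbox cluster with something like `hadoop jar wordcount.jar WordCount /input /output` (paths hypothetical); if the output answers the business question, the validation stage is done.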
The way Rackspace supports this lifecycle is that we do not dictate what technology platform is deployed or how it is architected. For example, during the POC phase, we support a Hadoop cloud service, or a hosted or private cloud Hadoop service using bare metal or different storage options.
2. The ‘Managed Big Data Platform’ service is a dedicated environment whereby users can customize storage devices, platforms, architectures, and network designs. Describe some of the features and benefits of this offering.
With Hadoop in general, there are about 40 validated use cases in which the ratio of compute to storage resources varies, so an architecture must be designed at a more granular level to meet specific workloads as well as compliance requirements. The ‘Managed Big Data Platform’ provides the flexibility needed to design this granular architecture, with expert advice and support from Rackspace. For example, you can scale your Hadoop compute and storage independently with the EMC Isilon storage option, as sketched below. For compliance requirements, you can add network security layers with different firewall and intrusion detection options.
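Isilon exposes an HDFS interface on the storage side, which is what makes this independent scaling possible: compute nodes speak the HDFS protocol to the Isilon cluster rather than to local DataNodes. Below is a minimal sketch of a Hadoop client configured this way, assuming a hypothetical SmartConnect zone name (isilon.example.com) and the conventional HDFS port; the actual endpoint and port depend on the OneFS setup.

```java
// Sketch: pointing a Hadoop client at an Isilon-backed HDFS endpoint so
// compute and storage scale independently. The zone name and port are
// hypothetical; real values depend on the OneFS/SmartConnect configuration.
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class IsilonHdfsClient {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    // Instead of a NameNode on the compute cluster, fs.defaultFS targets
    // the Isilon SmartConnect zone, which serves the HDFS protocol.
    conf.set("fs.defaultFS", "hdfs://isilon.example.com:8020");

    FileSystem fs = FileSystem.get(conf);
    for (FileStatus status : fs.listStatus(new Path("/"))) {
      System.out.println(status.getPath());
    }
    fs.close();
  }
}
```

With this arrangement, adding compute nodes changes nothing on the storage side, and growing the Isilon cluster changes nothing in the job code, which is the independent-scaling property described above.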
3. What technologies make up the ‘Managed Big Data Platform’ service?
We have a technology portfolio to support an optimal configuration for the Hortonworks Data Platform that includes dedicated servers, storage platforms such as EMC Isilon and VNX, and network security options such as firewalls and intrusion detection.
4. What are the recommended reference architectures or configurations for the ‘Managed Big Data Platform’ service?
We have released four reference architectures to help customers start thinking about how to architect Hadoop. Two are bare metal, one small and one large. The other two use supplied storage: EMC Isilon and EMC VNX. These reference architectures are based on current Hadoop deployments, internal testing, and recommendations from our partners Hortonworks and EMC.
5. For the EMC Isilon storage configuration, what use cases will benefit most?
Many use cases have resonated with customers. The first is when Hadoop is part of a bigger data strategy: you can use Isilon as a data repository or data lake, parsing out some of the data services to Hadoop. The second is persistent Hadoop workloads, where data already lives on Isilon, so you can perform Hadoop operations faster, without the need to ingest data first. The last is when disaster recovery is needed, since Isilon provides snapshot and replication technology for stronger data protection than a bare metal Hadoop deployment.