Since Hadoop’s inception, both data analytics and analytics infrastructures have grown and evolved tremendously. There is currently a paradigm shift underway that is poised to further transform the way enterprises manage their rapidly expanding data and supporting infrastructure.
The paradigm shift is taking place around analytics architectures – specifically Hadoop-based architectures – with respect to deployment. Let’s begin by looking at the way many analytics projects get started within an enterprise.
How Analytics Projects Start
Most projects start with some type of discovery process. Take, for example, a team of scientists looking for that “golden nugget” within their data that might offer a significant value-add to their business. These scientists might work with their IT department to set up a small Hadoop cluster, load the data, and begin the iterative process of data visualization, cleansing and testing until the hypothesis is either validated or disproven. If the hypothesis is proven, the ideas are implemented and eventually go live – success!
What happens next – continuing with the example of our successful scientists – is that other departments take notice of that success, and want to utilize the data platform so they too can benefit. In most cases, they’ll not only use the existing data in the cluster, but also bring in new external or internal data and new applications for their projects. This starts to increase the physical size of the infrastructure.
The issue we hear most from customers at this point is how to effectively grow and manage that ever-expanding cluster as more data and applications continue to flow into it. The inherent problem with such expanding architectures is that they become complex very quickly and too complicated to maintain. There are a lot of spinning disks to manage and the default 3x HDFS data replication to maintain, and these clusters very quickly grow to be cost inefficient.
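To make that replication overhead concrete, HDFS keeps three copies of every block by default, and the factor is controlled by a single property. The snippet below is a minimal illustration of that setting in hdfs-site.xml; the value shown is simply the Hadoop default, and lowering it trades fault tolerance for capacity.

```xml
<!-- hdfs-site.xml -->
<!-- Number of copies HDFS keeps of every block (Hadoop default: 3). -->
<!-- This is why raw capacity planning on a DAS cluster typically    -->
<!-- starts at roughly 3x the usable data set.                       -->
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>
```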
How to Manage That Growth
There are three potential approaches to consider:
- Add more nodes with direct-attached storage (DAS), adding both compute and storage
- Decouple compute and storage, using scale-out network-attached storage (NAS)
- Adopt a hybrid architecture (i.e., a tiered-storage Hadoop architecture)
With traditional Hadoop deployments, the primary way to expand the infrastructure is to add more nodes with direct-attached storage (DAS), which adds more storage and compute power at the same time. The challenge here is that as the number of applications and the amount of data grow, data access patterns change. Different applications require different performance environments – some need more compute power, others are more storage dependent.
Adding more nodes can be inefficient if you only need more storage – you’ll overspend on compute power you might not need. Likewise, if you only need compute power, you’re also potentially purchasing storage that will go unused.
Which brings us to the second approach. As the number of clusters in an enterprise grows, the inherent complexity of maintaining them grows as well, and that complexity doesn’t always grow linearly. By separating compute and storage, and incorporating an enterprise-grade storage solution that handles data-level functions such as governance, security, encryption, user management, data access management and multitenancy, much of the inherent complexity associated with growing data volumes can be mitigated.
This approach allows users to adjust to shifting application performance requirements as necessary. If more compute power is needed, for example, add more servers. In a virtualized environment, you can simply spin up more compute nodes to address compute challenges. This kind of elastic, virtualized environment can be scaled far more quickly than deploying physical hardware.
The same is true for storage-dependent applications. For those apps, moving away from traditional DAS infrastructures and leveraging network-attached storage (NAS) allows you to leave compute power as is and focus on increasing storage. This is cost-effective since you aren’t paying for compute resources you don’t need – it also reduces the footprint in the data center, including the costs and resources associated with cooling and maintaining server hardware.
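As a rough sketch of what this decoupling can look like in practice, a Hadoop cluster can be pointed at an external, HDFS-compatible scale-out NAS through its default filesystem setting. The hostname and port below are purely illustrative placeholders; the actual endpoint and URI scheme depend on the storage vendor.

```xml
<!-- core-site.xml -->
<!-- Point the cluster’s default filesystem at an external,   -->
<!-- HDFS-compatible scale-out NAS instead of local DAS.      -->
<!-- Hostname and port are placeholders, not a real endpoint. -->
<property>
  <name>fs.defaultFS</name>
  <value>hdfs://nas-hdfs.example.com:8020</value>
</property>
```

With the filesystem externalized this way, compute nodes can be added or removed without redistributing data, and storage can be grown on its own schedule.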
The third option is a hybrid, tiered-storage environment, which addresses many of the challenges brought about by the paradigm shift mentioned earlier and also embodies the idea of separating storage from compute. In addition, it allows enterprises to tier their data based on its temperature – for example, “hot” (frequently accessed) data versus “cold” (archived, less frequently accessed) data. The longer data exists, the more it cools in temperature – what makes sense in this type of environment is a cheaper, deeper NAS solution extending your primary Hadoop storage. Enterprises can increase their storage footprint by adding less expensive NAS tiers that don’t require data replication, essentially providing cheaper archives, without adding compute.
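For clusters that keep hot and cold tiers inside HDFS itself, recent Hadoop releases also expose heterogeneous storage policies. The commands below are a minimal sketch of tagging an aging dataset as COLD so its blocks migrate to ARCHIVE-class storage; the path is hypothetical, and the available policies depend on your Hadoop version and how the DataNode storage types are configured.

```shell
# List the storage policies this HDFS version supports (e.g., HOT, WARM, COLD)
hdfs storagepolicies -listPolicies

# Tag an aging dataset as COLD so its blocks target ARCHIVE storage
# (the path is illustrative)
hdfs storagepolicies -setStoragePolicy -path /data/2013/clickstream -policy COLD

# Migrate existing blocks to match the newly assigned policy
hdfs mover -p /data/2013/clickstream
```

The same idea applies when the cold tier lives on external NAS rather than on archive disks inside the cluster; the tiering mechanism simply shifts from HDFS storage types to the storage platform itself.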
How to Take Advantage of Data Lakes
Bringing a true, multiprotocol NAS product into the Hadoop environment enables enterprises to take advantage of that magic “DL” word: data lakes.
Briefly, I define a data lake as a way of storing data that gives an organization full, multi-tenant, secure and scalable access to all of its data, all the time, wherever the organization requires it. In other words, a data lake allows any and all applications to access the data, regardless of their connectivity requirements, all the while maintaining the data in one central location.
Data lakes address the “Three V’s of Big Data”:
- Velocity – You can grow and manage your compute size based on application demand
- Volume – You can handle the data volumes coming in
- Variety – You can manage different data sources providing different types of data, which may require a variety of gateways for applications to connect to and access that data securely
As data volumes continue to expand exponentially, and Hadoop analytics architectures continue to grow within the enterprise, the demand for enterprise-grade, highly available, highly secure and extremely scalable storage architectures will only continue to grow as well. My recommendation is to consider either decoupling your storage and compute, or adopting a tiered Hadoop storage architecture.