Architecting a Data Lake: Matching Technology with Your Harvesting Needs

It takes many different best-of-breed technologies to effectively harvest “game-changing” analytics value from the data lake. Getting the right architecture to navigate your data lake requires a deep understanding of both your Big Data needs and the available technologies, so that you can match analytics use cases with the appropriate platforms and get results.

Do you need to analyze large amounts of data fast or process many queries simultaneously? Is the data you are using organized in columns and rows, customer records perhaps? Or are you searching document files?

Let’s look at the basics of data lake architecture, some of the technologies and tools you should consider, and how EMC IT is approaching this crucial process.

Data Lake: Core Architectures

[Figure: data lake architecture graphic]

The data lake architectures you may be most familiar with are Hadoop and Greenplum, which make up the core of our data lake at EMC IT. The Hadoop Distributed File System (HDFS) is open-source software that turns commodity servers into one large data store, the data lake. Data in a Hadoop cluster is broken down into smaller pieces, called blocks, and distributed throughout the cluster. This allows MapReduce functions to be executed on smaller subsets of your larger data sets, which provides the scalability needed for Big Data processing.

Hadoop can process extremely large amounts of both structured and unstructured data. It is primarily aimed at data science activities, using models to search for trends and patterns in vast data sets. In other words, data scientists rely on Hadoop to run various models to try to discover unknown things in the data.

A good example is the “call home” data we get from our install base of EMC products. Using HDFS to analyze the unstructured data we regularly receive from all of our devices in the field, we are able to discover things like hard-disk failure rates.
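
To make that concrete, here is a minimal sketch of a Hadoop Streaming job in Python that could tally disk-failure events per product model from call-home logs. The record layout, field names and paths are hypothetical; the point is that the mapper runs independently on each block of data spread across the cluster, and the reducer aggregates the partial results.

    # mapper.py -- Hadoop Streaming runs one copy per input split/block.
    # Assumes a hypothetical tab-separated format: timestamp, model, event_type
    import sys

    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip malformed records in the unstructured feed
        _, model, event_type = fields[:3]
        if event_type == "DISK_FAILURE":
            print("%s\t1" % model)  # emit (model, 1) for each failure event

    # reducer.py -- receives mapper output sorted by key (model) and sums it.
    import sys
    from itertools import groupby

    def parse(stream):
        for line in stream:
            model, count = line.rstrip("\n").split("\t")
            yield model, int(count)

    for model, group in groupby(parse(sys.stdin), key=lambda kv: kv[0]):
        print("%s\t%d" % (model, sum(count for _, count in group)))

    # Submitted (roughly) with the Hadoop Streaming jar:
    #   hadoop jar hadoop-streaming.jar -files mapper.py,reducer.py \
    #       -mapper mapper.py -reducer reducer.py \
    #       -input /callhome/raw -output /callhome/disk_failures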

Greenplum is an open-source, massively parallel relational database geared to processing structured and semi-structured data for business analytics. Unlike data science, business analytics is based more on “what if” types of queries about existing data than on trying to discover the unknown. For instance, Greenplum is geared to questions that will help you know more about your customers, or how best to focus marketing efforts, by analyzing customer data sets.
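
Because Greenplum speaks the PostgreSQL wire protocol, those “what if” questions can be asked from any standard PostgreSQL client. The sketch below is only illustrative: the host, credentials and the customer tables are hypothetical.

    # Hypothetical sketch: a "what if" segmentation query against Greenplum,
    # reached through the standard PostgreSQL driver. All names are made up.
    import psycopg2

    conn = psycopg2.connect(host="greenplum.example.com", port=5432,
                            dbname="analytics", user="analyst", password="...")
    try:
        with conn.cursor() as cur:
            cur.execute("""
                SELECT c.segment,
                       COUNT(*)            AS customers,
                       AVG(t.ticket_count) AS avg_tickets
                FROM   customers c
                JOIN   support_ticket_summary t ON t.customer_id = c.customer_id
                GROUP  BY c.segment
                ORDER  BY avg_tickets DESC;
            """)
            for segment, customers, avg_tickets in cur.fetchall():
                print(segment, customers, avg_tickets)
    finally:
        conn.close()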

Harvesting Data Subsets

Beyond these core architectures is the execution tier of the data lake: platform technologies around the “edge” of the lake designed to use subsets of the data more efficiently for specific types of analysis. Execution tier technologies allow you to handle more queries with faster response times, versus fewer queries on larger sets of data.

While this is definitely a non-lake visual, think of Hadoop and Greenplum as the tractors tilling the fields, or the workhorses, and the execution-tier platforms as the delivery trucks transporting the produce.

In EMC IT’s data lake, we use several best-of-breed technologies in our execution tier:

• PostgreSQL: the number-one open-source relational database, extremely capable of handling many tasks, and many concurrent users, efficiently.
• MongoDB: a document database well suited to flexible, JSON-like records (see the sketch after this list).
• Cassandra: particularly good for querying large amounts of structured or semi-structured data, such as sensor logs.
• MemSQL: a distributed database that handles both row and column data and provides very low-latency responses to queries on structured data.
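
As a sketch of how an execution-tier platform serves an application, the example below stores a pre-aggregated subset of lake data as documents in MongoDB and answers point queries against it with low latency. The database, collection and field names are hypothetical.

    # Hypothetical sketch: an application-facing subset of lake data in MongoDB.
    from pymongo import MongoClient

    client = MongoClient("mongodb://mongo.example.com:27017")
    db = client["lake_edge"]

    # A nightly job might land per-device health summaries computed in the lake:
    db.device_health.insert_one({
        "device_id": "ARRAY-0042",
        "model": "hypothetical-model-x",
        "disk_failures_30d": 2,
        "last_call_home": "2016-04-28T03:14:00Z",
    })

    # The application then answers point lookups without touching the core lake:
    doc = db.device_health.find_one({"device_id": "ARRAY-0042"})
    print(doc["disk_failures_30d"])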

All of these platforms are used so that developers can build applications that draw on the data in the data lake to gain insight in specific areas. These are just some of the many technologies in the data lake execution tool box, and these tools and platforms will evolve as both your data analytics needs and the technology change.

Leveraging SAN

There are some software technologies that can help your data lake architecture perform better. For example, using a server-based storage area network (SAN) with Hadoop provides data capacity, flexibility and speed at a lower cost than expensive storage arrays.

For our SAN we use EMC’s ScaleIO, a software-defined storage product that takes the complexity of managing commodity server storage out of the hands of data center staff. ScaleIO pools the storage of commodity servers together into a server-based SAN. It is a major component of our data lake, providing compute and storage to our Hadoop client nodes as well as to platforms like Cassandra.

Requirements for an optimal data lake execution tier are fairly simple: you need to get at large amounts of data fast, and you need to be able to expand a platform fast. These are perfect use cases for ScaleIO.

We also use Isilon, EMC’s scale-out Network Attached Storage (NAS) platform, to manage our HDFS tier and provide fast access to our data lake. One of the many protocols supported by the Isilon appliance is HDFS. Using Isilon for HDFS allows us to harness the power of Hadoop without the complexity of managing the physical infrastructure ourselves.
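
For illustration, pointing Hadoop at Isilon is largely a configuration exercise: the Isilon cluster’s HDFS endpoint is set as the default file system in the Hadoop client’s core-site.xml. The host name below is a placeholder for an Isilon SmartConnect zone name, and the port shown is the common HDFS default rather than a confirmed setting.

    <!-- core-site.xml (sketch): route HDFS traffic to the Isilon cluster.
         "isilon-hdfs.example.com" stands in for the SmartConnect zone name. -->
    <configuration>
      <property>
        <name>fs.defaultFS</name>
        <value>hdfs://isilon-hdfs.example.com:8020</value>
      </property>
    </configuration>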

So as you grow your data lake, be aware that there are many evolving proprietary and open-source technologies that can help you get the most out of your Big Data.

If you are attending EMC World 2016, May 2-5 in Las Vegas, be sure to check out our session on NoSQL and Modern Analytics on May 2 from 4:30 to 5:30 p.m., featuring Darryl Smith, Ramesh Razdan and Tarik Dwiek.

About the Author: Darryl Smith