Greenplum Database and MADlib Are Best When They Work Together

When I first heard of MADlib, the first things that came to mind were the comical word game and the hip-hop rapper. In the context of Big Data, MADlib is actually an open-source project for Magnetic, Agile, Deep (MAD) data analysis, an approach orthogonal to traditional Enterprise Data Warehouses and Business Intelligence. The primary goal of MADlib is to accelerate innovation in the Data Science community via a shared library of scalable in-database analytics.


One of the strong supporters and contributors to MADlib is EMC Greenplum, as MADlib is currently ported to the Greenplum Database, as well as the PostgreSQL Database. Since I am employed by EMC, I had the luxury of chatting with MADlib Architect Caleb Welton and MADlib Product Manager Gaurav Kumar over coffee in the Greenplum Break-room.

How is MADlib Magnetic, Agile, and Deep?

‘Magnetic’ refers to a system that attracts data and people to it. Traditional data warehouses tend to “repel” new data sources until the data is clean and integrated. MADlib algorithms embrace the ubiquitous nature of data by helping data scientists make inferences about data even if the data source has not undergone rigorous validation.

‘Agile’ is about enabling Data Scientists to quickly and easily experiment with the data, derive insight, refine the model, and iterate again. To be productive, this requires fast ad hoc analysis and sandboxing, which are made possible by the massive parallelism and scalability of the EMC Greenplum Database.

‘Deep’ comes from the capability to analyze an entire data set at scale rather than being forced to take a sample of your entire data set that is small enough to fit in memory on a single machine.

What makes MADlib clever like the game?

MADlib has been designed ‘cleverly’ from the ground up for big data by leveraging advances in parallel processing and computing architecture. For example, machine-learning algorithms in MADlib are designed to run efficiently on terabytes to petabytes of data. Machine learning and predictive analytics have been around since the 1970s, and several companies like P&G and Coca-Cola have been using machine-learning techniques to understand consumer behavior. However, the way these techniques are currently used is often limited by the fact that they are optimized for single-machine computing platforms and do not work well with terabytes and petabytes of data.

What makes MADlib cool like the rapper?

It is cool because it is free, open-source software that allows machine-learning algorithms to be customized. For example, Data Scientists are free to write their own version of k-means clustering and are not tied to any proprietary algorithm. MADlib was born at UC Berkeley, so its deep university origins allow us to leverage the latest research from universities. For example, researchers at the University of Wisconsin-Madison have recently contributed their research on Logistic Regression, which is now an official part of MADlib.
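To give a feel for what "writing your own version of k-means" involves, here is a minimal sketch of the textbook Lloyd's algorithm in plain Python. It is not MADlib's implementation and does not run in the database; the function name `kmeans`, its parameters, and the sample points are all made up for illustration.

```python
from math import dist  # Euclidean distance, available in Python 3.8+


def kmeans(points, centroids, max_iter=100):
    """Illustrative Lloyd's algorithm; returns the final centroids."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(max_iter):
        # Assignment step: group each point with its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        # An empty cluster keeps its previous centroid.
        updated = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster))
            if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
        if updated == centroids:  # no centroid moved, so we have converged
            break
        centroids = updated
    return centroids


# Hypothetical data: two obvious groups of 2-D points.
points = [(1.0, 1.0), (1.0, 2.0), (9.0, 9.0), (10.0, 9.0)]
centers = kmeans(points, centroids=[(0.0, 0.0), (10.0, 10.0)])
print(centers)
```

Because every step is visible, a data scientist could swap in a different distance function or a different seeding strategy, which is the kind of freedom an open-source library is meant to offer.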

Where does EMC Greenplum fit in?

Greenplum Database and MADlib are best when they work together. They are the perfect pair. MADlib algorithms have been written to leverage the parallel architecture of Greenplum’s Massively Parallel Processing (MPP) Database.
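As a rough illustration of why a parallel architecture helps, the sketch below shows one common pattern for parallel analytics: each segment folds its own rows into a small partial state, and only those states are merged. This is a plain-Python simulation under assumed names (`transition`, `merge`, `final`, and the `segments` data are hypothetical), not Greenplum or MADlib code.

```python
from functools import reduce


def transition(state, value):
    """Fold one row into a segment-local (count, total) state."""
    count, total = state
    return count + 1, total + value


def merge(a, b):
    """Combine the partial states produced by two segments."""
    return a[0] + b[0], a[1] + b[1]


def final(state):
    """Turn the merged state into the answer (here, the mean)."""
    count, total = state
    return total / count if count else None


# Hypothetical rows, already distributed across three segments.
segments = [[4.0, 8.0], [15.0, 16.0, 23.0], [42.0]]

# Each segment scans only its own rows; on an MPP system these scans
# would run at the same time on different machines.
partials = [reduce(transition, rows, (0, 0.0)) for rows in segments]

# Only the small per-segment states are combined, never the raw rows.
mean = final(reduce(merge, partials, (0, 0.0)))
print(mean)
```

The same shape extends to richer states, such as the sums needed for a regression, which is why the full data set can be analyzed without first pulling it onto a single machine.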

What is your favorite MADlib feature and why?

It has to be the open-source nature of MADlib since it is exciting to see a very active community continually making contributions to the software.

How do MADlib users contribute to the MADlib community?

Anyone can contribute through GitHub, a social, collaborative coding platform.

When it comes to analyzing data and building data models, many users prefer to use their favorite tools, such as SAS, MicroStrategy, and R. Is MADlib a replacement for these tools?

Not at all. MADlib can work hand-in-hand with your current BI ecosystem to access, query, and visualize the data. For example, a leading smart grid company is using Tableau as the visualization tool for the Greenplum Database and MADlib predictive analytics solution.

How can people get started with MADlib?

You can download the Greenplum Database Community Edition and MADlib for free.

About the Author: Mona Patel