A mature data management industry, based primarily on relational database management system (RDBMS) technology, has been established over the last two decades. However, the emergence of the Big Data phenomenon, characterized by the 3Vs (Volume, Velocity, and Variety) of data and the agile development of data-driven applications, has introduced a new set of challenges, and a variety of technologies have emerged to address them. Benchmarks provide a method for comparing the performance of systems and are often used to evaluate their suitability for procurement. The advent of new techniques and technologies for Big Data creates the imperative for industry standard benchmarks to evaluate such systems.
Big Data systems are characterized by their flexibility in processing diverse data genres, such as text, images, video, geo-locations, and sensor data, using a variety of methods. Because Big Data comes from many sources and is analyzed with many methods, no single benchmark can characterize all use cases. However, a study of several Big Data platform use cases indicates that most workloads are composed of a common set of stages, which capture the variety of data genres and algorithms commonly used to implement most data-intensive end-to-end workloads.
Beginning in late 2011, the Center for Large-scale Data Systems Research (CLDS) at the San Diego Supercomputer Center (SDSC), in collaboration with several industry players, initiated a community activity in Big Data benchmarking. The goal was to define reference benchmarks that capture the essence of Big Data application scenarios and to help characterize and understand hardware and system performance and the price-to-performance ratio of Big Data platforms. Founding members of this benchmarking initiative include Dr. Chaitan Baru (CLDS), Raghunath Nambiar (Cisco), Meikel Poess (Oracle), and Tilmann Rabl (University of Toronto), in addition to Greenplum, a division of EMC.
As a result of these initial activities, a Workshop Series on Big Data Benchmarking (WBDB), sponsored by the National Science Foundation, was organized. These workshops and associated meetings validated the initial idea that a Big Data benchmark should include definitions of the data along with a data generation procedure; a representative workload for emerging Big Data applications; and a set of metrics, run rules, and full disclosure reports for fair comparison of technologies and platforms.
A formal specification of this benchmarking suite is underway and will be announced at O’Reilly’s Strata Conference on February 28, 2013, in a session I will conduct with Chaitan Baru. We will be unveiling the current status of our effort toward a Big Data Top 100 List, and we encourage you to participate in this community-based endeavor to define an end-to-end, application-layer benchmark for Big Data applications.