Brussels / 30 & 31 January 2016


MADlib: Distributed In-Database Machine Learning for Fun and Profit

Apache MADlib (incubating) is an innovative SQL-based open source library for scalable in-database analytics. It provides parallel implementations of mathematical, statistical and machine learning methods for structured and unstructured data. MADlib also has an R interface for data scientists who prefer to work in R.

In this talk, we will describe the impetus behind creating a SQL-based scale-out machine learning project, review the architecture and implementation, and describe some of the recent functionality added by the Apache community. We will also present the R interface to MADlib, called PivotalR.

Finally, we will discuss the future direction of the project and invite big data developers and data scientists to participate in Apache MADlib, for both fun and profit.

The primary goal of the MADlib project is to accelerate innovation in the data science community via a shared library of scalable in-database analytics.

Many existing analytics products do not scale in a way that makes it convenient and economical to operate on very large data sets, an increasingly common scenario in this era of Big Data. The methods in MADlib have been designed to take advantage of the shared-nothing, scale-out parallelism offered by modern parallel database engines.
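To make the shared-nothing idea concrete, here is a minimal Python sketch, not MADlib code. It assumes a simple one-variable least-squares fit: each segment of the data computes partial sums over its own rows, and only those small partial states are combined at the end. The names `State`, `transition`, `merge`, `final`, and `fit` are illustrative choices for this sketch, not MADlib's API.

```python
from dataclasses import dataclass
from functools import reduce


@dataclass
class State:
    """Sufficient statistics for fitting y = slope * x + intercept."""
    n: int = 0
    sx: float = 0.0
    sy: float = 0.0
    sxx: float = 0.0
    sxy: float = 0.0


def transition(state, row):
    """Fold one (x, y) row into a segment-local state."""
    x, y = row
    return State(state.n + 1, state.sx + x, state.sy + y,
                 state.sxx + x * x, state.sxy + x * y)


def merge(a, b):
    """Combine the states of two segments; no raw rows are exchanged."""
    return State(a.n + b.n, a.sx + b.sx, a.sy + b.sy,
                 a.sxx + b.sxx, a.sxy + b.sxy)


def final(state):
    """Turn the merged statistics into (slope, intercept)."""
    denom = state.n * state.sxx - state.sx ** 2
    slope = (state.n * state.sxy - state.sx * state.sy) / denom
    intercept = (state.sy - slope * state.sx) / state.n
    return slope, intercept


def fit(segments):
    """Each inner list stands in for the rows held by one database segment."""
    partials = [reduce(transition, rows, State()) for rows in segments]
    return final(reduce(merge, partials, State()))


if __name__ == "__main__":
    # Rows of y = 2x + 1, spread unevenly over three hypothetical segments.
    segments = [[(0, 1), (1, 3)], [(2, 5)], [(3, 7), (4, 9), (5, 11)]]
    print(fit(segments))
```

Because the merge step only adds fixed-size states, the amount of data exchanged between segments does not grow with the number of rows, which is what lets this style of computation scale out.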

MADlib currently runs on the following open source platforms: Apache HAWQ (incubating), a Hadoop-native SQL database; Greenplum Database; and PostgreSQL. Big Data has also renewed interest in query optimization, so as these platforms improve distributed query performance, libraries such as MADlib benefit as well and can deliver more value to research and commercial organizations that want to reason over very large data sets.

Some other key principles driving the architecture of MADlib are:

  • Operate on the data locally in-database. Do not move data between multiple runtime environments unnecessarily.

  • Utilize best-of-breed parallel database engines, but separate the machine learning logic from database-specific implementation details.

  • Foster open community development, including active ties to academic research.

We look forward to presenting this topic at FOSDEM’16!


Frank McQuillan