FOSDEM 2018

Productionizing Spark ML Pipelines with the Portable Format for Analytics

Track: HPC, Big Data, and Data Science devroom
Room: H.1302 (Depage)
Day: Sunday
Start: 15:30
End: 15:55

The common perception of machine learning is that it starts with data and ends with a model. In real-world production systems, the traditional data science and machine learning workflow of data preparation, feature engineering and model selection, while important, is only one aspect. A critical missing piece is the deployment and management of models, as well as the integration between the model creation and deployment phases.

This is particularly challenging when deploying Apache Spark ML pipelines for low-latency scoring, since the Spark runtime is ill-suited to the needs of real-time predictive applications. In this talk I will introduce the Portable Format for Analytics (PFA) for portable, open and standardized deployment of data science pipelines and analytic applications. I will also introduce and evaluate Aardpfark, a library I have created for exporting Spark ML pipelines to PFA, and compare it with other open-source alternatives available in the community.