FOSDEM 2024

Automating Spark (and Pipeline) Upgrades While "Testing" in Production

Track: HPC, Big Data & Data Science devroom
Room: UA2.118 (Henriot)
Day: Saturday
Start: 15:00
End: 15:30
Video only: ua2118

With Spark 4 in the pipeline for this year, many of us are looking at what will be involved in upgrading to the latest and greatest Spark. This talk will look at the open-source tooling we use to automate upgrading thousands of our Spark pipelines (from Spark 2.x to 3.4) and how we used a variation of the write-audit-publish technique to validate the new pipelines in production, including pipelines that have less testing than ideal.
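To make the idea of automated upgrade tooling concrete, here is a minimal sketch of a code mod, assuming a single illustrative rule: renaming `registerTempTable` (deprecated since Spark 2.0) to `createOrReplaceTempView`. It uses Python's standard `ast` module rather than any of the tools named in this talk. The `RENAMES` table, `RenameMethodCalls`, and `upgrade_source` are hypothetical names and are not taken from spark-upgrade.

```python
import ast

# Illustrative rule table (an assumption, not the spark-upgrade rule set).
# registerTempTable has been deprecated since Spark 2.0 in favour of
# createOrReplaceTempView.
RENAMES = {"registerTempTable": "createOrReplaceTempView"}


class RenameMethodCalls(ast.NodeTransformer):
    """Rename method calls whose attribute name appears in RENAMES."""

    def visit_Call(self, node):
        # Visit children first so chained and nested calls are rewritten too.
        self.generic_visit(node)
        func = node.func
        if isinstance(func, ast.Attribute) and func.attr in RENAMES:
            func.attr = RENAMES[func.attr]
        return node


def upgrade_source(source: str) -> str:
    """Parse the source, apply the renames, and emit the rewritten code."""
    tree = RenameMethodCalls().visit(ast.parse(source))
    return ast.unparse(tree)  # ast.unparse requires Python 3.9+


if __name__ == "__main__":
    print(upgrade_source('df.registerTempTable("events")'))
```

This sketch matches on the method name alone, so it would also rewrite an unrelated object that happens to have a `registerTempTable` method. Round-tripping through `ast` also drops comments and original formatting, which is one reason to reach for a dedicated refactoring tool in practice.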

Seeing is a pre-requisite to believing, so the talk will include a short demo showing how the spark-upgrade tool works on a demo pipeline complete with "live" validation.

In this talk, you will learn how to:
- Upgrade your Spark pipelines without crying*
- Validate Spark (and other similar) pipelines even when you don't trust the tests, by extending the write-audit-publish pattern on top of Iceberg, Hudi, or Delta Lake
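The write-audit-publish idea can be sketched in a few lines. The example below is a toy, in-memory illustration under stated assumptions: the `Table` class stands in for a table format with branches or snapshots, the audit step compares the migrated pipeline's output against the legacy pipeline's output, and every name here is hypothetical. It is one plausible reading of the "variation" mentioned in the abstract, not the speaker's implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Table:
    """Toy stand-in for a table with branches (hypothetical, not a real API)."""

    branches: dict = field(default_factory=lambda: {"main": []})

    def write(self, branch, rows):
        # Write: land the output on a branch that readers of "main" never see.
        self.branches[branch] = list(rows)

    def publish(self, branch):
        # Publish: promote the audited branch to "main" and drop the staging branch.
        self.branches["main"] = self.branches.pop(branch)


def audit(candidate, reference):
    """Audit: check that both outputs hold the same rows, ignoring row order."""
    key = lambda row: sorted(row.items())
    return sorted(candidate, key=key) == sorted(reference, key=key)


def old_pipeline(rows):
    # Legacy job whose output is trusted.
    return [{"user": r["user"], "total": r["a"] + r["b"]} for r in rows]


def new_pipeline(rows):
    # Migrated job that should produce the same output.
    return [{"user": r["user"], "total": sum((r["a"], r["b"]))} for r in rows]


def write_audit_publish(table, rows, old, new, branch="upgrade-audit"):
    reference = old(rows)
    table.write(branch, new(rows))
    if audit(table.branches[branch], reference):
        table.publish(branch)
        return True
    # On failure, "main" is untouched and the staged branch stays for inspection.
    return False


if __name__ == "__main__":
    data = [{"user": "a", "a": 1, "b": 2}, {"user": "b", "a": 3, "b": 4}]
    table = Table()
    print(write_audit_publish(table, data, old_pipeline, new_pipeline))
```

The point of the design is that the comparison against the legacy output does the validating, so the migrated pipeline can run against production data without its results becoming visible until the audit passes.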

Time permitting, I will end with some exciting new (but non-backward-compatible) changes coming in Spark 4 (tentatively scheduled for June, but it's software).

*Not a guarantee, some upgrades may still cause tears

Related links:
- https://github.com/holdenk/spark-upgrade - the upgrade tool being discussed
- https://scalacenter.github.io/scalafix/ - the code mod tool we used for Scala
- https://pybowler.io/ - the Python code refactoring tool we extended
- https://sqlfluff.com/ - the SQL linting / cleanup tool we extended