Automating Spark (and Pipeline) Upgrades While "Testing" in Production
- Track: HPC, Big Data & Data Science devroom
- Room: UA2.118 (Henriot)
- Day: Saturday
- Start: 15:00
- End: 15:30
- Video only: ua2118
With Spark 4 in the pipeline for this year, many of us are looking at what upgrading to the latest and greatest Spark will involve. This talk will look at the open-source tooling we use to automate upgrading thousands of our Spark pipelines (from Spark 2.X -> 3.4) and how we used a variation of the write-audit-publish technique to validate the new pipelines in production, including pipelines that have less testing than ideal.
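To make "automated upgrading" concrete, here is a minimal sketch of a rule-driven source rewrite. It is not the spark-upgrade tool's implementation: it uses Python's standard `ast` module rather than Scalafix or PyBowler, and the single rule (renaming `registerTempTable`, deprecated since Spark 2.0, to `createOrReplaceTempView`) was picked for illustration. The names `RENAMES`, `RenameMethodCalls`, and `upgrade_source` are hypothetical.

```python
import ast

# Hypothetical rule table for this sketch: deprecated method -> replacement.
RENAMES = {"registerTempTable": "createOrReplaceTempView"}


class RenameMethodCalls(ast.NodeTransformer):
    """Rewrite `obj.old_name(...)` method calls to `obj.new_name(...)`."""

    def visit_Call(self, node):
        self.generic_visit(node)  # handle nested calls first
        func = node.func
        if isinstance(func, ast.Attribute) and func.attr in RENAMES:
            func.attr = RENAMES[func.attr]
        return node


def upgrade_source(source: str) -> str:
    """Parse the source, apply the rename rules, and emit new source."""
    tree = RenameMethodCalls().visit(ast.parse(source))
    return ast.unparse(tree)  # ast.unparse requires Python 3.9+


old_pipeline_source = (
    'df.registerTempTable("events")\n'
    'result = spark.sql("SELECT * FROM events")'
)
new_pipeline_source = upgrade_source(old_pipeline_source)
print(new_pipeline_source)
```

Note that `ast.unparse` discards comments and original formatting, which is one reason a real migration would lean on dedicated refactoring tools instead of this approach.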
Seeing is a pre-requisite to believing, so the talk will include a short demo showing how the spark-upgrade tool works on a demo pipeline complete with "live" validation.
In this talk, you will learn how to:
- Upgrade your Spark pipelines without crying*
- Validate Spark (and other similar) pipelines even when you don't trust the tests, by extending the write-audit-publish pattern on top of Iceberg, Hudi, or Delta Lake.
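The validation idea can be sketched as follows. This is one plausible reading of "extending write-audit-publish", not the talk's actual implementation: both the old and the upgraded pipeline write to staging branches, the audit step compares their outputs, and only then is something published. The `Table` class is an in-memory stand-in for a table format with branch support, and every name here (`Table`, `audit`, `validate_upgrade`, the sample pipelines) is hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Table:
    """In-memory stand-in for a table with named branches (assumption)."""
    branches: dict = field(default_factory=lambda: {"main": []})

    def write(self, branch, rows):
        self.branches[branch] = list(rows)

    def read(self, branch):
        return list(self.branches[branch])

    def publish(self, branch):
        # Make the staged rows visible to readers of "main".
        self.branches["main"] = list(self.branches[branch])

    def drop(self, branch):
        self.branches.pop(branch, None)


def audit(old_rows, new_rows):
    """Order-insensitive comparison of the two pipelines' outputs."""
    return sorted(old_rows) == sorted(new_rows)


def validate_upgrade(table, source_rows, old_pipeline, new_pipeline):
    # Write: both pipelines write to staging branches, never directly to main.
    table.write("staging_old", old_pipeline(source_rows))
    table.write("staging_new", new_pipeline(source_rows))
    # Audit: the trusted old output acts as the test oracle.
    passed = audit(table.read("staging_old"), table.read("staging_new"))
    # Publish: the new output only if it matched, otherwise the old output.
    table.publish("staging_new" if passed else "staging_old")
    table.drop("staging_old")
    table.drop("staging_new")
    return passed


def old_pipeline(rows):
    return [(user, amount * 2) for user, amount in rows]


def upgraded_pipeline(rows):
    # Same result computed differently, as a migrated job might.
    return [(user, amount + amount) for user, amount in rows]


def broken_upgrade(rows):
    # Simulates a behavior change introduced by the upgrade.
    return [(user, amount * 2) for user, amount in rows if amount > 1]


source = [("a", 1), ("b", 2)]
print(validate_upgrade(Table(), source, old_pipeline, upgraded_pipeline))
print(validate_upgrade(Table(), source, old_pipeline, broken_upgrade))
```

The design choice in this sketch is that production readers never see unaudited data: a mismatch falls back to the old pipeline's output, so a failed upgrade costs compute time rather than correctness.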
Time permitting, I will end with some exciting new (but non-backward-compatible) changes coming in Spark 4 (tentatively scheduled for June, but it's software).
*Not a guarantee; some upgrades may still cause tears.
Related links:
- https://github.com/holdenk/spark-upgrade - the upgrade tool being discussed
- https://scalacenter.github.io/scalafix/ - the code mod tool we used for Scala
- https://pybowler.io/ - the Python code refactoring tool we extended
- https://sqlfluff.com/ - the SQL linting / cleanup tool we extended
Speakers
Holden Karau