SPyQL - SQL with Python in the middle
Making command-line data processing more intuitive, readable and powerful
- Track: Python devroom
- Room: D.python
- Day: Sunday
- Start: 14:30
- End: 15:00
- Video with Q&A: D.python
- Video only: D.python
- Chat: Join the conversation!
SPyQL is SQL with Python in the middle, an open-source project fully written in Python for making command-line data processing more intuitive, readable and powerful. Try mixing in the same pot: a SQL SELECT for providing the structure, Python expressions for defining transformations and conditions, the essence of awk as a data-processing language, and the JSON handling capabilities of jq.
In this event I will describe the SPyQL language, highlighting its unique features. By the end of this presentation you will know how to write SPyQL queries (probably you already know :-) ), and you will be looking forward to starting using it! I will be solving the task of calculating aggregations in awk (for a CSV), in jq (for a JSON) and in SPyQL (for both). I will then show you a couple more examples where we will be using SPyQL 1) to automate a scaling operation of k8s pods, and 2) to continuously calculate statistics from a Kafka data stream.
How does a SPyQL query looks like?
IMPORT pendulum AS p
SELECT
(p.now() - p.from_timestamp(purchase_ts)).in_days() AS days_ago,
sum_agg(price * quantity) AS total
FROM csv
WHERE department.upper() == 'IT' and purchase_ts is not Null
GROUP BY 1
ORDER BY 1
TO json
Simple, readable, and, as all SPyQL programs, it's an 1-liner. In a single statement we are 1) reading a CSV (of purchases) with automatic header detection, dialect detection, type inference and casting, 2) filtering out records that do not belong to the IT department or do not have a purchase timestamp 3) summing the total purchases and grouping by how many days ago they happened, 4) sorting from the most to the least recent day and 5) writing the result in JSON format. All this without loading the dataset into memory.
SPyQL will change data-processing in the terminal, making it accessible to anyone who knows a little-bit of Python and understands the basics of a SQL SELECT. On the other hand, it will give super-powers to experienced users. The possibilities are endless as you can import any Python library, and pipe data from/to any command-line tool. From querying APIs and Kafka, to write to files or databases, SPyQL will be the tool of choice for processing data in the command-line!
Speakers
Daniel Moura |