Interview: Sylvain Lebresne

Sylvain Lebresne will give a talk about "The Apache Cassandra storage engine" at FOSDEM 2012.

Could you briefly introduce yourself?

My name is Sylvain Lebresne. I have a background in programming language theory, but I now somehow spend my days developing a distributed database. I started using and contributing to Apache Cassandra about two years ago, and a year ago I had the luck to be hired by DataStax to devote my full time to the development of Cassandra. I'm a committer and PMC member on the project, as well as its release manager, and I'm having fun.

What will your talk be about, exactly?

I'll probably start with a short introduction to Cassandra. But then I'd like to dive more specifically into its storage layer: how it writes and reads data on disk, why it does it that way, and what consequences this has for performance. I'll try to describe that layer as completely as time permits, including recent developments like leveled compaction, off-heap caching, etc.

What do you hope to accomplish by giving this talk? What do you expect?

If you model your data correctly, Cassandra can achieve quite good performance. But modeling for the best performance requires you to have at least a basic understanding of how the database lays out data internally and its different mechanisms. And more generally, I'm a firm believer that it's unrealistic to hope to get the best out of a database without understanding at least superficially how it works. So an objective of this talk is to try to dispense its fair share of knowledge about Cassandra's inner workings to help people understand what it is good at and to make the most of it.

But even if you don't have any use for a distributed database, I think the design choices made by the Cassandra storage layer are interesting ones and are probably less well-known than more traditional B-tree based designs. Thus it's my hope that sharing those will interest others.

As for my expectation, I obviously expect everyone to start using Cassandra as their database of choice right after my talk.

One of the nice features of Cassandra is the Cassandra Query Language (CQL), which offers an SQL-like syntax for querying the database. Can you explain how it compares with SQL?

CQL can be thought of as the subset of SQL that makes sense for Cassandra, with a few additions and semantic differences that match Cassandra's specifics.

What is left out is what Cassandra cannot support by design, at least not in an efficient way (mostly joins and transactions). What is added is mainly some syntax to allow efficient use of Cassandra's wide rows support.

There are also a few differences in semantics. The canonical example is the INSERT and UPDATE queries. In Cassandra, a write doesn't involve a read, so an INSERT has no way to know that the row did not exist, or an UPDATE to know that it already exists. As a consequence both of these queries are equivalent in CQL and both mean 'insert or update'. But overall CQL differs little from SQL when their syntax coincides and we've made sure that when it does differ, it does it in intuitive and straightforward ways.
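To make the "insert or update" point concrete, here is a minimal toy sketch in Python. It is not Cassandra's code or API: the `ToyTable` class, its methods, and the append-only log with a logical clock are all assumptions made up for illustration. It only shows why, when a write never looks at existing data, INSERT and UPDATE end up meaning the same thing, with reads reconciling columns by most recent write.

```python
class ToyTable:
    """Hypothetical toy model of a write path that never reads before writing."""

    def __init__(self):
        # Append-only list of (timestamp, row key, column name, value).
        self._log = []
        # Logical clock standing in for write timestamps (an assumption).
        self._clock = 0

    def _write(self, key, columns):
        # A write only appends; it never checks whether the row exists.
        self._clock += 1
        for name, value in columns.items():
            self._log.append((self._clock, key, name, value))

    def insert(self, key, columns):
        # Cannot know whether the row already exists, so it just writes.
        self._write(key, columns)

    def update(self, key, columns):
        # Cannot know whether the row is missing, so it just writes too.
        self._write(key, columns)

    def read(self, key):
        # Reconcile at read time: for each column, the latest write wins.
        latest = {}
        for ts, row_key, name, value in self._log:
            if row_key != key:
                continue
            if name not in latest or ts >= latest[name][0]:
                latest[name] = (ts, value)
        return {name: value for name, (_, value) in latest.items()}


table = ToyTable()
table.update("u1", {"name": "Ada"})                   # "update" of a missing row creates it
table.insert("u1", {"name": "Grace", "city": "NYC"})  # "insert" of an existing row overwrites
table.update("u1", {"city": "Paris"})
row = table.read("u1")
```

In this sketch `insert` and `update` are deliberately identical, which is the whole point: the distinction SQL draws between them requires a read that the write path does not perform.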

How many contributors does Cassandra have and how many of them are active developers?

There are about half a dozen active core committers, plus a "long tail" of almost 200 contributors who have each contributed one or more patches to address a specific problem they had.

Which features can we expect to appear in Cassandra this year?

The next major release (Cassandra 1.1) should be out a few weeks after FOSDEM. Some user-visible changes will be the addition of isolation for batch mutations, improvements that make the caches easier to configure, more control over where data files are located (for instance, to allow a mix of SSDs for performance-sensitive data and slower spinning drives for less sensitive data), and some improvements to CQL. Following that, our main focus is on improving ease of use, but we'll continue to improve performance too.