Interview: Lindsay Holmwood

Lindsay Holmwood will give a talk about Flapjack at FOSDEM 2010.

Could you briefly introduce yourself?

I'm Lindsay Holmwood, a sysadmin/developer from Sydney, Australia. I started my career as a sysadmin doing large-scale desktop Linux deployments, eventually moving into web development and custom integration with many legacy systems.

I spent last year on sabbatical backpacking around Europe, building sysadmin tools in my spare time and talking at conferences. This year I'm moving back into the systems administration world, while still working on tools to make sysadmining fun.

What will your talk be about, exactly?

I'll be talking about stepping back and taking a serious look at what monitoring tools are actually trying to accomplish, and how we can reframe the questions to get better visibility of our infrastructure, and scale monitoring infrastructures more effectively.

What do you hope to accomplish by giving this talk? What do you expect?

Hopefully people will leave my talk questioning their assumptions about monitoring tools, and go and create better tools themselves, or contribute to a few of my projects. :-)

What's the history of the Flapjack project? What was your motivation to start it and how did it evolve?

I started conceptualising Flapjack almost two years ago. I was setting up and maintaining several monitoring systems for different clients, and all the systems rubbed me the wrong way.

To me, they all seemed to be making the problem much more complex than it was, either by having horrible configuration processes, terrible web interfaces, or lots of unnecessary moving parts.

Maybe I'm naive, but I think most problems can be solved simply and elegantly if you take a step back and think about what you're trying to achieve.

For about a year before I actually sat down and wrote any code, I had several long debates with friends and fellow sysadmins (and developers!) about the core problems monitoring systems are trying to solve.

I initially wrote a very simple prototype called "Series of tubes" that used Beanstalkd to divide up the monitoring workload. I did lots of performance testing and analysis before settling on a design to evolve into a more concrete prototype.

I built the second prototype while backpacking last year and presented an alpha at Rails Underground, and a beta at Devopsdays in Belgium. I had some amazingly great discussions with attendees and speakers at Devopsdays about Flapjack, which, in my mind, confirmed that I was on the right track.

I've spent the last two months incorporating ideas from Devopsdays into Flapjack, and now you could consider Flapjack to be the infrastructure beneath a fully fledged monitoring system (sort of like how Git could be considered the infrastructure for a version control system :-).

Why would one choose Flapjack over the countless other monitoring systems? What's the unique selling point?

Massive scalability, well-tested code in a high-level language, and clear APIs for hooking into the monitoring lifecycle.

From the benchmarking and performance analysis I've done, Flapjack has proven to scale linearly. Got more checks? Just spin up some more nodes and point them at your Flapjack setup.

This also works in reverse, where you may need to spin up a few hundred EC2 nodes to run some short-running (but monitored) batch job. Spin up new monitoring nodes for the duration of the job, then take them down again.

Flapjack makes this sort of elasticity mind-bendingly easy.

Secondly, Flapjack is written in Ruby and has extremely good test coverage. I am somewhat OCD when it comes to testing code that I write, and get extremely disheartened by the number of sysadmin tools that eschew testing altogether.

I'm trying to lead by example and show how easy it is to write testable software, and thus improve reliability and readability of the code: fewer bugs mean fewer monitoring alerts at 3am!

Lastly, the APIs are probably what excites me the most about Flapjack. There are APIs for writing custom notifiers, deciding whether to notify on a check (filters), storing check data in a database (persistence), and communicating between the different components (transports).

The core of Flapjack is very small, but there are hooks all throughout that make customising Flapjack to your environment (by writing your own customisation or reusing others) very easy.
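To make the idea of filter and notifier hooks concrete, here is a minimal sketch in Python. It is not Flapjack's actual API (Flapjack is written in Ruby); the names `CheckResult`, `is_failing`, `run_notification_cycle` and `record_notifier` are hypothetical and exist only to show how a small core can delegate decisions to pluggable pieces.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class CheckResult:
    # Hypothetical result record; the field names are assumptions.
    check_id: str
    status: str  # "ok", "warning" or "critical"


# A filter decides whether a result should be notified on.
Filter = Callable[[CheckResult], bool]
# A notifier delivers the result somewhere (mail, chat, pager, ...).
Notifier = Callable[[CheckResult], None]


def is_failing(result: CheckResult) -> bool:
    """Example filter: only let non-ok results through."""
    return result.status != "ok"


def run_notification_cycle(
    result: CheckResult, filters: List[Filter], notifiers: List[Notifier]
) -> bool:
    """Call every notifier if, and only if, every filter accepts the result."""
    if not all(accepts(result) for accepts in filters):
        return False
    for notify in notifiers:
        notify(result)
    return True


sent: List[str] = []


def record_notifier(result: CheckResult) -> None:
    """Example notifier: record the message instead of sending it."""
    sent.append(f"{result.check_id} is {result.status}")


run_notification_cycle(CheckResult("web01:http", "critical"), [is_failing], [record_notifier])
run_notification_cycle(CheckResult("web01:ssh", "ok"), [is_failing], [record_notifier])
```

The core function knows nothing about how filtering or delivery works, so swapping in a different filter or notifier means passing a different callable rather than changing the core.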

This is one thing I think the Chef guys have done extremely well, though Puppet is catching up quickly. Having clear and well documented APIs is extremely important for getting people involved in your community.

Flapjack boasts about scalability, but how does it make it possible to monitor 1000 systems as easily as one?

Not to give too much away (you should come to my talk! :-), Flapjack breaks the monitoring lifecycle up into several distinct cycles that are asynchronous and independent of each other.

Scaling up is simply a matter of adding more nodes into the checking cycle. The exact same code is run regardless of whether you're running Flapjack on one box or a hundred.
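The general pattern described here, independent workers pulling checks from a shared queue, can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not Flapjack's implementation: an in-process `queue.Queue` and threads stand in for a real work queue and separate nodes, and `run_checks` with its placeholder "ok" result is hypothetical.

```python
import queue
import threading


def run_checks(check_ids, worker_count):
    """Run every check using `worker_count` identical workers."""
    jobs = queue.Queue()
    results = queue.Queue()

    def worker():
        while True:
            check_id = jobs.get()
            if check_id is None:  # sentinel: no more work for this worker
                break
            # Placeholder check; a real worker would execute a plugin here.
            results.put((check_id, "ok"))

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for thread in threads:
        thread.start()
    for check_id in check_ids:
        jobs.put(check_id)
    for _ in threads:
        jobs.put(None)  # one sentinel per worker
    for thread in threads:
        thread.join()

    collected = []
    while not results.empty():
        collected.append(results.get())
    return sorted(collected)


checks = [f"host{i}:ping" for i in range(1000)]
with_one_worker = run_checks(checks, 1)
with_eight_workers = run_checks(checks, 8)
```

The worker function is the same whether one copy or eight are running; only the number of consumers attached to the queue changes.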

The architecture of Flapjack is built upon loosely coupled components to make it easy to swap them out. Can you give some concrete examples where that could come in handy?

We're not tied to Ruby. Quite a few sysadmins have misgivings about deploying Ruby in their environments. The workers (components that execute checks) could very easily be reimplemented in Python. The notifier would be significantly harder to reimplement but not impossible.
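As a rough sketch of why a worker is easy to reimplement: if jobs and results travel as plain, language-neutral messages, a worker only needs to parse a job, run a command, and serialise the outcome. The JSON layout below (`check_id`, `command`, `retval`, `output`) is a hypothetical format chosen for illustration, not Flapjack's actual wire format.

```python
import json
import subprocess
import sys


def execute_check(job_json: str) -> str:
    """Run the command described by a JSON job and return a JSON result."""
    job = json.loads(job_json)
    completed = subprocess.run(job["command"], capture_output=True, text=True)
    result = {
        "check_id": job["check_id"],
        "retval": completed.returncode,  # exit status of the check command
        "output": completed.stdout.strip(),
    }
    return json.dumps(result)


# Demo job: the "check" is just the current Python interpreter printing OK.
job = json.dumps({"check_id": "demo", "command": [sys.executable, "-c", "print('OK')"]})
reply = json.loads(execute_check(job))
```

Because the worker only touches the message format and the command it runs, the other components do not need to know what language it was written in.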

If someone decides I'm on crack and wants to rewrite a whole Flapjack component, they can do that and still use the rest of the system.

How many developers are working on Flapjack?

Right now I'm the only developer, though I've been helping a few people familiarise themselves with the code.

Serafeim Zanikolas has been doing some excellent work preparing Flapjack for getting into Debian, and Bernd Ahlers and Stephen Nelson-Smith have volunteered to package for OpenBSD & CentOS/RHEL respectively.

What new features will we see in Flapjack in 2010?

I'll be focussing on three things over the next year: building a better web interface, making it easy to deploy, and smoothing off the rough edges.

The web interface is going to get a *lot* of love. I had to throw out the prototype interface because the persistence API completely changed the way we represent and interact with checks. The web interface will most likely be split out into two separate interfaces: a dashboard and an administrative/configuration interface.

I'll be integrating another project I'm working on called Visage into the dashboard, so there will be pretty graphs that make selling Flapjack to your boss much easier. :-)

You can have the best software in the world but if nobody can install it, it doesn't count for Jack. Right now Flapjack is released as a RubyGem, which is completely ghetto and inappropriate for a sysadmin tool. I'll be working with packagers to build packages for distributions.

I actually have a set of tests to verify the packagability and deployability of the project, which are all currently failing. This needs to be fixed, but it's great for testing regressions.

Finally, Flapjack has a lot of rough edges documentation-wise. Based on the existing documentation you could probably stumble through setting up Flapjack, but it wouldn't be an experience to write home about.

I truly admire Django's documentation, and if I can make Flapjack's documentation 1/10th as good as Django's I think I will have succeeded.

Have you enjoyed previous FOSDEM editions?

This is actually the first time I've been to FOSDEM, but everyone I've spoken to who has been has said it's a truly awesome experience. I'm really looking forward to it!

This interview is licensed under a Creative Commons Attribution 2.0 Belgium License.
