Speakers | |
---|---|
Lindsay Holmwood | |
Schedule | |
Day | Saturday |
Room | Janson |
Start time | 16:00 |
End time | 16:45 |
Duration | 00:45 |
Info | |
Event type | Podium |
Track | Monitoring |
Language | English |
Media | |
Video (DIVX) |
Monitoring software is ripe for a renaissance. Now is the time to for building new tools and rethinking our problems. Leading the charge are two projects: cucumber-nagios, and Flapjack.
A systems administrator's role in today's technology landscape has never been so important. It's our responsibility to manage provisioning and maintenance of massive infrastructures, to anticipate ahead of time when capacity must be grown or shrunk, and increasingly, to make sure our applications scale.
While developer tools have improved tremendously, we sysadmins are still living in the dark ages, other than a few shining beacons of hope such as Puppet. We're still trying to make Nagios scale. We're still writing the same old monitoring checks. Getting statistics out of our applications is tedious and difficult, but increasingly important to scaling.
cucumber-nagios lets you describe how a website should work in natural language, and outputs whether it does in the Nagios plugin format. It includes a standard library of website interactions, so you don't have to rewrite the same Nagios checks over and over.
cucumber-nagios can also be used to check SSH logins, filesystem interactions, mail delivery, and Asterisk dialplans. By lowering the barrier of entry to writing fully featured checks, there's no reason not to start testing all of your infrastructure. But as you start adding more checks to your monitoring system you're going to notice slowdowns and reliability problems - enter Flapjack
Flapjack is a scalable and distributed monitoring system. It natively talks the Nagios plugin format (so you can use all your existing Nagios checks), and can easily be scaled from 1 server to 1000.
Flapjack breaks the monitoring lifecycle into several distinct chunks: workers that execute checks, notifiers that notify when checks fail, and an admin interface to manage checks and events.
By breaking the monitoring lifecycle up, it becomes incredibly easy to scale your monitoring system with your infrastructure. Need to monitor more servers? Just add another server to the pool of workers. Need to take down your workers for maintenance? Just spin up another pool, and turn off the old one.