Interview: Anil Madhavapeddy
Could you briefly introduce yourself?
I'm a senior research fellow at Wolfson College in Cambridge, and work at the Computer Laboratory. I'm not the usual academic, as I've spent a chunk of my career in industry, notably at NetApp, the NASA Mars program, and most recently XenSource/Citrix. I'm an OpenBSD developer, but have been slacking in recent years, and plan to spend some time at FOSDEM in the Xen dev room working on OpenBSD Xen support to make up for that! You can read more about me on my homepage.
Why are the traditional UNIX communication mechanisms like sockets, pipes and shared memory not efficient anymore on current hardware?
My group (the Systems Research Group) has been building experimental systems for quite a while, with a theme of enforcing safety and isolation without sacrificing performance. Some, like Xen, became very popular and are now widely deployed. More recently, we're hacking on a reactive exokernel that makes running cloud applications faster and simpler (the Mirage OS), on big data processing engines that are more powerful than MapReduce (the CIEL universal execution engine), and even on reconfigurable hardware CPUs and network interconnects.
All of these systems have one thing in common: they process a lot of data, and must exploit multi-core systems fully in order to achieve high performance. Thus, they do a lot of inter-process communication (IPC) at various levels: across virtual machine boundaries, kernel to user-space, or via shared memory abstractions such as pipes.
It used to be that a skilled programmer could figure out the correct API to use to communicate to a different process. Nowadays, however, with OS virtualisation making software more layered, and with multi-core making hardware more unpredictable, it is almost impossible to select an IPC mechanism that is fit for purpose.
So we hacked on an IPC benchmark that runs at all of the layers of the modern stack: within the hypervisor (to measure IPI performance), within the kernel (futex performance) and purely in userspace (such as a spinning shared memory transport).
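To make the userspace end of this concrete, here is a minimal sketch of one such measurement: the round-trip latency of a pair of POSIX pipes between a parent and a forked child. The function name and iteration count are illustrative, not taken from the actual ipc-bench suite, and Python's interpreter overhead inflates the absolute numbers.

```python
# Minimal sketch of an ipc-bench-style measurement: round-trip latency
# over two POSIX pipes between a parent and a forked child process.
import os
import time

ITERS = 10_000

def pipe_round_trip_ns(iters=ITERS):
    up_r, up_w = os.pipe()       # parent -> child
    down_r, down_w = os.pipe()   # child -> parent

    pid = os.fork()
    if pid == 0:                 # child: echo each byte straight back
        for _ in range(iters):
            b = os.read(up_r, 1)
            os.write(down_w, b)
        os._exit(0)

    start = time.perf_counter_ns()
    for _ in range(iters):
        os.write(up_w, b"x")
        os.read(down_r, 1)
    elapsed = time.perf_counter_ns() - start
    os.waitpid(pid, 0)
    return elapsed / iters       # average ns per round trip

if __name__ == "__main__":
    print(f"pipe round-trip: {pipe_round_trip_ns():.0f} ns")
```

Each round trip crosses the kernel four times (two writes, two reads), which is exactly the sort of cost that shifts once a hypervisor is layered underneath.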
What effect does a virtual environment have on the performance of the traditional UNIX communication mechanisms?
Quite a dramatic effect, some due to the limitations of hardware, and others due to the Xen domain scheduler needing to be a little smarter (something that various developers at Citrix are furiously hacking on as we speak!).
The biggest problem is that the combination of VM scheduling (in Xen), with process scheduling (in the guest VM) makes most operations much more latent than when running on native. Similarly, 64-bit VMs must jump through the hypervisor *and* kernel when performing a system call, due to the lack of segmentation protection (which is what is used in 32-bit Xen to protect the hypervisor from the guest kernel and userspace).
The result is that some operations which have a certain performance/safety tradeoff when running natively are much more skewed when virtualised. Thus, the choice of IPC mechanism also changes accordingly when virtualised.
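The extra trap cost is easy to glimpse with a "null" system call microbenchmark, timed once natively and once inside a 64-bit Xen guest. This is only a rough sketch: the function name is made up, interpreter overhead dominates the absolute figure, and some libc versions cache getpid(), so treat the number as an upper bound rather than a syscall cost.

```python
# Rough sketch: time a near-trivial system call (getpid) in a tight loop.
# Run natively and inside a 64-bit guest, the delta hints at the extra
# hypervisor transition each trap must make.  Caveat: interpreter overhead
# dominates, and some libc versions cache getpid() in userspace.
import os
import time

def getpid_ns(iters=100_000):
    start = time.perf_counter_ns()
    for _ in range(iters):
        os.getpid()                      # one (probable) syscall per loop
    return (time.perf_counter_ns() - start) / iters

if __name__ == "__main__":
    print(f"getpid: {getpid_ns():.0f} ns/call")
```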
What will your talk be about, exactly?
My talk will introduce some of the most common IPC mechanisms that are used in the wild today: the familiar POSIX ones, a futex-based shared memory transport, and also the Xen virtual device model (which is what you use when spawning an Amazon EC2 virtual machine with EBS storage, for example). I hope that a technically-savvy audience will come out of the talk with more knowledge about how communication works in a modern OS/hypervisor.
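As a flavour of the non-POSIX end of that spectrum, here is a hedged sketch of a "spinning" shared-memory transport: two processes ping-pong a turn flag in an anonymous shared mapping, busy-waiting instead of blocking in the kernel. The real ipc-bench transports are written in C; this Python version, with invented names, only illustrates the structure.

```python
# Sketch of a spinning shared-memory transport: parent and child busy-wait
# on a single turn byte in an anonymous shared mmap, never entering the
# kernel on the data path (illustrative only; real transports are in C).
import mmap
import os
import time

ITERS = 2_000

def spin_round_trip_ns(iters=ITERS):
    shm = mmap.mmap(-1, 1)       # anonymous shared mapping, one turn byte
    shm[0] = 0                   # 0: parent's turn, 1: child's turn

    pid = os.fork()
    if pid == 0:                 # child: wait for the turn, hand it back
        for _ in range(iters):
            while shm[0] != 1:
                pass
            shm[0] = 0
        os._exit(0)

    start = time.perf_counter_ns()
    for _ in range(iters):
        shm[0] = 1               # give the child the turn
        while shm[0] != 0:       # spin until it comes back
            pass
    elapsed = time.perf_counter_ns() - start
    os.waitpid(pid, 0)
    return elapsed / iters

if __name__ == "__main__":
    print(f"spin round-trip: {spin_round_trip_ns():.0f} ns")
```

Spinning burns a core but avoids the scheduler entirely, which is why its relative ranking against futexes and pipes moves around so much between native and virtualised runs.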
What do you hope to accomplish by giving this talk? What do you expect?
I'm really frustrated by the lack of an open, systematic approach to gathering benchmark data across the years. It would be really useful when developing some of these systems to be able to examine performance results across a variety of machines, and also across time (e.g. from before multicore was widely available, when SMP was how CPU parallelism was done).
In OpenBSD, there is a very useful 'email@example.com' list where system messages have been sent in by users for over 15 years. Developers can just look at the dmesg mailbox and determine how popular a bit of hardware is.
So...we're trying a little experiment that is similar. Our ipc-bench suite is open source, and available on GitHub. We're making it so that a portion of it can be run as a self-test: it gathers system information and commits the result to GitHub. The idea is that every user can upload the results of their self-test to their own GitHub branch, and that we can merge them all into one file-system database of performance results.
There are some interesting challenges here, such as how we version the results so that past results aren't completely useless when we modify the benchmark suite (via some shared-library-style major/minor/epoch perhaps).
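To make the versioning idea concrete, here is a toy illustration of a shared-library-style rule. The scheme and names are hypothetical, not what ipc-bench adopted: bump the major version on incompatible changes to the benchmark, and the minor version when fields are only added, so results stay mergeable within a major version.

```python
# Toy shared-library-style compatibility rule for benchmark results
# (hypothetical scheme, not the one the ipc-bench authors settled on):
# results are mergeable iff their major versions match; a minor bump
# means fields were only added.
def results_comparable(a, b):
    """Two (major, minor) result versions can be merged iff majors match."""
    return a[0] == b[0]

print(results_comparable((1, 0), (1, 3)))   # True: minor bump only
print(results_comparable((1, 2), (2, 0)))   # False: incompatible majors
```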
This is early days, and highly experimental, but we hope to come up with something that doesn't depend on a single group maintaining it (via our use of GitHub, anyone can combine all the pull requests, not just us), and that is also loosely coupled. We also have to be very careful not to gather any personally identifiable data in these results, such as hostnames! We discussed some of these issues briefly in a USENIX short paper about ipc-bench. It turns out that there were a massive number of submissions this year, so this little one is unlikely to get into the final program, but we think it's interesting anyway! :-)
How can people contribute to your research with ipc-bench and what results do you expect?
We're working away at packaging it up in time for FOSDEM, so I hope that people can run it, and also contribute to the suite of shared library techniques it implements. We've started gathering results on our group page for ipc-bench.
OS portability is also an outstanding task. It is a Linux-centric suite now, but we would (of course) like *BSD support. This requires figuring out exactly what synchronisation mechanisms are available (futexes are very useful, but also non-portable), and also the impact of subtle semantic differences between them. Patches are always welcome!
Can you explain at a high level how the FABLE service for automatically reconfiguring communication mechanisms works?
FABLE is still early days: I put it in my talk abstract as a teaser to the audience to find out if there's any demand for it (and judging from the number of questions I've had since it went public, the answer is yes!).
The idea is to add first-class support for reconfigurable I/O to UNIX-like systems, with an API that is more suited to high-bandwidth data communication than the socket API. There's a draft paper available that describes it in more detail, which we will present at the RESoLVE workshop at ASPLOS in London later in March.
The reconfiguration process is deliberately asynchronous, and decoupled from the main connection setup path. The reason for this is that it isn't obvious from a single path what the optimal data communication mechanism is: it depends on the end-to-end path (e.g. pages coming in from disk and being proxied directly to a network card via DMA), and also on resource pressures on the overall system (if memory bandwidth is at a premium, it may be better to switch to page-flipping rather than copying pages, as Xen defaults to now for virtual I/O).
As for policy: we don't care at this stage. It's a userspace daemon, and so anything can go there. I imagine it will be hooked up to various management stacks which can decide which consumers get the best access to resources.
Have you enjoyed previous FOSDEM editions?
Very much so. I really enjoyed reading some of the other speaker interviews, and am particularly looking forward to hearing Bdale Garbee speak about the FreedomBox initiative!
This is my first physical attendance at the conference itself. If there are any OCaml enthusiasts out there, I believe a few of us may get together for an informal beer BoF. Drop me an email if you want to attend!
This interview is licensed under a Creative Commons Attribution 2.0 Belgium License.
Sun, 01/29/2012 - 18:50