Brussels / 30 & 31 January 2016


Tackling non-determinism in Hadoop

Testing and debugging distributed systems with Earthquake

Developing and maintaining distributed systems like Hadoop is difficult. Many factors contribute, but we believe one of the most important is the lack of a good debugger for bugs specific to distributed systems (e.g., non-deterministic hardware faults and message ordering).

In this talk, we will present Earthquake, our open-source debugging framework for distributed systems. Earthquake permutes Ethernet packets, filesystem events, Java/C function calls, and injected faults in various orders so as to control non-determinism in the cluster. By default, Earthquake permutes events in random order, but users can write their own state exploration policy (in Go) to find deep bugs efficiently. Earthquake also controls non-determinism in thread interleaving by calling sched_setattr(2) with randomized parameters.
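
To give a rough idea of what a random exploration policy might look like, here is a minimal sketch in Go. It is not Earthquake's actual API: the names Event, Policy, RandomPolicy and NewRandomPolicy are hypothetical and chosen only for illustration. The sketch assumes the framework holds intercepted events in a pending list and asks the policy in which order to release them.

```go
package main

import (
	"fmt"
	"math/rand"
)

// Event is a hypothetical stand-in for an intercepted action
// (a packet, a filesystem event, a function call, or an injected fault).
type Event struct {
	Node string
	Kind string
}

// Policy decides the order in which held events are released.
// This interface is illustrative only; it is not Earthquake's API.
type Policy interface {
	Order(pending []Event) []Event
}

// RandomPolicy releases pending events in a seeded random order.
type RandomPolicy struct {
	rng *rand.Rand
}

// NewRandomPolicy builds a policy from a seed, so that an ordering
// which triggered a bug can be replayed with the same seed.
func NewRandomPolicy(seed int64) *RandomPolicy {
	return &RandomPolicy{rng: rand.New(rand.NewSource(seed))}
}

// Order returns a shuffled copy of pending and leaves the input untouched.
func (p *RandomPolicy) Order(pending []Event) []Event {
	out := make([]Event, len(pending))
	copy(out, pending)
	p.rng.Shuffle(len(out), func(i, j int) { out[i], out[j] = out[j], out[i] })
	return out
}

func main() {
	pending := []Event{
		{Node: "zk1", Kind: "packet"},
		{Node: "zk2", Kind: "fs-event"},
		{Node: "zk3", Kind: "fault"},
	}
	var policy Policy = NewRandomPolicy(42)
	for _, ev := range policy.Order(pending) {
		fmt.Printf("release %s on %s\n", ev.Kind, ev.Node)
	}
}
```

A smarter policy could implement the same hypothetical interface but pick the order based on the system states seen so far, rather than purely at random.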

We will also share our success stories from testing some Hadoop components with Earthquake. For ZooKeeper, we found a distributed race condition bug that decreases the availability of a ZooKeeper cluster. We also reproduced a known ZooKeeper bug that no one had successfully reproduced for two years, and analyzed its cause. For YARN, we found a disk-fault tolerance bug that inappropriately marks a faulty node as healthy. We also found bugs in non-Hadoop software, such as etcd.

With Earthquake, you can also test your real distributed systems without any modification.


