Chaotic testing

Posted by user on 22 Mar 2015

The stabilization algorithm

quasardb is a distributed, peer-to-peer database based on the Chord algorithm. It is actually much more than that, but for the purpose of this post we will focus on the Chord algorithm and one fundamental process: stabilization.

The Chord algorithm organizes the nodes in a giant ring ordered by node id (for more information, see Distributed Hash Tables).

In quasardb, this id is 256 bits long and guaranteed to be unique within the ring. In other words, you cannot have two nodes with the same id on the same ring.
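To make the ring structure concrete, here is a minimal sketch of how nodes and keys map onto a Chord-style ring. The id derivation and helper names are illustrative only, not quasardb's actual scheme.

```python
# Minimal sketch of a Chord-style ring (illustration only, not quasardb code).
import hashlib

def node_id(name):
    """Derive a 256-bit id from a name (a made-up scheme for this example)."""
    return int.from_bytes(hashlib.sha256(name.encode()).digest(), "big")

def successor(key, ring_ids):
    """Return the id of the node responsible for `key`: the first node
    clockwise on the ring whose id is >= key, wrapping around at the end."""
    for nid in sorted(ring_ids):
        if nid >= key:
            return nid
    return min(ring_ids)  # wrap around the ring

ids = [node_id("node-%d" % i) for i in range(3)]
print(hex(successor(node_id("some-entry"), ids)))
```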

Stabilization is the process during which nodes exchange information to make sure they are properly ordered. They will also update information about their respective locations to make lookups work.

Another important step of the stabilization process is to ensure the nodes are actually running. For example, if a node's predecessor is unavailable, this changes the range of ids the node must manage.
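For readers who have not seen it, the core of the stabilization step from the original Chord paper fits in a few lines. The sketch below is a condensed, single-threaded illustration; the method names (ping, notify) are placeholders, not quasardb's API, and the predecessor liveness check reflects the point made above.

```python
# Condensed sketch of Chord stabilization (illustrative, not quasardb's code).

def in_range(x, a, b):
    """True if x lies strictly between a and b on the ring, with wrap-around."""
    return (a < x < b) if a < b else (x > a or x < b)

class Node:
    def __init__(self, nid):
        self.id = nid
        self.successor = self
        self.predecessor = None

    def ping(self):
        """Liveness check; a real implementation would be an RPC with a timeout."""
        return True

    def stabilize(self):
        # Ask our successor who it believes its predecessor is; if that node
        # sits between us and our successor on the ring, adopt it as successor.
        x = self.successor.predecessor
        if x is not None and in_range(x.id, self.id, self.successor.id):
            self.successor = x
        self.successor.notify(self)

    def notify(self, candidate):
        # Adopt `candidate` as predecessor if we have none, if ours is no
        # longer reachable, or if the candidate is closer to us on the ring.
        if (self.predecessor is None
                or not self.predecessor.ping()
                or in_range(candidate.id, self.predecessor.id, self.id)):
            self.predecessor = candidate
```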

Now comes the tricky question: how do you automatically test the correctness of your stabilization implementation?

A simple test

For this, we have the on/off test. The on/off test, one of the oldest we have, is all about turning nodes on and off and seeing whether the ring can stabilize within a short time.

Running a sequence of the on/off test takes less than thirty (30) seconds for a three-node cluster.
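In pseudo-Python, the idea looks roughly like this. The start_node, stop_node and ring_is_stable helpers are hypothetical hooks into a test harness, stubbed out here so the sketch is self-contained.

```python
# Sketch of the on/off test: bounce each node and check that the ring
# re-stabilizes within a deadline. The three helpers are hypothetical stubs.
import time

def start_node(node):        # placeholder: launch the daemon for `node`
    pass

def stop_node(node):         # placeholder: shut the daemon down
    pass

def ring_is_stable(nodes):   # placeholder: query each node and compare views
    return True

def on_off_test(nodes, deadline_s=30.0):
    for node in nodes:
        stop_node(node)
        start_node(node)
        deadline = time.monotonic() + deadline_s
        while not ring_is_stable(nodes):
            if time.monotonic() > deadline:
                raise AssertionError("ring did not restabilize after bouncing %s" % node)
            time.sleep(0.5)

on_off_test(["node-0", "node-1", "node-2"])
```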

You might therefore conclude that as long as you pass this test, your stabilization code is correct.

Implementation challenges

We know the algorithm is theoretically sound, meaning that if stabilization doesn't converge, you know something is wrong. That makes writing a working implementation relatively easy.

The hard part is properly managing errors.

How does your cluster handle errors and recover from them? What should it do?

Exhaustivity

The stabilization algorithm isn't just an implementation of the logic described in research papers: it has subtle interactions with the network code of quasardb, which is heavily multithreaded and asynchronous.

Not only that, but even assuming we could totally abstract ourselves from the network code (which would probably carry a high performance cost and might not even be desirable to begin with), what a stabilization iteration will do depends on the state of the remote nodes (up, down, unstable, fresh connection, reset connection, stable but with an unstable predecessor, etc.).

One quickly realizes that even with only two nodes, the number of possibilities is extremely high.

Chaos

By inserting randomness into the on/off test, we can explore a large number of cases with minimal effort simply by repeating the test (this approach also works for hunting down race conditions in other tests).
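A randomized variant of the earlier on/off sketch could look like the following. It reuses the hypothetical start_node, stop_node and ring_is_stable hooks from above; the key addition is a logged seed so that any failing run can be replayed.

```python
# Randomized on/off test: pick victims and delays at random so that repeated
# runs explore different interleavings. Reuses the hypothetical hooks above.
import random
import time

def chaotic_on_off_test(nodes, iterations=50, seed=None, deadline_s=30.0):
    seed = seed if seed is not None else random.randrange(2**32)
    rng = random.Random(seed)
    print("chaos seed:", seed)  # log the seed so a failure can be replayed
    for _ in range(iterations):
        victim = rng.choice(nodes)
        stop_node(victim)
        time.sleep(rng.uniform(0.0, 2.0))  # let the ring react for a random while
        start_node(victim)
        deadline = time.monotonic() + deadline_s
        while not ring_is_stable(nodes):
            if time.monotonic() > deadline:
                raise AssertionError("failed to stabilize (seed=%d)" % seed)
            time.sleep(0.5)
    return seed
```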

The concept is not unlike Netflix's Chaos Monkey, except that it is a single program running virtual instances.

Writing a Python script (or one in whatever scripting language you fancy) to run the test and capture its output in case of failure is trivial, whereas trying to cover all possible cases with a long, hand-written set of scenarios would have been much more time intensive.
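Such a driver might look like the few lines below. The test command and log file names are made up for the example; the only real requirement is that a failing run's output (including its seed) is kept.

```python
# Trivial driver: run the randomized test repeatedly as a subprocess and keep
# the output only when a run fails. Command and file names are hypothetical.
import subprocess
import time

def chaos_driver(runs=1000):
    for i in range(runs):
        proc = subprocess.run(["python", "chaotic_on_off_test.py"],
                              capture_output=True, text=True)
        if proc.returncode != 0:
            log_name = "failure-%d-%d.log" % (i, int(time.time()))
            with open(log_name, "w") as f:
                f.write(proc.stdout)
                f.write(proc.stderr)
            print("run %d failed, output saved to %s" % (i, log_name))

chaos_driver()
```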

Limits of the approach

The limit of this approach is that if you rely exclusively on chaotic tests, you are certain to miss bugs, as you are not guaranteed to hit the "interesting" scenarios.

Additionally, chaotic testing takes more time than regular tests, as you need to run the test a large number of times for the results to be statistically significant.

To mitigate that issue, whenever a randomized run finds a bug, you can freeze the scenario into a specific, dedicated test that becomes part of your regression suite.
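With a seeded randomized test like the sketch above, freezing a scenario is as simple as pinning the seed. The seed value and test name below are placeholders.

```python
# Freeze a failing chaotic run into a deterministic regression test by pinning
# the seed it logged. Seed and test name are placeholders for the example.
def test_regression_stabilization_with_known_bad_seed():
    chaotic_on_off_test(["node-0", "node-1", "node-2"],
                        iterations=50, seed=1234567890)
```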

Does this approach give you ideas? Share your thoughts!

Topics: quasardb, testing, Uncategorized