The Butterfly Bug

At Bureau 14 we highly value code quality: it's the cornerstone of our software development strategy and certainly one of our core values as a company. We believe that investing in code quality pays off in the long term, especially when our customers entrust our software with mission-critical data.

However, our customers' trust isn't the original reason we spend so much time and effort on quality. The main reason is that we don't have a choice.

What we do

Our software, codenamed wrpme, is a highly multithreaded network data server. Under load, a wrpme instance easily runs dozens of threads, handles thousands of network connections and simultaneous client requests, allocates and frees gigabytes of memory and manages terabytes of data on disk.

Its peer-to-peer architecture means that all of this must be done while communicating and synchronizing with an arbitrary number of other instances within the cluster.

Add bespoke memory management and low-level I/O routines on top of this, and you realize the potential for bugs is unlimited.

Tales from the kernel world

My years as a kernel developer have taught me one thing: when you're the foundation, no bug is unimportant.

Once you have enough years under your belt, you will avoid the most obvious errors that lead to memory corruption, race conditions, deadlocks and friends.

However, one day you will meet the butterfly bug: a tiny, seemingly insignificant modification that is easily overlooked or dismissed because "we don't have time for this" or "this is fine, move on", and that results, billions of cycles later, in a catastrophic error. You will spend days tracking down the bug until you realize the error is not where you thought it was. An "Aha!" moment with the bitter taste of terror, as you understand that everything is much harder than you originally thought.

A couple of years ago, I was working on a cryptographic filter driver for the NT file system stack. A few days before the release date, QA reported a bug in the latest version of the driver: if you installed Oracle while defragmenting the disk and running a couple of other tasks, you would end up with a blue screen of death. Not very good!

Fortunately, the set of changes was fairly limited - the product was already quite mature - so a quick sweep helped track down the change that caused the bug check (the technical term for a blue screen of death).

And you know what? The code that caused the error was absolutely bug free. It was actually better than the original! For confidentiality reasons, I cannot share the original and modified source code, but I can explain what I did.

There were a couple of lines that logged debug information and allocated memory for this purpose only. Since I was uncomfortable with memory allocation in the critical path for the sake of debugging, I decided to change the code to use the stack instead. I also took advantage of the modification to switch to safer API calls, to make sure no buffer overrun could occur (the old code used "less safe" calls and therefore had a potential for buffer overflows).
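
Since the driver code is confidential, the following is only a hypothetical user-mode C++ sketch of the kind of change described above. The function names, the message format and the 256-byte buffer size are all assumptions made for illustration; the real code used kernel APIs.

```cpp
#include <cstddef>
#include <cstdio>
#include <cstdlib>
#include <string>

// Hypothetical "before": the debug message is built in a heap buffer with an
// unbounded formatting call. A long enough name would overrun the buffer.
std::string format_debug_heap(const char* name, int status)
{
    const std::size_t size = 256;
    char* buffer = static_cast<char*>(std::malloc(size));
    if (buffer == nullptr)
        return std::string();
    std::sprintf(buffer, "file=%s status=%d", name, status); // no bound check
    std::string message(buffer);
    std::free(buffer);
    return message;
}

// Hypothetical "after": no allocation and a bounded formatting call, but the
// buffer now takes 256 bytes of the calling thread's stack during the call.
std::string format_debug_stack(const char* name, int status)
{
    char buffer[256];
    const int written =
        std::snprintf(buffer, sizeof buffer, "file=%s status=%d", name, status);
    if (written < 0)
        return std::string();
    // snprintf truncates and null-terminates when the message is too long.
    return std::string(buffer);
}
```

Looked at in isolation, the second version has no allocation failure path and cannot overrun its buffer. What it does not show is how much stack the callers above it have already consumed.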

From a pure software engineering point of view, the new code was better. At least, it was when you looked at it in one dimension. However, in kernel development - and in multithreaded development as well - you need to think in more than one dimension.

And guess what? In four dimensions, the new code was terrible. On Windows NT, the x86 kernel-mode stack is only 12 KiB per thread. If your driver stack is deep enough, 12 KiB gets used up very quickly, and I was adding to the problem with my liberal use of the stack.

"Come on, it's only a couple of hundred bytes, what harm could it do?" Well my friend, what it can do is a stack overflow in kernel mode, something our customers will be delighted with...

That's what I call a butterfly bug. Thanks to thorough QA, the bug was caught in time, but its devious nature could very well have resulted in problems that are very hard to reproduce.

How this affects us

With wrpme we can easily have similar issues, especially due to our multithreaded code, asynchronous I/O, lock-free structures and transactional memory. We're building a weapon of mass destruction that only wants to do one thing: blow up in our face. And when we think the problem isn't hard enough, we make sure we add new features that will make debugging even harder (Power up the globe!).

Fixing a customer bug can be extremely expensive, and I'm not talking about the reputation cost, I'm talking about the time it might take to find the root cause. Reproducing the bug is generally a challenge in itself.

So how do we avoid this? It's simple: if bugs are hard to track down and fix, do everything you can to avoid them! Among other things, we do the following:

We make the compiler work as much as possible (with static assertions and other metaprogramming techniques) to reach the "If it compiles, there are no obvious errors" point. C++11 helps a lot in that respect.
We use STL algorithms intensively and favor a functional approach to avoid ill-constructed loops (there is a relatively low occurrence of the “for” and “while” keywords in our code).
All inputs and outputs go through a Boost.Fusion/Boost.Spirit stack. This prevents buffer overflows and invalid format string errors and yields faster code (at the cost of increased compilation time).
We build our software on three different platforms (FreeBSD, Linux, Windows) using three different compilers (Clang, gcc and MSVC) and run intensive tests on all of them to track down heisenbugs.
We make intensive use of static and dynamic analysis tools.
We regularly review the code and the architecture, question our assumptions and make sure nothing is sacred. We also thoroughly review the third-party libraries we use.
And most importantly: no error or bug is ever dismissed.
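
The first two points can be illustrated with a minimal sketch. It assumes a hypothetical fixed-layout record header; the `record_header` type, its fields and both function names are invented for this example and are not taken from wrpme.

```cpp
#include <algorithm>
#include <cstdint>
#include <numeric>
#include <type_traits>
#include <vector>

// Hypothetical fixed-layout header, as it might be written to disk.
struct record_header
{
    std::uint32_t magic;
    std::uint32_t payload_size;
};

// Compile-time checks: the build fails if a layout assumption is broken,
// instead of the program misbehaving at run time.
static_assert(sizeof(record_header) == 8,
              "record_header must be exactly 8 bytes");
static_assert(std::is_standard_layout<record_header>::value,
              "record_header must have a standard layout");

// Sum of all payload sizes, without a hand-written loop.
std::uint64_t total_payload(const std::vector<record_header>& headers)
{
    return std::accumulate(
        headers.begin(), headers.end(), std::uint64_t{0},
        [](std::uint64_t sum, const record_header& h) {
            return sum + h.payload_size;
        });
}

// True if every header carries the expected magic value.
bool all_valid(const std::vector<record_header>& headers,
               std::uint32_t expected_magic)
{
    return std::all_of(
        headers.begin(), headers.end(),
        [expected_magic](const record_header& h) {
            return h.magic == expected_magic;
        });
}
```

With the algorithms there is no index to get wrong and no loop bound to mistype, and the static assertions turn a silent layout change into a compilation error.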

Sounds great? Not to us: we know we can do much better, especially in the areas of fuzzing and testing.

An obsolete development model?

In a way, our approach can be seen as archaic, especially in a world of web development and "I do everything for you" frameworks. Indeed, we have a very high cost per line of code, and our development is probably an order of magnitude slower than the competition's. When I talk with RoR developers about how much time it takes to add a feature, they think I run a large company crumbling under the weight of bureaucracy.

To this I invariably answer: going fast is less important than going in the right direction.

From a philosophical point of view, I think it's better to have a few features that actually work than many features that don't (deliver on your promises). And most of all, when working on quality, you have to answer the question: who will use this, and for what purpose?

The world of NoSQL moves fast, with new products appearing and disappearing every day, each with impressive (advertised) capabilities. The temptation to compromise on quality in order to "catch up" is great.

And then we remember that our customers rely on our software for critical things such as computing the value at risk and other satanic financial operations. They care about two features: reliability and speed.

We believe code quality is the right way to deliver these features.