Oracles and Mutation Testing: Adding Bugs to Code for Fun and Profit

Once you understand the concept of code coverage and why it might be (probably is!) helpful, one of the objections to code coverage starts to become more prominent than the others. This objection is also naturally brought to mind by the habitual use of fuzzers like afl. Namely, you can run code all day long, and put it through the paces of a huge number of execution paths, even, but the bugs you find are always limited by your very notion of “what is a bug?”

Oracles: What’s a Bug, Anyway?

Unless you do some additional work, the only bugs you can find in a program will often be those whose symptom is that the program crashes. For instance, this is the only behavior (other than a timeout) that the afl fuzzer considers to be a bug; to see what problems afl found, you look in a directory called crashes and the afl heads-up display tells you about bugs found via the “uniq crashes” “total crashes” and “last uniq crash” fields:

Programs do bad things other than crash, of course. If the only thing we cared about with respect to program correctness was “don’t crash, buddy” then we could just use “Hello world” as our only program; it doesn’t crash, so it must be correct for all purposes. This, of course, is nonsense. We want more than just not crashing. Now, one (good!) approach to finding more bugs, by broadening your notion of “bug,” is to simply make the Software Under Test (SUT) crash more! Adding assert statements is one well-established and very useful way to do this. Some guidelines for writing highly-critical code suggest having an assert every ten lines of code. An assert just checks that something that ought to be true at runtime is true, and crashes the program if it isn’t. Another way to “get more crashes” is to compile a program with a sanitizer. For example when using clang, adding -fsanitize=address will make your program crash if it performs a variety of unsafe memory operations that might otherwise not cause a problem, or at least not an obvious problem leading to a crash.

Assertions can assert anything, so they are a pretty general method here. However, there are some more specific kinds of specification, such as checking that a program, given an input, behaves equivalently to, or at least “close enough” to, a reference implementation. This is known as differential testing (you check that the “diff” with a known-good implementation of the same functionality is small enough). You can also check that the log of program operations satisfied certain constraints. Or you can require that the program complete running within a certain time limit, or produce certain output by a certain time. The possibilities are many, as wide as your notion of what a program ought to do, that can actually be checked.

The general term for a way to decide if a particular execution of a program is “good” or “bad” (if a test fails or passes, for example) is oracle. In the ancient world, oracles were places where the God or gods gave (often cryptic) privileged information about the universe. The Oracle at Delphi (associated with Apollo, the god of truth) is the best known of these, and the picture above shows a supplicant requesting some information (probably not whether his test passed) from the high priestess. Anyway, the term more generally means a high-quality, wise, insightful source of counsel. So, in testing, we use “oracle” to describe a source of insight into whether a program run did what it was supposed to do.

One point about oracles is that a bug in an oracle can be fairly disastrous, and while your oracles may not outright try to trick you, they can be almost as malicious to the unwary as the ancient ones. Herodotus relates, in book one of his Histories, that:

“the answers of both the Oracles agreed in one, declaring to Croesus that if he should march against the Persians he should destroy a great empire”

(Herodotus, The Histories, trans. G. C. Macaulay)

And… he did: his own. Similarly, if you set up an oracle that first checks each test for some precondition that controls the validity of the test with respect to some specification, then, if the precondition holds, checks a property of the test run, you may get a bad answer. You ask “does A imply B?” and the oracles says “Yes, oh Croesus the test engineer, A implies B.” You did not ask “Did you actually ever see a test satisfying A?” You do the math (false => anything).

For more on the fundamentals of oracles, grad students should read this under-appreciated paper (G-REQ).

Coverage and Oracles

This brings us back to the most important objection to code coverage as a measure of “test goodness.” You could run a program and get 100% code coverage, or even branch or path coverage, but not actually check anything about the run (other than maybe that it didn’t crash) and your tests might be next to worthless. They wouldn’t be totally worthless: not crashing isn’t everything, but it’s a pretty good starting point, of course!

But, let’s say you have something like a file system, and you’ve implemented it such that rename actually just deletes any directory you apply it to (it works for files). Let’s also say this doesn’t corrupt the file system state in any way… So, you run a billion CPU-hours of testing, and there are no crashes, and the code coverage is “everything was covered six kajillion times.” Of course, you still have that little bitty bug, since you didn’t actually check that rename does a rename. Fundamentally, traditional coverage only says “you made this software do something” not “you checked in any way that the something was right.”

Mutation Testing to the Rescue

The basic idea of mutation testing is very simple:

  • Add a bunch of (“fake”) bugs to the code
  • Grade the testing effort on how many of those bugs you find

We call the “grade” (usually measured as a percent of “good” mutants) the mutation score. If your test suite has a mutation score of 30%, it detected fewer than one in three inserted bugs. That’s… probably pretty bad. Notice that the file system testing example above is probably easily handled this way; chances are since the tests check nothing about the results of file system operations, lots of weird/stupid changes to rename (such as immediately returning and doing nothing) will get through the tests without a crash, and the mutation score will be terrible.

This idea of mutation testing has the nice advantage that it (almost) measures what we really want to know about tests: are they good at finding bugs? The only caveat is that these are algorithmically inserted fake bugs, not real bugs, and there is a long-standing worry that maybe they aren’t like real bugs, and in particular that they may be easier to find than real bugs. That’s still a concern, but to some extent, not one you need to worry about until your testing has a good mutation score!

How does mutation testing introduce fake bugs? By making small changes to a program. For example, if a line of code is

if (number_of_connections == 0) {

then mutation testing will make new versions of the code with the line changed to, among other things:

if (number_of_connections != 0) {

if (number_of_connections < 0) {

if (number_of_connections > 0) {

if (number_of_connections <= 0) {

It’s that simple! Another common and very useful mutant is just to delete code. So if you have a line of code:

process_remaining_requests(& request_queue);

then a tool might turn that into:

/*process_remaining_requests(& request_queue);*/

If your tests let you get away with not doing anything where you do something in the real code, maybe that’s harmless (could be logging, or some sanity check that never fails), but most of the time it’s a pretty bad sign!

Tools for Mutation Testing

There are a lot of tools for mutation testing, which has become much more popular recently. Here’s a blog post on using DeepState in a mutation testing context: (REQ) with one well-supported tool. That tool is the universalmutator, which supports a wide variety of languages. Read the universalmutator README (REQ) for basic information on how to install and run the tool. Then, really, the best thing to do is just try it out, and see how well some testing you are doing catches mutants!

A quick summary of using the mutator is:

  • install the tool using pip
  • mutate code using something like: mutate <sourcefile> –mutantDir <directory to stick the mutants in>
  • if the language is C or C++, you probably want to add something like –cmd “make clean; make” to make sure the code still builds with the mutant in place
  • find out about your mutation score by “analyzing” the mutants: analyze_mutants <sourcefile> “<command to run your tests in bash> –mutantDir <directory with mutants> –timeout N –verbose
  • the command line to run tests should be such that it has a zero exit code if all goes well, and a non-zero exit code if something goes wrong
  • set the timeout so the system has time to run whatever testing you are using to completion, but short enough that it won’t spend forever if (as often happens) a mutant introduces an infinite loop

Published by Alex Groce

I'm an associate professor in the School of informatics, Computing, and Cyber Systems at Northern Arizona University. My research interests are primarily in software testing and analysis. I used to work at the Jet Propulsion Laboratory in Pasadena, CA, including on the Curiosity Rover, and I've found a lot of bugs in real software systems over the years. You can find out more about what I do through my website ( or my GitHub profile ( My recent work has resulted in a DSL and (I hope!) usable and powerful testing tool for Python, the TSTL system,, as well as contributions to the DeepState C/C++ unit testing interface to symbolic execution tools and fuzzers such as AFL and libFuzzer,

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

Create your website with
Get started
%d bloggers like this: