Production code does not test the tests

One thing has bothered me about the asymmetry of the relation between production code and test code. People often say: “Your tests test the production code, and the production code ensures the tests are not broken”.

But it did not feel right, especially after learning the principle “as the tests get more specific, the code becomes more generic” used in test-driven development. The production code is generic enough to accept many variations of test code, and not all of them would be correct.

We need to be sure that, when we change code, we do not introduce unnoticed, undesirable changes in its behavior. That is important when refactoring production code, and also when deliberately modifying its behavior. It is just as important when refactoring or otherwise transforming the tests.

We cannot safely rely on the production code to pin down requirements on the tests while we refactor the test code. But we do need to modify the tests from time to time. So how can we do that without fear?

Let’s look at some (almost) trivial but quite realistic examples in which an undesirable behavioral change to a test is not “noticed” by the associated production code. Then let’s talk about how we can counter that. To do so, we’ll return to the production code and rethink whether it can help us gauge test quality.

Counterexamples

A test suite that still passes against unchanged production code is NOT enough to draw conclusions about modifications made to the tests. I will demonstrate that with three examples, all adapted from my experience of working with medium-sized code bases.

Accidentally removed test

Let’s start with an almost trivial and somewhat silly example. It is still quite realistic; I’ve seen this problem materialize many times in many different variations.

We start with a working combination of a production_function() and its test suite. Different aspects of its operation are tested by test cases contained in the functions test_1, test_2 and test_3:


production_function(input) -> int {...}


// bodies of test functions
void test_1() {...}
void test_2() {...}
void test_3() {...}

// list of tests to be run by test controller application
test_suite[] = {
    test_1,
    test_2,
    test_3,
};

The test_suite list is used to register test cases to be run automatically by our regular testing system. For some reason we decide to “temporarily” comment out test_2() from it:


production_function(input) -> int {...}

// bodies of test functions
void test_1() {...}
void test_2() {...}
void test_3() {...}

// list of tests to be run by test controller application
test_suite[] = {
    test_1,
//  test_2,
    test_3,
};

There may be many reasons for doing that. Maybe the test case was too slow and not interesting for us because we were working on something that only mattered for test_3. Or maybe some state survived the end of the case, and we were debugging its effects on later test cases. We can still run test_2 manually, but the regular testing script will no longer do so as long as the line is commented out.

Satisfied with whatever other changes we’ve made to our code, we rerun the regular “full” test suite to ensure we did not break anything, and it passes. Happy, we commit our changes, completely forgetting to restore test_2 in the list.

Although we did not modify the production code in any way, it failed to help us discover our mistake when “refactoring” the tests. The mere fact that the test suite continued to pass was not enough to reveal the loss of a test case.

There are variations of this scenario. Here are a couple of others that I have seen play out in my practice:

  1. An auto-discovery mechanism is used by the controller application to find all applicable test cases, e.g., all functions whose names start with test_ are considered test cases. By renaming a test case, we make it invisible to the controller and have it silently excluded from testing.
  2. In Python, it is possible to redefine the same symbol multiple times without making the program syntactically incorrect. It is therefore possible to copy-paste one of the test case functions in order to change its body, but forget to rename the copy. The second function definition hides the first one, which worked only seconds ago; it is no longer visible to the test suite runner (see the sketch after this list).
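
To make the second scenario concrete, here is a minimal, self-contained sketch in Python; the test names and the tiny auto-discovery loop are invented for illustration:

def test_addition():
    assert 1 + 1 == 2

def test_addition():            # copy-pasted but never renamed
    assert 2 + 2 == 4           # this definition silently replaces the one above

# an auto-discovery runner that collects functions named test_* now sees only one
tests = [obj for name, obj in globals().items() if name.startswith("test_")]
print(len(tests))               # prints 1, not 2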

Disarmed test

In the second example, we have a long and intricate test case function, possibly with several “phases” of arrange-act-assert contained within:


void test_something_by_doing_long_flow() {
    ... code here...

    do_stuff();
    do_more_stuff();
    do_step();

    ... code here...

    assert_equal(got_value, expected_value);

}

Now let’s imagine again that you are working on this test and want to “temporarily” disable a part of it. The easiest way to do so is to put an extra return statement in the middle of the function’s body:

void test_something_by_doing_long_flow() {
    ... code here...

    do_stuff();
    do_more_stuff();
    do_step();

    ... code here...

    return;
    assert_equal(got_value, expected_value);
}

Modified this way, the test case exits prematurely, before reaching the final assert statement.

Similarly to the previous scenario, you work on the function for a while, modify additional things in it, and at some point forget about that extra return. Your attention is never brought back to it, because the test suite continues to pass. The next thing that happens is that the “disarmed” test case ends up in the main branch of the repository.

It is even worse than before. If you monitor code coverage of the production code, skipping the assert won’t even show up in it. The test still runs, and all the code paths in the production code are visited. It is just that no final verdict is ever reached.

As with the first example, such a regressive change may not be obvious even to an attentive human reader. The longer the test function’s body is, the easier it is to overlook such a mistake and let the return remain.

“Test disarming” can take different forms, resulting from, e.g., an incomplete copy-paste of conditionals. Consider another example:

void test_foobar() {
    ...
    exp_res = calculate_expected(...);
    act_res = get_result_from_production(...);

    // compare expected value against actually calculated
    assert_equal(exp_res, exp_res);
}

Have you noticed that the last line should have read assert_equal(exp_res, act_res);? Instead, it compares the value of exp_res with itself. The assertion can never fail. Because of that, we may make almost any modification to the rest of the function’s body above it, and the test will never detect it.

Such an embarrassing defect could be present in the function from the moment of its creation. If TDD was not followed, it has never been proven that the newly written test case can actually fail at all; it was born “fangless”. Or the defect may have been introduced later, when editing the test code and relying on the production code to “test the tests”.
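
Tooling can catch a tautological comparison like this one, though. Here is a rough sketch of a static check in Python that flags assert_equal(x, x) calls; it assumes the assertion helper is literally named assert_equal and is not meant as a complete linter:

import ast
import sys

class TautologyFinder(ast.NodeVisitor):
    """Report assert_equal() calls whose two arguments are the same expression."""
    def visit_Call(self, node):
        if (isinstance(node.func, ast.Name) and node.func.id == "assert_equal"
                and len(node.args) == 2
                and ast.dump(node.args[0]) == ast.dump(node.args[1])):
            print(f"line {node.lineno}: assert_equal() compares an expression with itself")
        self.generic_visit(node)

with open(sys.argv[1]) as f:
    TautologyFinder().visit(ast.parse(f.read()))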

Test for something that does not happen

I just experienced this recently with a piece of legacy code, which looked like this:

production_function(char buf[], control_state) {

    index = calculate_index(control_state);
    buf[index] = new_value(...);
}

The problem with this function was that a wrong item inside buf[] was updated, because calculate_index() was returning a wrong value. I was unwilling to test calculate_index() directly, because it was an implementation detail too far removed from the higher-level code I was interested in; I could, however, call production_function() from a test case. I was specifically concerned that the function was updating a wrong index in buf[]. I was not interested in which other indices it may or may not update during its work.

Let’s say we are hunting for a subtle memory overrun issue and want to ensure that it is no longer happening. Here is how a test for this requirement originally looked:


test_that_wrong_index_is_not_updated() {

    buf[] = allocate_mock_buf_with_all_zeros();
    control_state = mock_control_state();

    affected_index = <hardcoded value>;

    // supposed to update one of the items in buf[]
    production_function(buf, control_state);

    // assert that memory at affected_index was not changed
    assert_equal(buf[affected_index], 0);
}

Following TDD, this test case originally failed; then I fixed calculate_index() in the production code, which made it pass.

But I was not satisfied with how test_that_wrong_index_is_not_updated() looked, because affected_index was a magic hardcoded literal. I wanted to show its connection to control_state so that a future reader would have an easier time understanding why the test case is important.

The “refactored” version of the test case used alternatively_calculate_index() to show the “wrongful” connection between affected_index and control_state:


test_that_wrong_index_is_not_updated() {

    buf[] = allocate_mock_buf_with_all_zeros();
    control_state = mock_control_state();

    affected_index = alternatively_calculate_index(control_state);

    // supposed to update items in buf[], but not one at affected_index
    production_function(buf, control_state);

    // assert that memory at affected_index has not been changed
    assert_equal(buf[affected_index], 0);

}

This new test case continued to pass. However, I had made a mistake in defining alternatively_calculate_index(), so the value it returned was not equal to the <hardcoded value> used earlier in this position.

Yet the assertion in the test still held! In hindsight, that was to be expected. Only a limited subset of buf[] items is updated; after my mistake, the assertion simply monitored another one of the unchanged memory locations.

This was a clear case of “absence of evidence is not evidence of absence”. The hope that “the correctness of a test case is controlled by its production code” did not play out well.

How did I discover that my modifications to the test were not equivalent to the original, and that I had in fact broken the test? By reverting the fix for calculate_index() in the production code and rerunning the “refactored” test case against it. I expected the test case to start failing again; after all, it was designed to detect exactly the problem I had just restored in the production code. But it did not notice anything: the new variant of the test case passed. That was a clear indication that whatever modifications I had made to alternatively_calculate_index() were not correct.

We will return to this observation: a deliberately modified variant of the production code was able to point out the problem where the unmodified one could not.
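
To make the failure mode concrete, here is a toy reconstruction in Python; all names and index values are invented for illustration and do not reproduce the original code:

def calculate_index(control_state):
    return 3                     # the fixed production behavior

def buggy_calculate_index(control_state):
    return 2                     # the old bug the test was written to catch

def alternatively_calculate_index(control_state):
    return 5                     # my mistake: it should have matched the buggy index (2)

def production_function(buf, control_state, index_impl):
    buf[index_impl(control_state)] = 42

affected_index = alternatively_calculate_index(None)

buf = [0] * 8
production_function(buf, None, calculate_index)
assert buf[affected_index] == 0  # passes, but for the wrong reason

buf = [0] * 8
production_function(buf, None, buggy_calculate_index)
assert buf[affected_index] == 0  # STILL passes: the test no longer detects the bug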

What does test the tests?

I hope it is clear now why the notion that “production code tests the tests” is not reliable in practice. Many test code transformations leave a test syntactically correct but semantically non-equivalent, yet running the result against unmodified production code raises no alarms.

Can we test the tests at all? If so, with what? Some help comes from the observation that certain tools would have raised alarms for the examples above. Specifically:

  • Removal of a whole test case would result in a decrease of total production code coverage, which should trigger an investigation.
  • Disarming a test, or disabling a chunk of it with an early return, will decrease coverage of the test code itself. That should be an even more obvious problem, because uncovered test code is essentially dead code: unlike production code, it has no users other than the development team.
  • Many static analysis tools will point out unreachable code, tautological comparison, commented-out code or always-true/false conditionals in tests.
  • Dynamic code analysis tools can be built and used to check that all assertions are reached at least once, that no test case shadows another with the same name, etc. (a sketch of one such check follows this list).
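
As a taste of the last point, here is a minimal sketch in Python of such a dynamic check. It assumes the tests call the project’s own assert_equal() helper rather than a bare assert statement; the helper and the armed decorator are invented names:

import functools

_assertions_hit = 0

def assert_equal(got, expected):
    """Assertion helper that records that it has been reached."""
    global _assertions_hit
    _assertions_hit += 1
    if got != expected:
        raise AssertionError(f"{got!r} != {expected!r}")

def armed(test_func):
    """Fail a test that finishes without reaching a single assertion,
    e.g. because of a stray early return left in its body."""
    @functools.wraps(test_func)
    def wrapper(*args, **kwargs):
        global _assertions_hit
        _assertions_hit = 0
        result = test_func(*args, **kwargs)
        if _assertions_hit == 0:
            raise AssertionError(f"{test_func.__name__} made no assertions")
        return result
    return wrapper

@armed
def test_disarmed_example():
    return                       # stray early return: the decorator reports it
    assert_equal(2 + 2, 4)

test_disarmed_example()          # raises: "test_disarmed_example made no assertions"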

All such tools perform tests on the code, regardless of whether it is considered part of the production code or of a test suite. So the answer is: we can test tests with other tests!

Here lies a seemingly cyclic logical contradiction: what will test the tests for tests? Another layer of tests‽ Who will be testing those?

The solution lies in the observation that tests are needed only where we want to be able to safely transform, i.e., modify, the code. If there is no need for modifications, no tests for the tests are required. In other words, as long as we do not edit our static analysis tools, we do not have to test them.

Mutation of production

How else can we ensure that we can change a test without breaking it? Can production code help us in any way?

A glimpse of the answer can be seen in the third example: using a purposefully broken variant of production_function() allowed me to see an unexpected test pass where I naturally expected a failure.

This is the key! Of course, both the original and the transformed test cases should pass against the same copy of the production code. But they also must fail against the same intentionally broken variant of the production code!

There is a term for it: mutation testing. This discipline strengthens the aforementioned idea and makes it practical.
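
To illustrate the core idea (not any particular tool’s workflow), here is a hand-rolled miniature in Python; real mutation testing tools automate the generation of mutants and the bookkeeping:

def production_function(x):
    return x * 2                 # the real implementation

def mutant(x):
    return x * 2 + 1             # an intentionally broken variant

def test_doubles(impl):
    assert impl(3) == 6

# the test must pass against the real code...
test_doubles(production_function)

# ...and it must fail against the mutant, otherwise it is not really testing anything
try:
    test_doubles(mutant)
except AssertionError:
    print("the test killed the mutant: it is still armed")
else:
    print("the mutant survived: the test no longer checks this behavior")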

A big obstacle to effective mutation testing is its theoretical undecidability. While there could in principle be just one correct form of a production function (or at least several semantically equivalent representations), there is certainly more than one way to produce a broken variant (a mutant) from it. In the worst case, an unlimited number of mutant programs would have to be fed to our tests before we could tell whether the tests reject all of them.

Not all hope is lost, though. There are empirical approaches that make mutation testing practically applicable, but this exciting topic is beyond the scope of this post.

Summary

  • Production code does not test the tests.
  • Other tools can examine our tests when we modify them.
  • Mutated production code tests the tests.

Written by Grigory Rechistov in Uncategorized on 11.01.2023. Tags: tdd, tests, refactoring.

