Do you write your tests after the fact?

Do you write your tests after the corresponding production code has been written? Do you run your new test, see it pass, consider the job done, and then move on to the next task?

Here is a thought experiment for you. I hope it will illustrate a problem with this approach.

Hello world?

Imagine you need to write a simple program, e.g. one that outputs “Hello, world” on the screen and maybe does some other things.

Right, you’ve written the code and it compiles without errors. You run it and… nothing is produced on the screen. There is no error, but no “Hello world” is visible either.

But you re-read the text of the program and you clearly see that it contains the necessary string, and there is clearly a call to a printing function. By the looks of it, it must be working.

Here is one possible text of such a “Hello world” program in C, where you also wanted to redirect the output to a user-specified file descriptor†.

#include <stdio.h>

int main(int argc, char** argv) {
    int stdio = argv[1];
    dprintf(stdio, "%s\n", "Hello world");
    return 0;
}
Observation 1: Well, you think, by the looks of it, there is no apparent error in my program. I am sure it will work just fine in production. Let’s push it to production!

Would you accept such a course of action? I suspect not. While the program appears plausible on paper, its actual execution does not demonstrate correct behavior. Obviously, more work on it is required.

But let’s change the assignment a tiny bit and see how it affects the reasoning.

Hello test

Imagine you have written a piece of production code. Then you decided to write a test case for it. You wrote the test, it compiles without errors, you run it and… nothing is printed on the screen; it passes! There is no error.

Observation 2: Well, you think, by the looks of it, there is no apparent error in my test. I am sure this test will work just fine in production. Let’s push it to production!

Looks pretty normal, right?

Tests are programs that are unjustly judged

Compare the two observations highlighted above. They are worded almost identically, only with the word “program” replaced by “test”.

But for some reason, the first one looks wrong to most people, while the second one may look OK to some. However, swapping a placeholder word in a piece of logical reasoning should not have such a dramatic effect on its truthfulness!

Both lines of reasoning are in fact wrong. A test case is also a program. Its purpose is to report a discrepancy between expected and observed behaviors. Running it once and not seeing such a report is no basis for concluding that the test is correct. Why? For exactly the same reason that running any other program and not seeing it work is enough evidence that we are not done with it yet.

Why does such a disconnect exist between our treatment of production code and tests? I believe it can be partly explained by the terminology we have been using and the emotional connotations it has created in our minds.

Inverted terminology

A normal piece of code is made to succeed. Whenever it fails, either by going into an erroneous state or by returning incorrect results, that is considered an undesirable outcome.

On the other hand, a succeeding test case is our usual, normal expectation. We carry this expectation over from the one we hold for production code. The opposite situation, when a test case fails, feels “bad”. It feels undesirable and worth reverting to the default “good” passing state as quickly as possible, the same as with a failure in the production code.

And here is where the wrong assumption hides. A test case’s purpose is not to always succeed. That is not the right expectation for it to live up to. Its purpose is to tell apart “normal” and “abnormal” behaviors of the associated production code. When the test “fails”, it is not because the test is bad. It is because it took its input data (which is produced by another program), processed it, and returned a correct result: that input data is wrong!

To “fail” is a bad word for us humans. We subconsciously avoid experiencing, observing, and dealing with failures. But that creates a weakness and a bias in our judgement, like the unfair treatment of testing code described at the beginning.

Let’s change the terminology for tests a bit. A test case does not “fail” or “pass”; it senses which of two different situations the behavior of the associated production piece is in. The primary function of a test is to sense an anomaly in its input.

Each test is a small but independent program, not directly tied to the production code, a.k.a. the “main” program. Surely, each test consumes a part of the latter’s behavior, but that is merely an input which could be abstracted away from its producer.

Therefore, when we are preparing a new test case program, we should focus on its main function: sensing for and reporting two situations. We only use the production code to feed it data.

Observing only one of the test outcomes (no anomalies in the test input), as was done in the second example above, is not enough. Similarly, observing only the alternative (the test always failing regardless of input) is just as bad. We need to see both of them.

Only if we have observed both alternatives play out as we manipulate the production code (by intentionally “breaking” it at some point and restoring it again later) can we be relatively sure that the new test case fully does its job of telling the situations apart.

In other words, observing a true negative outcome does not prove anything about the true positive outcome.

Writing tests before and after production code

The test-driven development (TDD) approach tells us to write a new test case even before its matching production code is ready. Pretty crazy, huh?

But this helps us treat test cases honestly. With TDD, when preparing a new test case, we know for sure that it will only act on “bad” input. The corresponding piece of production code has not been written yet, so anything the “main” program produces in relation to the new test should be sensed as “bad” and reported as an anomaly, and we are required to observe that report before moving on to writing the production code.

Then we switch to adding production code. We have our new test at hand, and we can rerun it periodically to check whether we have achieved the “good” input consumed by the test. Again, we are forced to observe the test case’s reaction to our efforts. But this time we are striving to make the same test sense the change in its input. As a result, we will have observed both outcomes of the test, and we have high confidence that it was our recent and precisely known production code additions that caused the test to go from “failed” to “passed”.

You can achieve almost the same effect even if your tests are prepared after the fact. First, you prepare a test that passes on a piece of existing production code. But what you must do after that is intentionally break the production code for a moment and rerun the same test. The observed outcome of the test must change from “no anomalies” to “failed expectation”. That will be a confirmation that the connection you intend to exist between the production code and the test case is indeed there. Both outcomes of the test case were observed, and you know for sure which part of the production code affects this exact outcome.

This reveals the weakness of preparing tests after the fact: there is usually a big temptation to skip that second part. Simply moving on to the implementation of the next sub-task is typical, despite the botched reasoning behind it. TDD prevents us from making such a Mephistophelean deal by forcing us to observe both outcomes.

What about bugs in tests?

If a test case is another program, how do we test it? Do we even need to test it? Does the associated production code test it? I wrote a bit about these topics before here and here.


  1. When preparing your test, make sure you have observed it fail at least once.
  2. Be sure to establish the connection between production code and test code by being able to trigger different outcomes of the test through manipulations of the production code.
  3. If you cannot make your test case fail, that is just as bad as if you cannot make it pass. It means that you have not yet fully understood the cause-effect connections in the related production code.
  4. TDD imposes the discipline of seeing both outcomes of every test case. It also enforces the requirement that you know which changes to the production code affect the test.
  5. Tests are small programs independent of the main production code. They do consume the latter’s results, but they deserve equal treatment in terms of manual verification that they work for all kinds of inputs.

† The range checking for argv is intentionally missing in this program, which might explain the observed (undefined) behavior.

Written by Grigory Rechistov in Uncategorized on 25.02.2024. Tags: tdd, tests,

Copyright © 2024 Grigory Rechistov