How to test the tests
“What/who tests the tests?” This question arises at some point while learning Test-Driven Development (TDD). If every piece of production code ends up tested in the TDD process by construction, does the same apply to the code of the produced tests themselves?
It’s time to go meta.
Proper structure of a test
It is not enough to take just any arbitrary program that reports a binary error code and call it a test. While its outcome might correlate with the properties of the system under test, doing so without strategy also risks bringing a lot of test fragility, poor failure localization and other undesirable properties with it. We need some ground rules to define a good test.
We will only consider those programs that have been purposefully written to be tests. They have a fixed structure of three phases (Arrange, Act and Assert). Each test case is associated with its own specific use case scenario. All the tests are associated with a program serving as their input, which we’ll call production code.
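As an illustration, here is a minimal sketch of such a test case, using a built-in function as a stand-in for the production code:

```python
def test_sorting_orders_items_ascending():
    # Arrange: put the input data into a known state
    items = [3, 1, 2]

    # Act: perform the single action this use case is about
    result = sorted(items)

    # Assert: check the observable outcome of that action
    assert result == [1, 2, 3]
```

Each phase appears exactly once, and the test covers exactly one scenario.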
Why would you want to test tests?
Tests are also code. As any other code, testing code may contain bugs. Reporting either a false positive (a reaction when no associated production code change has happened) or a false negative (no reaction to a regression in the production code) should be considered a bug. Another manifestation of a bug in a test is when it never completes, thus giving no result at all.
Every test case by itself should be treated as a separate, independent program that takes input and returns a result. Its input happens to be another program (our production code). Its result happens to be a boolean value, interpreted as fail/pass.
Just as you want to be sure that your production code implements all valuable use cases and does not exhibit any undesirable behaviors, you’d want your tests to have the same guarantees. For production code, the confidence is created by, among other things, testing it. For the same reason, it is reasonable to also test these test programs.
Can a test case be created already broken?
Writing the smallest possible test case before its matching production code is created, i.e., doing it TDD-style, usually gives us enough confidence that we have indeed observed both true positive and true negative outcomes. Essentially, we have manually ensured that the test starts its life as a correct one.
In contrast to that, adding a new test for pre-written, already existing production code is much more risky.
When we take over a new project with already existing tests, we want to obtain confidence that they are reliable. Surely, we want to see them pass, but more importantly, we also want to see them react when an obvious regression is admitted to the production code. Until then, it cannot be said that the inherited test suite does not hide false negatives, i.e. bugs in the tests.
Can a test case become broken?
One way or another, we should get our test cases to become trusted and thus useful and valuable. What happens next as time goes by and requirements change? Can the tests inadvertently become bad? Can we ensure that they preserve the value they originally had?
A test, like any other digital program, does not go bad over time by itself. Its behavior may differ from its historically observed outcomes because: 1) its inputs are different, or 2) the text of the test itself has been modified *.
- Changing the input means altering the production code. In this context, we want the tests to react to those changes; sensitivity to the input is of course desirable.
- Changing the text of the test means that either we refactor it to get a better structure, or we adjust it to reflect its altered use case after the requirements have changed.
Calling something a refactoring does not automatically prevent behavioral changes from occurring. We would like to actually validate that the original invariants still hold, instead of just hoping for the best.
Adjusting the text of a test essentially means creating a new test case for pre-written production code. This is something we have already warned against. We will have to go through the trouble of ensuring we’ve seen the modified test produce both of the desirable outcomes again, this time for the new use case. We may assume that certain “small” modifications, e.g. changes to reference values, are “safe” and won’t render the test invalid. But we’d still want to back this hope up with some sort of an experiment.
To sum up, we should want to test existing test cases whenever we change them for any reason. Unsurprisingly, it happens to be the same thing we want for our production code: when it changes, it has to be tested.
What does NOT test the tests
I’ve already written about why I am certain that production code does not test associated tests.
Think about it: a test case is a program with two outcomes, and production code is its input. By providing only one input, we cannot expect it to provoke both outcomes. And this is exactly what we do most often: we rerun our tests for the same snapshot of the production code, expecting them to pass. This simply does not say anything about how/when/if they ever fail.
We’d have to provide at least two different inputs, i.e., at least two variants of production code, to see both pass and fail outcomes.
Bad test behaviors to look out for
So, we have changed the text of a test case. Is this test still good, or have we allowed a problem to slip into it?
Let’s try to list some of the undesirable behavioral changes that we might have introduced. This is by no means an exhaustive list.
For each such class of testing defect, we will discuss whether and how it can be automatically detected, and what sort of meta-test should be applied in each case.
The test case is not started
The simplest problem that can happen with a single test is that it is not started as a part of the whole test suite. The affected case does not run and thus is not able to produce any results.
Usually the problem is not in the test case itself, but rather in the surrounding environment, the so-called test runner. The usual reason is a failure to discover and schedule that test for execution. Maybe it was commented out, renamed, or retagged in such a way that the discovery mechanism no longer considers it something that needs to be run.
With the huge number of tests constituting a regular automated test suite, manually discovering that a few cases are not regularly executed is rather hard. It is not uncommon for weeks or even years to pass until someone stumbles on such an abandoned test case and wonders what is going on.
How to prevent this situation? A meta-test that does the following will help. It independently enumerates all existing test cases, compares that set against the set of recently run tests, and reports the difference. If the difference is empty, we are good; otherwise there are forgotten cases.
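A minimal sketch of such a meta-test, assuming pytest-style test files under a tests/ directory and a plain-text log of recently executed test names; both the log format and the discovery heuristic are assumptions made for illustration:

```python
import pathlib
import re

def discover_test_names(root="tests"):
    """Independently enumerate test functions by scanning the test files."""
    pattern = re.compile(r"^\s*def (test_\w+)", re.MULTILINE)
    names = set()
    for path in pathlib.Path(root).rglob("test_*.py"):
        names.update(pattern.findall(path.read_text()))
    return names

def test_no_forgotten_test_cases():
    # The set of recently run tests, recorded by the test runner elsewhere.
    executed = set(pathlib.Path("last_run_tests.txt").read_text().split())
    forgotten = discover_test_names() - executed
    assert not forgotten, f"test cases never executed: {sorted(forgotten)}"
```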
Looking at a coverage report for the test files (note: not the coverage of production code provided by the tests, but coverage of the test files themselves!) will also reveal the same problem. If there is a white spot in the report, that test file has not been visited during the test suite run.
Not all code sections inside a test case are used
Let’s take the idea of unreached test code from the previous section further. A situation where only some lines of code from a given test file are executed is highly suspicious.
- If the Assert phase is not reached, the test cannot react to a violation of the condition it was written to monitor.
- If parts of Arrange or Act are omitted, why are they even present in the file? There are no other contexts in which those skipped code paths would play a role: you are the only customer of your own test suite, it is not shipped to customers (otherwise it would be production code by definition!).
The coverage report for the test files is again our friend for reliably discovering such problems. There should be very little reason for the test suite to have coverage of less than 100%. If there are pieces and sub-expressions in the test suite that are not reached during test runs, they can be safely removed from the test suite as unused.
A story from my practice: I searched for `return` statements in our Python test suite that returned a void result (implicitly `None`), i.e., a bare `return` at the end of a line. All of those places turned out to be bugs where a test case was prematurely finished. A `return` statement without a return value is a blatant giveaway that the surrounding function is called for its side effects, and some of those side effects (those that come later in the file, after such a `return`) have been skipped.
Most often, it was a forgotten debug statement: someone had “temporarily” disabled a portion of the file, but then forgot both to re-enable it and to check that the test could still react.
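A quick way to hunt for such premature returns, assuming pytest-style files under a tests/ directory (the heuristic is deliberately crude):

```python
import pathlib
import re

# Matches a bare `return`, optionally followed by a trailing comment.
BARE_RETURN = re.compile(r"^\s*return\s*(#.*)?$")

for path in pathlib.Path("tests").rglob("test_*.py"):
    for number, line in enumerate(path.read_text().splitlines(), start=1):
        if BARE_RETURN.match(line):
            print(f"{path}:{number}: bare return inside a test file")
```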
There are no assertions
Another case that happens in practice is when a test case contains no explicit assertions. It does, however, contain some arrangement of the system under test, and some actions performed on it. But no assertions or expectations for the final (or even intermediate) program state can be found.
What the author of such a case most likely meant to express is this: during execution of the actions in the test, no exceptional situations will arise.
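A sketch of what such a test tends to look like (the names are hypothetical):

```python
def test_report_can_be_rendered():
    # Arrange (hypothetical production code names)
    generator = ReportGenerator(sample_report_data())

    # Act: the only implicit expectation is "this call does not raise"
    generator.render()

    # ...and no Assert phase follows
```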
Why is this approach undesirable?
- It is not specific. There are many exceptions that can happen, and many of them are worth looking out for in tests. But each of those should be isolated in a separate test case. Catching “all” exceptions is just as bad here as it is in production code.
- It conflates the test failure and test error outcomes. They are different, and blending them together contributes to the test’s fragility.
- It is very unlikely that a human has observed it actually fail. Had it been observed to fail, its author would have known which exact class of exception to expect and would have pinned it down in the test’s (non-existent) assertion. As it stands, we cannot even guarantee that such a test would react at all when the time comes.
- The approach essentially misplaces the responsibility for reporting the reaction, pushing it from the test itself up to the test runner, where the exception unwinding will end up (keep in mind that there was no code waiting for any particular exception in the test itself!). Because the test runner is not aware of the specifics of the use case in question, the error message is inevitably very generic and unhelpful: “something has happened at that place”.
How to detect this defect in your test suite? Any kind of static analysis that verifies that expectation statements are present and reachable in every test case will help.
A simple `grep -R -E --files-without-match "assert|expect"` will return the names of files that do not contain any assertions. It will not be perfectly accurate, but it will give you some idea of the situation.
A dynamic analysis pass that verifies that at least one assertion has been entered by the end of test execution is another approach. Some sort of spy object should be connected to all assertions (which can be turned into objects), and, upon the destruction of that object (which normally should always be reached at the end of the test), the spy checks how many assertions were encountered.
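One possible shape of such a spy, sketched as a pytest fixture; the `expect` helper is invented for illustration, and a real framework may offer hooks that make this cleaner:

```python
import pytest

class AssertionSpy:
    """Counts how many assertions a test case actually entered."""
    def __init__(self):
        self.count = 0

    def expect(self, condition, message=""):
        self.count += 1          # record that an assertion was reached
        assert condition, message

@pytest.fixture
def expect():
    spy = AssertionSpy()
    yield spy.expect
    # Teardown plays the role of the spy's destructor: complain if the test
    # finished without ever entering a single assertion.
    assert spy.count > 0, "test completed without reaching any assertion"

def test_addition(expect):
    expect(1 + 1 == 2)
```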
To fix this problem in a test, turn the implicit assertion “this arrange/act sequence does not throw any exceptions” into an explicit “this act code does not throw this specific set of exceptions”. To help yourself with that, corrupt the production code in such a way that you observe the test fail. If you cannot imagine how to make it fail, throw the test away.
It can get even worse. Some tests that I saw encode the expectation “this sequence will not time out”. This is as unspecific as it can be, and it is a big pain to debug.
Some of the assertions are not reached
A good test has Arrange, Act and Assert happening in that exact order. Each of the three phases occurs exactly once.
We’ve already talked about the case when there are fewer than one assertion in a test case. If there is more than one assertion, this means that two or more simpler test cases are hidden inside this test**.
If some of those assertions are not reached, which can be easily observed using the coverage report, it means that other symptoms described earlier, such as non-100% coverage and early returns, are very likely to also happen in this case.
To detect it, variations of the earlier methods can be used: statically or dynamically count the number of reachable asserts.
The way to improve the situation is to split the test into several smaller, more focused ones, each containing one logical assert.
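For example, a sketch of such a split, again with a built-in function standing in for the production code:

```python
# Before: two unrelated scenarios hidden inside one test case
def test_strip():
    assert "  a  ".strip() == "a"
    assert "".strip() == ""

# After: each focused case covers exactly one scenario
def test_strip_removes_surrounding_whitespace():
    assert "  a  ".strip() == "a"

def test_strip_leaves_an_empty_string_empty():
    assert "".strip() == ""
```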
The assertions do not fire when the production code gets a regression
All right, we have exactly one assertion and it is always reached. Are we good? Not yet. There may still be bugs in the asserted condition, leading to false negatives.
In the simplest cases, this happens because of a copy-paste error, a misunderstood requirement or a botched refactoring of the test.
```python
expected = get_expected()
actual = get_actual()
assert actual == actual
```
Notice how one of the assert’s operands is misspelled: the assertion compares `actual` against itself instead of against `expected`. This disarms the test case, making it unable to fail the expectation.
A good static code analyzer will usually report this as either an unused value, a tautological comparison, or both.
Ensuring that you’ve observed the test to properly fail after every change made to it (including its moment of creation) is a reliable way to guarantee that the test case indeed continues to work. But it is likely an impractical thing to do, especially when many tests are changed at the same time. One such example is when you migrate or adapt a test suite developed by someone else. You haven’t seen it fail yet. How can you trust that it will react when needed, i.e., when a regression is admitted to the production code?
To get some idea of whether the test case reacts (beyond always passing) to at least some of the production code changes, feed it different (mutated) variants of the production code. In an ideal case, every assertion of every test case in the suite should fire at least once on at least one mutated copy of the production code.
This type of exploration can be automated. Unfortunately, it is computationally expensive and has some theoretical limitations. But it is still a practically useful approach to estimating the quality of test cases and test suites.
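A hand-rolled, minimal illustration of the idea; real mutation testing tools such as mutmut or Cosmic Ray automate and generalize this, and the file name and the particular mutation below are assumptions for the sketch:

```python
import pathlib
import subprocess

SOURCE = pathlib.Path("calculator.py")       # hypothetical production module

original = SOURCE.read_text()
mutated = original.replace("+", "-", 1)      # introduce one artificial regression

SOURCE.write_text(mutated)
try:
    # The suite is expected to FAIL against the mutant; a clean pass means
    # the tests did not react to this particular regression.
    result = subprocess.run(["pytest", "-q"])
    print("mutant killed" if result.returncode != 0 else "mutant survived!")
finally:
    SOURCE.write_text(original)              # always restore the original code
```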
Things left undiscussed
There are other important aspects of test behavior that we have not touched on yet. Among them are slow tests, which run for longer than acceptable, and fragile tests, which fail for reasons not connected to the original use case for which the test was created.
But this article is already too long, so they are left out from it.
Summary
- Apply coverage tools to the test suite itself. Consider every line that is not reached for removal, or ask why it is not exercised.
- Use static analysis tools to catch copy-paste errors, tautological comparisons, unreachable code, always true conditions, commented out code, overly complex code and other issues in tests.
- Use mutation testing to connect individual test cases with the parts of production code they react to, and to find weaknesses in existing tests that cannot be detected by other methods.
* There is a third possibility for a test’s behavior to change: its environment has changed. By environment, I mean all the implicit and hard-to-track dependencies of the “abstract machine” that interprets the text of the program: interpreter and/or compiler version and flags, operating system interfaces, hardware quirks etc. The question of testing the environment is outside the scope of this essay.
** A series of physical assertion statements following each other comprises one logical assertion and as such does not usually violate the single-assert rule.