When not to prepare an automated test?
- Clarifying the definitions
- When not to test
- Test is too slow
- It is already tested elsewhere
- The test would be as complex as its production code
- Something is trivial to prove to be correct by just looking at it
- A test cannot provide stable binary output
- A human-readable use case is not expressible as a computer program
After having learned the principles and the motivation behind test-driven development (TDD), I had an unanswered question about the universality of its underlying ideas and their limits.
The ideas are as follows.
- Tests provide feedback; you want this feedback early to guide your design decisions.
- Tests are also supposed to be mapped to use cases, i.e. scenarios of behavior observable by a user.
- A test should not react to changes unrelated to its use case, and it should react to changes that are connected to it.
- Stable and deterministic tests can be run automatically at any moment to verify whether all currently implemented use cases are still covered by the implementation.
It sounds great and is obviously useful. The question, however, is: are there any practical situations in which it is either not desirable or not possible to prepare an automated test for a use case?
Clarifying the definitions
Before we go on to the cases, let’s define certain things.
A test case is a fixed sequence of actions (done by a machine, a human, or another agent) meant to inspect the behavior of a system under test (SUT) and leading to conclusions about that behavior.
An automated test case is another program (the test program) that runs without human interaction (beyond the initial triggering, which may be done by a human). That program interacts with the SUT and is guaranteed to terminate with a binary result. This result is interpreted as a good/bad, pass/fail, etc. outcome.
A use case is a scenario written in a human language. It is expressed in such a way that a human can understand the scenario well enough to judge that it is valuable and meaningful.
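To make the second definition concrete, here is a minimal sketch of an automated test case in Python (the `parse_price` function and its behavior are invented for illustration): the test program drives the SUT without human interaction and terminates with a binary outcome.

```python
# A minimal automated test case: it drives the SUT (parse_price) without
# human interaction and terminates with a binary outcome: the assertion
# either holds (pass) or raises AssertionError (fail).
from decimal import Decimal

def parse_price(text: str) -> Decimal:
    """The system under test (an invented example)."""
    return Decimal(text.strip().lstrip("$"))

def test_parse_price_strips_currency_sign() -> None:
    assert parse_price(" $12.50 ") == Decimal("12.50")

if __name__ == "__main__":
    test_parse_price_strips_currency_sign()
    print("PASS")
```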
When not to test
Over the past four years, I’ve been thinking about ways to push these ideas until they no longer seem applicable or useful. Let me share some of my conclusions about the situations when an automated test should not or cannot be prepared for a given use case.
Test is too slow
The best feedback is fast. The more time an automated test takes to complete, the less value it has for doing TDD iterations. In the end, people stop running it regularly, and everyone has bad feelings about that test.
A slow test may have slow dependencies; it can be sped up by isolating those dependencies and replacing them with faster test doubles. Alternatively, the matching use case may simply be too big; splitting it into a sequence of smaller use cases, each with its own independent automated test case, is another way to improve the situation.
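As an illustration of the first technique, here is a minimal sketch (all names are hypothetical): a report builder normally depends on a slow remote rates service, and the test replaces that dependency with a fast in-memory double.

```python
# Hypothetical example: ReportBuilder depends on a currency-rates provider.
# The real provider makes a slow network call; the test substitutes a fast
# in-memory double so the TDD feedback loop stays quick.

class FakeRatesProvider:
    """Test double: returns canned data instantly instead of calling a remote API."""
    def current_rate(self, currency: str) -> float:
        return {"EUR": 1.1, "GBP": 1.3}[currency]

class ReportBuilder:
    def __init__(self, rates) -> None:
        self.rates = rates

    def line(self, amount_usd: float, currency: str) -> str:
        return f"{amount_usd * self.rates.current_rate(currency):.2f} {currency}"

def test_report_line_uses_current_rate() -> None:
    builder = ReportBuilder(FakeRatesProvider())
    assert builder.line(100.0, "EUR") == "110.00 EUR"
```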
It is already tested elsewhere
It is not uncommon that something is already done but you do not know about it. If the software functionality is not unambiguously mapped to test cases (and such a mapping is very hard to achieve in big projects), the use case may already be covered by an existing test.
In the TDD spirit, it helps to ensure that the new automated test fails first. If you cannot make it fail, sometimes it means that the use case is already implemented, or else you have some other gap in understanding the system that has to be closed.
There is no point in testing something twice. The outcomes of two such tests will always be perfectly correlated, so the second one will not give any new information while still consuming computer time.
The test would be as complex as its production code
If you notice that the test code starts to closely repeat the production code it is meant to test, it is time to stop and reconsider. This situation is known as a tautological test. A couple of examples (a sketch of the first one follows the list):
- A test that checks that a specific string present in the production code is returned repeats that string verbatim in the assertion.
- A test double (a mock) repeats essential parts of the business logic its production counterpart contains.
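Here is a sketch of the first kind of tautology (the greeting function is invented): the assertion restates the exact string that the production code returns, so the test mirrors the implementation instead of expressing a use case.

```python
# Production code (invented example)
def greeting() -> str:
    return "Welcome to the Example Store!"

# Tautological test: the assertion repeats the production string verbatim.
# It does not encode a use case; it merely mirrors the implementation,
# so any wording change breaks it without revealing a real regression.
def test_greeting_text() -> None:
    assert greeting() == "Welcome to the Example Store!"
```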
Test cases should remain as specific (and as simple) as possible, while the production code should evolve towards becoming more generic as new use cases are added. From the TDD perspective, it is fine to have tests and production code almost identical in the beginning, but then they must evolve away from repeating each other.
Another example from my practice is when you need to implement a finite state machine (FSM). To test it, you usually end up reimplementing the same machine in its entirety (maybe using a different programming technique), but in the tests: all the valid state transitions form a data table, and there are not many unique ways to express the same table, as the sketch below illustrates. We will get back to this FSM example in a second.
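A minimal sketch of such a tautology (the state and event names are made up): the production FSM is driven by a transition table, and the test carries its own verbatim copy of that table.

```python
# Production FSM (invented): a data table of valid transitions drives it.
TRANSITIONS = {
    ("idle", "play"):     "playing",
    ("playing", "pause"): "paused",
    ("paused", "play"):   "playing",
    ("playing", "stop"):  "idle",
}

def next_state(state: str, event: str) -> str:
    return TRANSITIONS[(state, event)]

# Tautological test: it carries a verbatim copy of the same table,
# i.e. it re-implements the machine's knowledge inside the test.
def test_all_transitions() -> None:
    expected = {
        ("idle", "play"):     "playing",
        ("playing", "pause"): "paused",
        ("paused", "play"):   "playing",
        ("playing", "stop"):  "idle",
    }
    for (before, event), after in expected.items():
        assert next_state(before, event) == after
```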
What does all of this tell us? It may be a sign that these automated tests were prepared at too low a level of abstraction, which naturally brings up the next case.
Something is trivial to prove to be correct by just looking at it
Let’s think about the FSM example once again. The main action any FSM does is the state transition; we can cover this requirement with a first test case. Another thing the FSM does is invoke the action associated with a given state transition; this we can cover with a second test.
Can we prepare any more non-tautological tests after that?
What’s left in the FSM is the knowledge of what before-states there are and which after-states they jump to. And that is in fact data, not code. Data is immobile and has no behavior. You cannot test data (at least not in the same way as we test behavior). An attempt to check some or all state transitions and effects of actions means that you’d essentially reimplement the state machine, partially or in full, but this time in tests.
The data table we are talking about is either correct for the given application or not. From this perspective, it makes no sense to test this data in isolation. What we should do instead is test the FSM and its context together, i.e. at a higher application layer, as sketched below.
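A sketch of what testing at a higher layer could look like, reusing the `next_state` function from the previous sketch and wrapping it in a hypothetical `Player`:

```python
# Sketch of a higher-layer test: the FSM is exercised through its caller
# (a hypothetical Player), and the assertion is about behavior a user of
# that layer can observe, not about the transition table itself.
class Player:
    def __init__(self) -> None:
        self.state = "idle"
        self.audio_started = False

    def handle(self, event: str) -> None:
        self.state = next_state(self.state, event)  # the FSM from the previous sketch
        if self.state == "playing":
            self.audio_started = True               # the action tied to this transition

def test_pressing_play_starts_audio() -> None:
    player = Player()
    player.handle("play")
    assert player.audio_started
```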
How many abstraction layers should a test case tie together?
Bringing the FSM callers into the test means that several levels of abstraction are exercised in that test case.
How many levels should be tested simultaneously? Apparently one is too few, as it leads to tautological tests.
I’d argue that the number should be dictated by how many layers of abstraction a human can operate with efficiently. And practice shows that this limit is not very high.
A test case that ties two (or maybe three) layers of abstraction is simple enough to reason about and complex enough to capture behavior that cannot be said to be trivial. That is what we should target when formulating TDD use cases.
Conversely, if a test case goes through more than four levels of abstraction, it becomes harder to understand, and problematic to maintain without mentally untying the layers, i.e., without debugging it.
About not testing tests themselves
The idea that some code might just be too obvious to require tests of its own should help us answer another question: “what tests the tests?” Or rather, “what program automatically tests the test cases?”
It is certainly not the production code that we could trust to do that (why it is so is explained here). We could, of course, write special next-level meta-tests which treat the original test cases as their production code, but doing so only pushes the original question onto the meta-level.
It appears that the answer to the question is: nothing automatically tests the tests. But that should not alarm us if we adhere to the principle that test cases are created to be trivially simple.
The absence of automated testing of tests does not mean that we simply trust them to be faultless. In TDD, any new test case is effectively tested by hand by running it twice, each run against different production code: first against code that does not yet implement the use case (the test must fail), then against code that does (the test must pass). The human is required to visually observe both outcomes, and that is taken as the proof of the test’s correctness. Later on, we rely on the axiom that an intact program preserves its semantics, so we do not need to retest the case as long as it and its associated use case stay the same. If we do need to change the test, it has to be re-verified in the same manual way.
As one corollary of this, an ideal test scenario should not contain any branching at all; otherwise we would have to run it twice for each if statement to ensure that all code paths are covered and behave as expected. The implicit branches hidden inside assertion/expectation expressions do not count: they are not part of the test case but rather a part of the testing framework. You could say that the cyclomatic complexity of any test case always ought to be equal to one.
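A sketch of what that looks like in practice (the `Order` and `apply_discount` names are invented): the test body is a straight arrange-act-assert line with no branches, so its cyclomatic complexity is one.

```python
from dataclasses import dataclass, replace

@dataclass
class Order:
    total: float

def apply_discount(order: Order, percent: float) -> Order:
    return replace(order, total=order.total * (1 - percent / 100))

# The test body is a straight line: arrange, act, assert.
# No if/else, no loops; its cyclomatic complexity equals one.
def test_half_price_discount_is_applied() -> None:
    discounted = apply_discount(Order(total=200.0), percent=50)
    assert discounted.total == 100.0
```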
A test cannot provide stable binary output
We posited that an automated test must return a binary (true/false) result. Why is it useful? Because then we can combine multiple automated test cases into automated test suites. A test suite passes if all its cases report success. As soon as at least one of them reports failure, it means that the whole suite fails, and a human action is needed.
“Stable” means that the outcome of a test does not depend on aspects of the program unrelated to the connected use case. If all test cases in a test suite are stable, the suite itself is stable. That in turn means that any failure maps strictly to an individual use case, rather than possibly being a false alarm.
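A sketch of this composition, assuming each test case either returns normally (pass) or raises an assertion error (fail):

```python
# A suite reduces to the conjunction of its cases' binary outcomes.
def run_case(case) -> bool:
    """Run one automated test case; map its outcome to a binary value."""
    try:
        case()
        return True
    except AssertionError:
        return False

def run_suite(cases) -> bool:
    """The suite passes only if every case passes; a single failure fails it."""
    return all(run_case(c) for c in cases)
```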
Not every use case formulated in human language conforms to this requirement. A popular example: a GUI program performs a series of actions and finishes with a certain graphical screen state. A use case for this program describes how the screen should look. A screen typically contains millions of pixels of different colors.
A naive automated test would implement this requirement by comparing the screen state against a reference image stored somewhere. Surely, the comparison result will be binary: the images are either identical or different. But it will not be stable: a single pixel changing color because of any unrelated modification in the production code will cause the test to react. It will then require a human to look at the new picture, decide whether it still corresponds to the use case, and if it does, update the stored reference image. In other words, such a failure will be classified as a false positive.
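A sketch of such a naive test (the screen-capture helper is a placeholder, and the reference image path is invented):

```python
def capture_screen() -> bytes:
    """Placeholder for a real, platform-specific screen-capture call."""
    raise NotImplementedError

# Naive screenshot test: its result is binary, but it is not stable.
# Any unrelated change that flips a single pixel makes it fail and forces
# a human to inspect the new image and possibly refresh reference.png.
def test_final_screen_matches_reference() -> None:
    with open("reference.png", "rb") as f:   # invented reference image
        expected = f.read()
    assert capture_screen() == expected
```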
The mandatory and frequent human intervention required to handle false positives means that such a test is not really automated. Instead of the dichotomy “the use case holds” or “the use case does not hold”, it produces three outcomes represented by its two return values:
- “the use case holds because no change is detected”,
- “the use case holds but unrelated things have changed”, and
- “the use case does not hold”.
The last two cases require human analysis to tell them apart, because they both are mapped to the same return value.
For this use case, we were not able to write a good automated test.
The idea of humble objects is that you have thin layers on the boundaries of your program that are not covered by automated tests, because, for the reasons above, proper automated tests cannot be created for them. You want to keep these layers thin, with as little logic in them as possible. Typical examples are: UI and especially GUI, network and other types of I/O, database connections.
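A sketch of the humble-object arrangement (all names are invented): the decision logic lives in a presenter that is easy to test, while the GUI layer is reduced to a trivial pass-through that stays uncovered.

```python
# The presenter holds the logic and can be covered by fast automated tests.
class LoginPresenter:
    def message_for(self, ok: bool, user: str) -> str:
        return f"Welcome, {user}!" if ok else "Invalid credentials"

# The humble object: a thin GUI shim with no logic worth testing.
# It only forwards the presenter's output to the widget toolkit and stays untested.
class LoginScreen:
    def __init__(self, presenter: LoginPresenter, label) -> None:
        self.presenter = presenter
        self.label = label    # any widget with a set_text() method

    def show_result(self, ok: bool, user: str) -> None:
        self.label.set_text(self.presenter.message_for(ok, user))

def test_presenter_formats_welcome_message() -> None:
    assert LoginPresenter().message_for(True, "alice") == "Welcome, alice!"
```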
A human-readable use case is not expressible as a computer program
Many correct informal and formal statements about program behavior cannot be verified by any program. This is essentially what Rice’s theorem states.
This means that it is possible to have a use case that is reasonable to a human and is valuable for a program to demonstrate, but which cannot be accurately mapped to an automated test.
A few examples of such use cases that can be stated about a program are listed below. “A program” here could mean the whole thing, a module, a class, or any other subsystem that could be approached for testing.
- The program does not contain viruses.
- The program terminates, i.e. it does not hang (see the halting problem).
- For a program written in C/C++, it demonstrates no undefined behavior. A particular subcase: the program does not crash.
- A program is identical in behavior to another given program.
- All the problems listed here.
It is typical to substitute such “untestable” use cases with statements that are ostensibly closely related but in practice more restrictive and therefore weaker, yet which can be turned into automated tests.
Examples of such simplifications for some of the problems above:
- Instead of trying to examine the program’s behavior to determine whether it is free of malicious effects, its code is searched for byte patterns of known malware. This approach is free from neither false negatives (not-yet-known malicious pieces will not be reported) nor false positives (a “dangerous” byte pattern may occur by chance in a place that can never be reached during actual program execution). Thus, it is not equivalent to what the original use case tried to stipulate.
- The “does not hang” requirement is replaced by “terminates within a predetermined amount of time”. They are not the same. The second condition is not stable, because a timeout outcome necessitates human analysis, making it a non-automated test.
- The “does not crash” statement is often replaced with “does not generate an OS signal”. The second statement makes the test dependent on the host platform (compiler/OS) implementation. The original use case was formulated for an abstract machine.