Atakua's Diary

Test code should not be Turing-complete

The ideas below were inspired by rereading of Tom Stuart’s Understanding Computation book, watching Uncle Bob’s Clean Code videos, and my thoughts about the nature of the problems I observe in daily work with certain tests.

In theory, there is a common agreement that simple tests are better than complex ones. In practice, some collectives rely on integration tests as their only source of confidence about the state of software projects they maintain.

Here is the problem with integration tests: you do not really know what you are testing when you run them. You just hope that a test this big will catch as much of interesting stuff. But hope is not a strategy, and this comes with a price. But let’s talk about the most annoying thing first.

The anti-pattern
- Meaning of a timeout
Do not rely on timeout too much
Can we avoid having a timeout?
Is it possible to avoid being Turing-complete?
How to limit damages from Turing-completeness

The anti-pattern

One persistent testing anti-pattern that I encounter goes about the following organization of a test.

Set up a large software configuration with many components connected to each other (system under test, or SUT).
Arrange starting states of all its components.
Let the SUT evolve on its own for some time, but no longer than a certain predetermined timeout threshold.
Compare evolved state of selected components against pre-recorded reference values.

At first, it follows the arrange-act-assert test design principle. But a very fragile assumption is hiding in plain sight.

The “Let the SUT evolve on its own for some time” step is very problematic here. Let us first define it though.

Meaning of a timeout

If we allow the SUT to do what it pleases, it may never return the control back to the test harness. To prevent this, a timeout event is set up. In its simplest form, a watchdog monitors wall clock time while the SUT is running. If the clock has advanced to a predetermined threshold and the SUT still has not terminated on its own, the watchdog signals the timeout, which usually causes the program to be forcefully terminated soon after.

A more general definition of timeout that we will be using replies a broader notion of “time”. Any (quasi)periodic event regularly and inevitably appearing during program’s operation can be used. We then set up a limit of how many those events we allow the watchdog to see before it is considered to be too much.

Just a few examples of such events: number of machine instructions executed, calls to a specific subroutine, or transactions made by the business logic layer.

Do not rely on timeout too much

In the absolute majority practical uses, reaching the timeout meant the test scenario did not succeed. But did it really fail? Or should we treat this outcome as something alternative to both outcomes?

Let’s outline first what expectations we should have from a good test.

An ideal test

Any test should help you maintain good behavioral properties of software, while catching unintended changes. An ideal test:

Never fails when no behavioral change worth reporting has been made.
Always fails when there is a valid reason to report a behavioral change.
When it fails, it points as close to the real place of the change as possible. It helps understanding the cause of the reported change.
It does not error, i.e. its failures are always problems in the production code, not in the test’s code itself.
It is fast to report either of outcomes (both failures and successes).

Let’s look at how these properties are not achieved with timeouts.

1. Zero false positives

For a test relying on timeout, this property cannot hold. It is clear for anyone who’ve tried to write (non-hard-real-time) code which expects certain processes to take predetermined amount of time. Such programs are just inherently fragile.

If the timeout is defined in terms of wall clock time, then it means we mix an uncontrollable external factor into the list of inputs affecting test outcomes. A host of external and uncontrollable reasons may cause a test to sporadically fail. A temporary slowdown of the host system caused by any reason (host CPU frequency throttling due to overheating, heavy swapping because of other programs contending for the resources etc.) has a chance to manifest itself as test failure.

If the timeout is expressed in terms of events internal to the configuration, then the situation is only marginally better. Surely we have more control over such internal measure of progress. E.g., a timeout situation will be reliably reproducible, because no random inputs would be affecting its outcome.

But relying on a timeout still means that, when designing the test, we have given up the possibility to know what the SUT does at any given situation, and replaced this certainty with a hope that it will eventually reach one of the expected states.

2. No false negatives

Strictly speaking, true positives are not affected if a test is using a timeout as one of its failure conditions. However, they may stil be masked by the timeout event, if it is reached before the true assertion (and successful termination) has been reached.

The combination of №1 and №2 makes the test indecisive. A passing test still tells the truth about the SUT being OK. But a failure-by-timeout means: “either the SUT’s behavior has changed in an expected but undesirable way, or you have been unlucky, or something else”. Go figure.

3. Assertion points close to a cause

A failure-by-timeout condition helps poorly in localizing sources of errors. Compare this to a regular assertion which usually points to specific state divergence.

It is similar to why “try—catch all” blocks in production code are bad. It smells with fragile design. The “try to run and fail if it took too long” testing approach clumps all the expected and unexpected reasons together, destroying important details useful for investigation. Debugging such test is the only way to restore this information. Debugging is very costly, and the test is not designed to provide the information it was supposed to.

Surely, it would be nice to catch all the errors. But the price of having non-specific reporting is too high.

4. Does not error

There is an important distinction between a failure and an error in a test. A failed test has stopped where its author has intended it to be able to stop (usually at an assert, expect statement or an equivalent). An error in a test happens elsewhere and stops its flow before all assertions have been reached and checked.

Oftentimes, an error is reported at an earlier phase, in arrange or act phases of the test. Conversely and by definition, a failure can only happen inside the assertion phase.

A test error oftentimes means that some preconditions of the test did not hold. A fix to an error should be targeted to satisfy the failed precondition. It should not have much to do with the production code exercised in the test. Doing so would be treating only symptoms, not their cause.

Compared to that, a test failure means that the test is actually doing its job as intended. An expected reaction to a failure is a change in production code.

For this reason, it is valuable to recognize, report and treat test errors and failures separately.

How is this distinction affected in presence of a timeout condition? It becomes much harder to tell errors and failures apart. A timeout cannot be classified as a test failure, because the production code that “executes for a while” does not contain testing assertions inside it. It is so by definition — we are running production code. Any other types of checks placed in it (e.g. runtime assertions), if they do trigger, should be classified as test errors, i.e. failing preconditions making the whole test invalid until they have been fixed.

For this reason, a test relying on a timeout to stop it from running forever automatically tells us: “I contain unknown errors, because part of the behavior is unspecified”.

5 Test is fast

A successful test run may be fast. But if its logic contains a timeout clause, the threshold for it is usually chosen to be conservatively high in relation to the median test run time. This is to minimize statistical probability of unrelated external inputs affecting the test outcome. E.g. for a test running for about 100 seconds, it is not uncommon to see its threshold to be set to 1000 seconds.

This makes such test unfriendly to human inspection exactly when we need it most, i.e. when it reports a failure. A failure by timeout will of course take the longest possible time to run.

This means rerunning the broken test during debugging iterations is very slow. We have to wait until the timeout clause to stop its execution every time we have made a change and wish to inspect its effects on the test.

Can we avoid having a timeout?

It should now be clear why this testing anti-pattern is bad. How is it possible to stop relying on timeout?

The halting problem tells us that, for computers equivalent in their power to Turing machines, it is impossible to decide whether they will ever stop on all inputs. Tests are programs, so this limitation applies to them as well.

Enforcing a timeout is one way to make a test be less general in its computational power than a Turing machine. But by doing so in a general case, we are giving away control the test should have over the SUT, and knowledge about what the SUT is doing versus what it is supposed to do.

Is there any other way? If we designed our tests differently, could we avoid this weakness?

Let’s remember that there are other useful computational constructs which are strictly less powerful than Turing machines. I am talking about finite automata, context free grammars and push-down automata. For these machines you can determine whether they will stop or not. Surely they cannot represent an arbitrary computation. But that is exactly why we should use them for defining test harnesses.

In practice, this still means avoiding writing tests that:

have loops with unbounded number of iterations, or
stimulate the SUT in a way that causes it to perform unbounded number of iterations while calculating the reply.

If the test follows the arrange-act-assert pattern in its organization, it is most likely the arrange and act phases that should be restricted in how they operate.

Is it possible to avoid being Turing-complete?

Note that I am stating that tests by itself should not be overly powerful. But tests exercise the SUT, and some components of SUT may happen to be Turing-complete. Addition of whatever practically useful test harness built around such SUT will not reduce its total computational power.

Let’s not forget that the host processor that executes actual machine instructions is Turing complete, as is the microcode used inside it. So, it is Turing machines all the way down.

In the face of these challenges, it is strange to tell: “do not build a Turing machine”.

How to limit damages from Turing-completeness

You cannot avoid the potentially disastrous effects of the complexity hiding inside the nasty machine. But you can try to limit the scope of the uncertainty they threaten to create when a test fails by timeout.

Remember that tests are meant to help you pinpoint unintended changes in SUT behavior. A single timeout built around a big test leaves a huge scope for debugging in a case of timeout. To limit such damage, the initial creation of an integration test should be followed by these steps.

Split the scenario you test into smaller phases. Each phase should have its own assertion at the end to check that the SUT state is still within the expected.
Guard each phase by its own watchdog. Because the phases are shorter than the whole, their timeout thresholds will be smaller.
Whenever a test fails by reporting a timeout, fixing its cause in SUT is preceded by improvements in the test harness. A new assertion clause should be added to the test, that focuses on a newly discovered failure mode. Oftentimes splitting the test phase into new smaller phases is the best way to integrate the newly obtained knowledge.

In the end it comes down to having shorter, more focused, faster completing tests. Debugging them becomes easier as they tend to report more specific information closer to the point in SUT’s evolution where a divergence has appeared. Even if everything else fails and the test case is interrupted by a timeout, it happens earlier and the created uncertainty spans over a smaller scope.

Moreover, because such test phases depend on each other, a failure in an earlier phase saves time. There is no need to spend processor cycles attempting to run later phases which are already doomed to fail (or even worse, pass because they fail to sense for an earlier problem). That creates less distraction for the person analyzing new regressions in a big test suite.