Atakua's Diary

A test that times out has a bug in the test, not in the payload

TL;DR: Do not lump timeouts and other test failures together. Timeouts signal a problem with test harness design, not just a bug in the system it controls.

The software simulation field is unique in the following way. The input “data” our simulator program processes is actually the “code” of another program. In a way, you run different code when you change the input data. The amount of code not under your control is what makes it so interesting.

Instead of a situation where a program consumes passive data:

+------------+
| data       |
+------------+
| program    |
+------------+

we have a situation where a simulator controls the code it receives as input:

+------------+
| data       |
+------------+
| code       |
+------------+
| simulator  |
+------------+

In the majority of cases, the input code is written in a language that is Turing-complete in its expressiveness. Simply put, it usually has loops (or equivalent control flow transitions) for which the number of iterations is not known in advance.
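As a small illustration of a loop whose iteration count is not known in advance, here is a sketch based on the Collatz rule. The function name is made up for this example. Whether this loop terminates for every positive integer is a well-known open question, so the only general way to learn the step count is to run it.

```python
def collatz_steps(n: int) -> int:
    """Count how many times the Collatz rule is applied until n reaches 1."""
    steps = 0
    while n != 1:
        # Halve even numbers; map odd numbers to 3n + 1.
        n = n // 2 if n % 2 == 0 else 3 * n + 1
        steps += 1
    return steps
```

A simulator fed such a program as input has no shortcut for predicting when, or whether, the program will finish.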

Typical domains with similar structure are built around program analysis and interpretation. Examples:

Antivirus software, which should make decisions about other programs’ behavior (malicious/non-malicious) without actually running them.
Any language VM interpreter, such as the Python VM or the Java VM. Its data dependencies are unknown in advance, because they are defined by the input data, not by the algorithm of the interpreter itself.

System under test

This brings us to the topic of testing simulator programs, and how their tests should and should not be organized.

Usually, an integration test contains a test script (harness). Its goal is to organize the environment around the simulator, then drive execution until certain conditions are met. After that the script makes a decision about whether the test has passed or failed.

Usually a test script is written using another Turing-complete language. This adds another layer of control to our stack:

+-------------+
| data        |
+-------------+
| code        |
+-------------+
| simulator   |
+-------------+
| test script |
+-------------+

Complications with testing simulators

The halting problem is undecidable: simply put, there is no general algorithm that can tell, for an arbitrary input program, whether it will eventually halt. For our application, this means there is no general way to tell in advance whether a specific input will hang the simulator interpreting it.

From the testing perspective, we want to test our simulator by feeding it different inputs and observing its behavior. We may observe the behavior by waiting for the simulation to reach a specific pre-determined state (== halting), and then comparing its outputs (or its internal state) with reference values. Of course, all of this should be automated to exclude human participation, meaning there is a third program, the testing framework, that controls the simulator, waits for the halting condition, and does the comparison and reporting.

The problem is, we cannot predict whether a given combination of simulator code and input will ever finish. The testing harness is itself a program, and because the halting problem is undecidable it cannot decide this on its own.

Less important at first sight, but in fact just as important for humans, is the following consideration. If, by some improbable means, the testing harness could reliably determine that the simulator under test has hung, it would still leave the problem of tracing the hang back to its cause to a human.

The intent of comparing the halted state against the reference is to help a human to see what has diverged. An ideal test would also give the explanation why there is a difference. Usually the latter is achieved by failing early, as close to the real cause as possible.

Suppose a program has been reported as hung by the test script. Even if the conclusion were true, reporting it carries no knowledge about why the hang has happened. This means that the test harness fails to help with localization of the issue. We will get back to this later.

Just add a timeout?

We cannot solve the halting problem in the general case. Maybe we can invent a good enough partial solution for it?

The simplest and seemingly “good enough” heuristic that humans usually apply in such a case is to choose an arbitrarily large timeout threshold. Anything that runs longer than the threshold is considered a “hang”.

The technical application of the idea above is to have an overseer process (e.g. the timeout utility) that kills the running test if the latter has been alive for longer than a predetermined period.
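A minimal sketch of such an overseer, assuming the test is started as a child process and reports its result through its exit code. The function name and the returned strings are made up for this example.

```python
import subprocess
import sys


def run_with_overseer(cmd, timeout_s):
    """Run a test command and kill it if it stays alive longer than timeout_s."""
    try:
        proc = subprocess.run(cmd, timeout=timeout_s, capture_output=True, text=True)
    except subprocess.TimeoutExpired:
        # subprocess.run() has already killed the child at this point.
        # Note that the overseer knows nothing about *why* the test ran too long.
        return "timeout"
    return "pass" if proc.returncode == 0 else "fail"


if __name__ == "__main__":
    # Hypothetical stand-in for a hanging test script: an endless loop.
    hanging_test = [sys.executable, "-c", "while True: pass"]
    print(run_with_overseer(hanging_test, timeout_s=1.0))
```

The only information the overseer can add is that the deadline has passed, which is the limitation the rest of this text is about.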

So finally, the control structure of such a test becomes multi-layered, with at least four programs hierarchically observing and controlling each other:

+-------------+
| data        |
+-------------+
| code        |
+-------------+
| simulator   |
+-------------+
| test script |
+-------------+
| overseer    |
+-------------+

Who is responsible for the timeout?

Recall that software tests should be:

Fast, to help humans to run them iteratively, especially when debugging.
Able to localize errors adequately, to save humans time on backtracking from consequences to causes.
Free of false positives with respect to their purpose. If a good test fails, something in its area of responsibility has changed. If nothing has changed, that means the test is bad. If something has changed, but it was outside this particular test’s area of responsibility, it also means the test does more than it is supposed to.

It is often considered that timeouts reported by the overseer and test script failures should be treated the same. Usually, the simulator code or its input falls under suspicion of being incorrect. My point is that a timeout situation is, first and foremost, an indication of a test harness defect. A fix should be made to the test harness, so that it becomes capable of detecting the specific hanging situation in the future.

Let me elaborate on why this is so.

Problems with tests failing by timeout

Surely, a problem in the simulator should still be reported and addressed. But that is not the only problem to be fixed here. Indeed, the problem should have revealed itself as a failing expectation tracked by the test harness. Instead, it was a timeout reported by the overseer one level up.

Why is relying on the overseer to report hung tests as failures bad?

It is slow. A timeout value is usually chosen to be conservatively large to compensate for all the performance volatility of modern computers. Oftentimes it is ten to a hundred times larger than a “normal” expected test run time. This means that any problem reported by the overseer takes the worst possible time to report. When a human is tasked with debugging such a problem, the debugging iterations become bound by the time it takes the test to time out.
It does not localize the problem. A good test harness is written to assert on specific conditions in the program state. A timeout simply says: “something went wrong, human, go figure it out”. Of course it is rarely possible to predict and assert on all kinds of failures, especially in an integration test. Nevertheless, the subset of usual hang causes for a given domain is rather small (otherwise you have very poorly written software, and should rethink your strategy for ensuring its quality). This handful of conditions can be gradually learned over the lifetime of the test, and incorporated into the list of assertions it tracks.
It is noisy with false positives. A timeout may still be caused by the host machine being too slow because of external factors (resource contention, hardware degradation, overheating etc.). In this case, nothing is wrong with either the test or the program. It will take human time to figure that out all the same, only to end in frustration.
It poisons the whole test suite with mistrust. The “timeout as a strategy” approach makes a single test slow to run, unable to localize problems and prone to sporadic failures, all at once. A test that times out undermines trust in the whole test suite. Every time the test suite fails because of a timeout, it forces a human to work on localizing a problem without being sure there is a problem in the first place.

What to do

As explained earlier, for a sufficiently complex integration test scenario it is impossible to predict all the conditions under which it may hang. Still, this does not mean that the situation is hopeless. Let’s see how a test scenario that has a history of failing by timeout can be improved.

Replace the scenario with one that avoids relying on timeouts. This general rule is not always directly achievable. As an example from the simulation world, sometimes you may feed a pre-recorded instruction trace to the model, instead of it being driven by execution. Once the whole trace file has been exhausted, the simulation has come to a definitive end.
Set up specific test assertions to continuously look out for known issues previously reported by the particular test case. E.g., set up watchdogs for specific error log lines. If any of them appears at any point in time, the test case is immediately marked as a failure and stopped. The error message will also be more specific than “timeout”. This approach turns the test into an assertion roulette, which is still a test design smell. It is however better than what you had before, and paves the way for further test improvements. The timeout condition should still be kept as the last resort for stopping the test case from running amok, but it should rarely be reached. Each incident of the timeout happening must be turned into a new tracked assertion condition added to the test.
Split the test into smaller phases, each with its own end assertion and its own expected run time. A timeout occurring in a specific phase helps to localize the cause to that phase. Because individual phases are smaller, they will have shorter timeouts, which again cuts the time until an average failure is reported. Finally, defining many checkpoints in your test scenario means that you understand the whole scenario better, and impose stricter expectations onto it, rather than “let it work for a while and hope for it to reach the desired state”.
Use a different notion of (no-)progress to tell that something went wrong. For example, the number of instructions or statements executed may be stable across all test runs if it is deterministically defined by the payload. Such an approach still suffers from some of the problems listed above. In particular, it still fails to localize the problem. At least it makes the test scenario less sensitive to outside factors, such as variation in host computer performance.
Instead of piling new assertions onto the same already complex integration test, it is preferable to create separate, smaller unit regression tests for the specific conditions previously observed to cause hangs in the integration test. The original integration test should not be allowed to start before these unit tests have reported success. That is, the bigger and less specific integration tests should depend on shorter and more specific unit tests to pave the way and minimize the risk of timeouts. This way, the integration test cannot hang on a known condition, because the unit tests have already shown that the condition does not occur.
Make tests faster, and drastically reduce the timeout value. If nothing else, the “run-until-timeout” test must fail as fast as possible to speed up debugging iterations. For the simulation, this means having a reduced workload that is generally faster.
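The log watchdog idea from the list above could be sketched as follows. This is only an illustration: the signature patterns, the end marker and the function name are all made up, and a real harness would read the lines from the running simulator rather than from a list.

```python
import re

# Hypothetical signatures; in practice each one is added after a real timeout incident.
KNOWN_HANG_SIGNATURES = [
    re.compile(r"unimplemented instruction"),
    re.compile(r"deadlock detected"),
]


def watch_log(lines, done_marker="TEST DONE", signatures=KNOWN_HANG_SIGNATURES):
    """Scan log lines as they arrive and stop at the first definite verdict."""
    for number, line in enumerate(lines, start=1):
        for sig in signatures:
            if sig.search(line):
                # Fail immediately, with a message more specific than "timeout".
                return ("fail", f"line {number}: matched {sig.pattern!r}")
        if done_marker in line:
            return ("pass", f"line {number}: reached end condition")
    return ("fail", "log ended without reaching the end condition")
```

Because the function stops at the first matching line, a known failure is reported as soon as it is logged, not when the overseer's deadline expires.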

The common idea here is to make individual test cases more specific: split them into pieces, speed them up, make them assert on specific things, and so on. This advice applies to all software tests, not only those prone to timeouts.
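Splitting into phases and counting progress in simulated steinstead of seconds can be combined. The sketch below assumes a toy simulator whose whole state is a dictionary advanced by a step function. The Phase type, its fields and the messages are invented for this example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Phase:
    name: str
    max_steps: int                   # budget in simulated steps, not in seconds
    reached: Callable[[dict], bool]  # end assertion for this phase


def run_phases(step, state, phases):
    """Advance the simulation phase by phase and report the first phase that stalls."""
    for phase in phases:
        steps = 0
        while not phase.reached(state):
            if steps >= phase.max_steps:
                return (f"fail: phase {phase.name!r} did not reach its end "
                        f"condition within {phase.max_steps} steps")
            step(state)
            steps += 1
    return "pass"
```

A step budget does not depend on how fast the host machine is, and the failure message names the phase in which progress stopped.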

Each layer watches exactly one layer below it

Another way to look at the suggested measures is to establish this rule. Every component in the testing stack is only responsible for monitoring correctness of one level below it. That is:

The overseer monitors the test script and reports if it misbehaves. The only type of misbehavior it knows is running for too long.
The test script monitors the simulator and asserts on specific end conditions. It also watches for any known (== previously observed) types of misbehavior.
The simulator interprets the code according to its VM specification. Usually, the program’s own bugs are not treated as direct failures. They are mapped to a halted state, or even to a restart flow of sorts. For example, a triple fault situation on the x86 architecture causes the processor to restart, which can be, and in fact is, used by certain operating systems to perform a reboot. A true “misbehavior” at this layer may include reaching incomplete parts of the specification, that is, “bugs” the simulator is self-aware about.
The code under test should validate its own data to be correct, and take appropriate action if it is not.

Special situations on deeper layers are not automatically interpreted as failures by the upper layer. For example, many normally functioning programs actively generate hardware exceptions and expect to handle them themselves later. As such, a hardware-level exception cannot always be treated as a test failure. It is the job of, e.g., the test harness to decide which situations on the level below it are normal and which are not. The burden of this decision making should not leak to higher levels of control.

Such an approach helps with error localization in this complex software stack. Following it, a timeout reported by the overseer is always caused by a defect in the test scenario, not in the payloads below it.

What difference would it make? With the naive sloppy approach, a timeout simply says: “Something may or may not be wrong in any of the layers below”. With this policy enforced, it tells us: “This test scenario is not doing its work of reporting simulator failures. It has to be fixed first”.
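One way to make this policy visible in test reports is to give timeouts their own verdict, so they are never counted as ordinary failures. This is a sketch under assumptions: the verdict names are invented, and the test script is assumed to signal failed expectations through a non-zero exit code.

```python
from enum import Enum
from typing import Optional


class Verdict(Enum):
    PASS = "pass"
    EXPECTATION_FAILED = "test script reported a failed expectation"
    HARNESS_DEFECT = "overseer timed out: the test script must be fixed first"


def classify(timed_out: bool, script_exit_code: Optional[int]) -> Verdict:
    """Map the overseer's observations to a verdict according to the layering rule."""
    if timed_out:
        # The script failed to notice the problem itself, whatever the problem is.
        return Verdict.HARNESS_DEFECT
    if script_exit_code == 0:
        return Verdict.PASS
    return Verdict.EXPECTATION_FAILED
```

With a separate verdict, the person reading the report knows immediately that the first thing to fix is the test scenario.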

Sure enough, failures by timeout do correlate with bugs in deeper layers. No doubt, those bugs should also be fixed. But that should be done after the test harness has been extended to detect the situation without relying on a hang. As before, achieving this may often mean creating a separate isolated unit test, instead of piling new assertions onto the existing integration test.

Conclusions

The overseer, the test scenario and the program under test (and all the layers inside it) are separate layers. Each layer monitors issues only in one layer below it.
Tests that fail by timing out are very bad for productivity.
It is impossible to predict all the situations that may cause a test for a Turing-complete machine to hang. But not all is lost.
As you observe and learn the failure patterns of a given integration test, turn them into individual unit test cases. Do not run the integration test until its unit test cases have proven that known hangs won’t happen.
Strive to simplify integration tests by splitting them into phases and reducing their timeouts. Even if a timeout happens, it will happen faster, will point to a specific phase, and will generally do less damage.

A test for which “timing out” is a usual way to signal problems is a buggy test. It must be fixed to become useful again.