Four outcomes of a test
Before we talk about how a test case can end, let’s recall its canonical structure. We need it because we will refer to events that can happen in the different phases of a test.
The triple A test structure
A good test case comprises three A-phases, sketched in code right after this list:
- Arrange: construct the system under test (SUT);
- Act: invoke the behavior we want to test;
- Assert: inspect the resulting SUT state against one or more expectations.
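Here is a minimal sketch of this structure in Python under pytest; the ShoppingCart class is a stand-in invented purely for illustration:

```python
class ShoppingCart:
    """Tiny stand-in SUT, invented for this illustration."""
    def __init__(self):
        self._items = []

    def add_item(self, name, price):
        self._items.append((name, price))

    def total(self):
        return sum(price for _, price in self._items)


def test_adding_item_increases_total():
    # Arrange: construct the SUT.
    cart = ShoppingCart()
    # Act: invoke the behavior under test.
    cart.add_item(name="apple", price=2)
    # Assert: check expectations about the resulting state.
    assert cart.total() == 2
```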
Sometimes the Arrange phase can be empty, when no objects need to be constructed. For example, when testing a compile-time-defined function that takes immutable inputs, there is not much to arrange because all the components already exist; they will be combined during the Act phase.
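For instance, a test of a made-up pure function collapses to Act and Assert:

```python
def slugify(title: str) -> str:
    """A made-up pure function with immutable inputs."""
    return title.strip().lower().replace(" ", "-")


def test_slugify_joins_words_with_dashes():
    # No Arrange: the function and its literal input already exist.
    result = slugify("Hello World")  # Act
    assert result == "hello-world"   # Assert
```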
A missing Act phase usually means that it is still present but hidden inside another phase. It could be that the SUT construction process creates the state that needs testing, i.e., the construction is the hidden Act. It is better to clearly separate the Act to make it obvious what exactly we are testing, as the sketch below shows.
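A hypothetical sketch of the difference, with a tiny IniConfig class invented for the occasion:

```python
class IniConfig:
    """Minimal stand-in SUT, invented for this illustration."""
    def __init__(self, text=None):
        self._data = {}
        if text is not None:
            self.parse(text)  # the constructor quietly performs the Act

    def parse(self, text):
        key, _, value = text.partition("=")
        self._data[key] = value

    def get(self, key):
        return self._data[key]


def test_parsing_hidden_inside_arrange():
    # Hidden Act: construction and the tested behavior are fused.
    config = IniConfig("key=value")  # Arrange? Act? Both at once.
    assert config.get("key") == "value"


def test_parsing_as_a_separate_act():
    # Explicit Act: it is obvious that parsing is what we test.
    config = IniConfig()                 # Arrange
    config.parse("key=value")            # Act
    assert config.get("key") == "value"  # Assert
```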
A missing Assert means that the test case does not really make any explicit statements about the SUT. We will see below why this is bad.
Mixing up or repeating the phases usually means that more than one test case is hiding in the test code. Those intertwined cases can and should be separated from each other.
Note that several related test cases may share an Arrange phase, with states that derive from each other. But we should still be able to interpret each case independently.
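In pytest, for example, such a shared Arrange is naturally expressed as a fixture that every case receives as a fresh, independent instance (reusing the stand-in ShoppingCart from the first sketch):

```python
import pytest


@pytest.fixture
def populated_cart():
    """Shared Arrange: each test gets its own fresh instance."""
    cart = ShoppingCart()  # the stand-in SUT from the sketch above
    cart.add_item(name="apple", price=2)
    return cart


def test_total_reflects_existing_item(populated_cart):
    assert populated_cart.total() == 2


def test_adding_second_item_updates_total(populated_cart):
    populated_cart.add_item(name="pear", price=3)  # Act
    assert populated_cart.total() == 5             # Assert
```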
Let’s see what outcomes we can observe after running such a test, and what typical actions each of them calls for.
Pass at the Assert
After all the expectations inside the Assert phase have been checked and hold, the execution has passed through the test’s last phase. There is nothing more for the test to do, so it terminates. It reports nothing beyond the successful exit code (an occasional “OK” or “Test passed” string may be printed, which is, by the way, redundant).
In many but not all cases, this is the outcome we hope for. The pass is also called a “success”, which is unjustly biased. The proper interpretation is that a previously established use case associated with the test case still holds. All of that is true assuming the test itself is bug-free.
For a passing test, there is usually no corrective action taken by a human. Unless, of course, you expected it to fail instead.
Fail at the Assert
This is another reaction that should be built into any test case. One of the expectations of the Assert phase does not hold: a boolean predicate about the SUT is false, an element of the SUT’s state does not match its reference value, a certain exception has or has not been generated by the Act, and so on.
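In pytest terms, those forms of expectations look roughly like this (the divide function is a stand-in for the SUT):

```python
import pytest


def divide(a, b):
    """Stand-in SUT function, invented for this illustration."""
    return a / b


def test_assertion_forms():
    result = divide(10, 4)  # Act
    assert result > 0       # a boolean predicate about the SUT holds
    assert result == 2.5    # state matches a reference value
    with pytest.raises(ZeroDivisionError):  # a certain exception is generated
        divide(1, 0)
```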
As with the pass, this outcome is unjustly called a “failure”, which brings negative connotations with it. Instead, I prefer to describe this situation as the test case reacting to a change in the production code. It has a more positive feel to it. After all, we’d expect the test to help us identify regressions, and it has done its job, so it cannot be called a failure.
This reaction can mean several things.
- The failure can mean that a previously established use case no longer holds. Most likely we have introduced a regression into the production code in a recent change, which we now need to roll back and redo more carefully. Or we might have to debug the situation to fully understand the connection between the change and the test failure. Debugging is less desirable than a simple revert, but it is sometimes unavoidable. Finally, it could also be that the test itself is too sensitive to some unrelated changes in the production code. See more explanation of this situation here. A usual fix is to adjust the expectations in the Assert to reflect the expected state of the SUT. But this is also a sign that the test may check too much and should be reworked into smaller, more focused, and less fragile tests.
- It can also mean that a known new use case is indeed not yet implemented. In this case the failure is a normal stage of the test-driven development iteration. After seeing it, we will switch to implementing the corresponding piece of production code that would allow the reacting test to pass.
Now that we have discussed the two main “normal” outcomes of a test run, let’s shift our attention to less obvious but not infrequent modes of failure. It is extremely important to be able to recognize and properly deal with those too.
Fail before the Assert
A test failure inside either the Arrange or the Act phase should not be confused with reactions coming from the Assert phase. A properly designed test is only meant to inspect the SUT and react to it inside the Assert phase. That’s where all the expectations are placed. Anything else should be classified as a test error, and this time we do want to attach a negative feeling to the words.
Surely, a test failure in any of the phases indicates that something in the production code and/or its execution environment has changed since the last time the test passed. But a failing Arrange or Act phase does not tie the situation to any use case. The first two A-phases are implementation details of the test, not directly tied to the production code, so failures in those phases give us no direct information about it.
Such a misplaced error is very unspecific. Understanding and addressing it will require costly debugging. As a result, the production system might need a change. But we should not stop there: the test itself should also be changed.
A failing Arrange or Act phase means that the test depends on something we do not fully control. A test stopped too early fails to answer the only question it must answer: does this particular assertion hold? To improve on this, we should strive to break that dependency. This can be done in several ways.
- We can make the test simpler, with a less involved Arrange phase that has fewer, simpler, or more reliable dependencies. This is where using test doubles (mocks and spies) could help. But the approach of splitting a test into smaller, simpler ones is still the best.
- If the newly discovered dependencies are parts of the production code that we still want to use, we should start testing them for health before anything can rely on them. This means creating prerequisite tests for those dependencies. Those tests will be smaller and more focused. More importantly, they should gate the invocation of the original test that allowed us to discover them: if any of the prerequisite tests fails, the original test case is not started in the first place. This way, we won’t be confused by seeing the same error in it again. Instead, the reaction of its prerequisite test will tell us where to look for the regression.
Let’s look at an example. Suppose you have a test case for a certain computation that occasionally errors out because a file it opens in its Arrange phase is placed on a network drive, and networks are not as reliable as we imagine them to be. But this test case is not meant to test the network, so such a failure is not a “good” one.
We should simplify the test by moving that file to a local disk before starting the test. This way, we remove the dependency on the network. Alternatively, the file copying step of the Arrange phase could be extracted into a new prerequisite test that focuses on ensuring that the network is indeed available. A failure to copy the file would mean that there is no point in testing the computation, so the problem is properly localized by the prerequisite test, and the computational test stays out of the picture by not being run at all.
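One lightweight way to approximate such gating in pytest is a fixture that performs the fragile step and skips its dependents when it fails; the paths below are made up for illustration:

```python
import shutil
import pytest

NETWORK_FILE = "//fileserver/share/input.csv"  # hypothetical network path
LOCAL_COPY = "/tmp/input.csv"                  # hypothetical local path


@pytest.fixture
def staged_input():
    """Prerequisite step: stage the file locally, or skip dependent tests."""
    try:
        shutil.copy(NETWORK_FILE, LOCAL_COPY)
    except OSError as err:
        pytest.skip(f"prerequisite failed, network file unavailable: {err}")
    return LOCAL_COPY


def test_computation(staged_input):
    with open(staged_input) as f:
        data = f.read()
    assert data  # placeholder for the real expectations about the computation
```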
The best course of action is to stop using the file as this test’s input altogether. This way, the test will no longer depend on any filesystem, local or networked. Instead, we should incorporate the file’s contents into a test double object placed directly inside the test case.
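A sketch of that strongest option: the file’s contents become a literal inside the test, and the SUT (here a made-up summarize function that accepts any file-like object) never touches a filesystem:

```python
import io

CANNED_INPUT = "2\n3\n5\n"  # former file contents, inlined into the test


def summarize(stream):
    """Stand-in SUT that reads from any file-like object."""
    return sum(int(line) for line in stream)


def test_summarize_from_in_memory_double():
    fake_file = io.StringIO(CANNED_INPUT)  # Arrange: no filesystem involved
    result = summarize(fake_file)          # Act
    assert result == 10                    # Assert
```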
To summarize, a test error before the Assert means there is a bug in the test, and it is the test that must be fixed first; the production code may be fixed next.
No termination
What is left to discuss is one more “outcome” of a test: no outcome at all. The language used for writing tests is Turing-complete, and the SUT is often written in the same or another similarly expressive language. It is thus possible that a given test program will never terminate with any result.
It is not possible to exclude this possibility in an automatic, programmable manner: the undecidability of the halting problem means that no amount of clever test logic can detect, in general, whether a program for a Turing-equivalent machine will terminate.
The usual workaround is adding an external timeout, after which the test is interrupted by an outside monitoring watchdog. Just as with tests that error out before completing their Assert phase, a timeout error does not tell us anything about the state of the linked use case. And similarly to those buggy tests, a proper fix should be made to the test case itself, not the production code it tests, even if the watchdog interruption, when it strikes, catches the execution inside the SUT.
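One common way to attach such a watchdog in pytest is the pytest-timeout plugin; the iteration below is a toy SUT whose termination is not obvious from the code:

```python
import itertools
import pytest


def search_fixed_point(x):
    """Toy SUT: an unbounded loop whose termination is not obvious."""
    for _ in itertools.count():
        new_x = (x + 2 / x) / 2  # Newton iteration for sqrt(2)
        if abs(new_x - x) < 1e-12:
            return new_x
        x = new_x


# Requires the pytest-timeout plugin: pip install pytest-timeout
@pytest.mark.timeout(5)  # the external watchdog interrupts the test after 5 s
def test_iteration_converges():
    assert abs(search_fixed_point(1.0) - 2 ** 0.5) < 1e-9
```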
I already wrote about the problem of hanging tests here, focusing on several approaches to dealing with such tests so that timeouts are avoided in the future.
With a test case strictly following the triple A structure, the code of the test file itself should not contain any loops with an unbounded number of iterations. But its calls into the SUT may interact with Turing-complete machinery.
It is unreasonable to expect that all production code can be written without unbounded loops. Turing-complete languages exist because they are useful for representing real-world concepts. However, by minimizing the number of use cases that invoke such “unsafe” loops, you will minimize the number of test cases that must call production code for which no termination guarantees exist.
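As a hypothetical illustration of that discipline, an unbounded polling loop can often be rewritten with an explicit bound, so that any test exercising it is guaranteed to terminate:

```python
import time


def wait_until_ready(resource, max_attempts=50, delay=0.1):
    """Bounded polling instead of `while not resource.ready(): ...`."""
    for _ in range(max_attempts):  # this loop cannot run forever
        if resource.ready():       # `ready()` is an assumed interface
            return
        time.sleep(delay)
    raise TimeoutError(f"resource not ready after {max_attempts} attempts")
```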
It is still OK to have a few unit tests that you cannot guarantee to never hang. Likewise, it is desirable to have enough integration tests, and those oftentimes have many Turing-complete machines interacting with each other. Striving to limit their number means that most of your test cases will be dealing with less powerful automata.
Achieving this ensures lower test execution times, test cases that localize problems better, simpler test maintenance, and less need for debugging when the only reaction a test case gives you is: “something has changed, human, go investigate”.
In the end, a test case that always succeeds in completing its Assert phase is more likely to tell you the truth about the SUT’s behavior exactly when you need to hear it, and in the best form.