Outcomes of tests
Patterns of behavior around tests
There are patterns in why and how tests pass and fail, and in which corrective actions developers should and should not take in response. Here are some do's and don'ts for the various good and bad situations.
We have a set of test outcomes:
- pass,
- fail,
- flaky (non-deterministic),
- and hang (timeout).
Note that “flaky” is really not a single outcome but the result of trend analysis. Indeed, for each of these outcomes, there are several possible trends:
- Always the same outcome without correlation to changes in production code.
- Unpredictably varying outcome without correlation to changes in production code.
- The outcome correlates with changes in production code.
Finally, test outcomes may result in a set of corrective actions:
- no action,
- correction in production code,
- change in test case code,
- change in test-runner code,
- removal of test,
- some sort of refactoring of test code,
- or a general deeper investigation of the situation.
Let’s see how these two groups mesh with each other. We will start with the most desirable combinations and finish with the worst ones.
- A test failure is reported at an intended assertion, and it results in a corrective action in the production code. This is a good situation: the test has detected a regression and has led to correcting it.
- A test that has never been observed to fail is suspicious and worth investigating. Often this is a meta-problem with test discovery, i.e., the test is simply never run. If the test is indeed executed, then it likely does not contain an assertion that takes inputs from the production code. A typical corrective action is to remove such a test. An alternative is a significant rework of the test that fixes its logic, as in the sketch below.
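  A minimal sketch of this pattern (the `tax` module and `compute_total` function are hypothetical, for illustration only): the first test asserts only on values it computed itself and therefore can never fail; the reworked version feeds production output into the assertion.

  ```python
  def test_total_never_fails():
      # Bug: the assertion compares two locally computed constants;
      # production code is never exercised, so this test can never fail.
      expected = 100 * 1.25
      assert expected == 125.0

  def test_total_exercises_production_code():
      # Reworked: the assertion takes its input from the code under test.
      from tax import compute_total  # hypothetical module under test
      assert compute_total(net=100, vat_rate=0.25) == 125.0
  ```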
- A test that fails all the time is overly sensitive: it has too many dependencies that change too often. One reasonable action is to replace such a test with several smaller, decoupled sub-scenarios; there should be as many new tests as needed to keep the number of dependencies in each of them to a minimum. Another possibility is to eliminate parasitic or non-essential dependencies by mocking them out, as sketched below.
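  For instance, a sketch using Python's `unittest.mock` (the `reports` module is hypothetical): the network fetch is a parasitic dependency for a formatting scenario, so it is mocked out, leaving the formatting logic as the only thing that can break the test.

  ```python
  from unittest.mock import patch

  import reports  # hypothetical module under test

  def test_summary_formatting_without_network():
      # The HTTP fetch is non-essential for this scenario; mocking it out
      # removes a frequently failing, unrelated dependency.
      with patch("reports.fetch_rows", return_value=[("alice", 3), ("bob", 5)]):
          assert reports.render_summary() == "alice: 3\nbob: 5"
  ```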
- A test that fails non-deterministically should be investigated and changed at the highest priority. If there is no time for that, such a test must be disabled or deleted. If time allows, dependencies that are not under the test’s control (such as bugs in production code or infrastructure issues) should be addressed. What should not be done is to build an infrastructural solution around the flakiness. Examples of such solutions: rerunning the flaky test several times, marking it with a different color in reports, separating it into a special phase of testing, etc. These are half-measures that leave the situation much worse: more human time is now spent manually analysing the outcomes of flaky tests. What was meant to be automated testing has become manual testing: slow, unreliable, and boring.
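  One frequent uncontrolled dependency is the real clock. A sketch (the toy `Cache` class stands in for real production code): instead of sleeping and hoping, the test injects a fake clock that it fully controls, which makes the outcome deterministic.

  ```python
  import time

  class Cache:
      """Toy TTL cache; the clock is injectable so tests can control time."""
      def __init__(self, ttl_seconds, clock=time.monotonic):
          self.ttl, self.clock, self.data = ttl_seconds, clock, {}

      def put(self, key, value):
          self.data[key] = (value, self.clock())

      def get(self, key):
          value, stored_at = self.data.get(key, (None, None))
          if stored_at is None or self.clock() - stored_at > self.ttl:
              return None
          return value

  def test_cache_expiry_is_deterministic():
      fake_now = [0.0]
      cache = Cache(ttl_seconds=10, clock=lambda: fake_now[0])
      cache.put("k", "v")
      fake_now[0] = 11.0  # advance time explicitly instead of time.sleep()
      assert cache.get("k") is None
  ```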
- An unexpected test failure that does not lead to a prompt corrective action in the production code is suspicious. Note the importance of the words “unexpected” and “production” here: the failure was not part of a TDD iteration (where it would have been expected and desired, to demonstrate that the new test case works). If a test failure is corrected by changing only the code of the test itself, then it was the test that was buggy. It was not the intention of this test to react to this particular change, and neither was it the goal of the production code to cause that reaction: tests are not tested by production code. A test case that turned out to be buggy can indicate an overly complicated test; in that case it will benefit from simplification. If the test depends on non-public aspects of the production code and therefore has to be changed together with implementation details to stay in sync with them, that is a symptom of too much intimacy with production code.
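  A sketch of this smell (the `Counter` class is a hypothetical stand-in): the first test reads a private field and must be edited on every internal rename even though behaviour is unchanged; the second depends only on the public contract.

  ```python
  class Counter:
      def __init__(self):
          self._hits = 0  # private implementation detail

      def record(self):
          self._hits += 1

      def total(self):
          return self._hits

  def test_counter_via_internals():
      c = Counter()
      c.record()
      assert c._hits == 1   # too intimate: breaks on renames, not regressions

  def test_counter_via_public_api():
      c = Counter()
      c.record()
      assert c.total() == 1  # breaks only when observable behaviour changes
  ```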
- A test that hangs and is killed by the test-runner infrastructure is super suspicious. The absolute worst corrective action is to treat timeouts as equal to test failures. They are not equivalent; in other words, it should not be considered normal for a test case to do its job by regularly causing timeouts. Surely, sometimes timeouts correlate with genuine regressions admitted into production code. But in addition to addressing those, any timeout should be investigated with the intent to correct the affected test case itself. Usually that means making the test more specific, i.e. making it monitor for the specific condition that would eventually lead to the hang. Tests that time out are very unspecific: many unrelated causes produce the same outcome. They are likely to be flaky, they require constant corrections, and they require interactive debugging and manual reruns to figure out the true underlying cause (or to discover that there was no valid cause, just bad luck). In addition, they are usually slow, because timeout thresholds are always chosen conservatively high relative to the average test case run time, and high timeout thresholds mean that manual debugging iterations will take at least as long as the timeout value.
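  A sketch of making such a test more specific: instead of blocking until the runner’s conservatively high global timeout kills it, the test polls for one named condition with a short deadline and fails with a precise message. The `wait_until` helper and the background job are illustrative, not a prescribed API.

  ```python
  import threading
  import time

  def wait_until(condition, description, deadline_seconds=2.0, interval=0.05):
      """Poll for one specific condition; fail fast with a precise message."""
      start = time.monotonic()
      while time.monotonic() - start < deadline_seconds:
          if condition():
              return
          time.sleep(interval)
      raise AssertionError(f"{description} not observed within {deadline_seconds}s")

  def test_background_job_completes():
      done = threading.Event()
      threading.Thread(target=lambda: (time.sleep(0.1), done.set())).start()
      # Fails at a named condition after 2 seconds instead of hanging until
      # the test-runner's much larger global timeout kills the whole test.
      wait_until(done.is_set, "background job completion")
  ```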
An ideal test case
An ideal test should be 100% deterministic. Every failure of such a test is acted upon by making a change to the production code (a revert or a correction), not to the test itself. When this test fails, it always fails at the expectation/assert phase and never at the arrange or act phases.
Once added to the repository, the test’s logic is never partially modified; trivial refactorings (such as renaming) are fine, though. At the end of its useful life, the test is removed as a whole. A new test for the changed use case, similar but different, may be added simultaneously with the removal to replace it.
And most important of all, the test case is as short as it can be, with exactly one logical assertion and a minimum number of dependencies.
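Putting these properties together, a sketch of such a test (the `slugify` function is a hypothetical unit under test): every phase is explicit, and there is exactly one logical assertion on the observable result.

```python
def slugify(title):
    """Hypothetical unit under test."""
    return "-".join(title.lower().split())

def test_slugify_joins_lowercased_words_with_dashes():
    # Arrange: only the input this behaviour needs, no other dependencies.
    title = "Hello Brave World"
    # Act: a single call to the unit under test.
    slug = slugify(title)
    # Assert: exactly one logical assertion, at the expectation phase.
    assert slug == "hello-brave-world"
```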