Incremental metrics provide the most value to development

Recently I realized that there is a common pattern for many “testing” activities undertaken during the daily development cycle. Many of them provide the biggest benefit if their results are reported in an incremental manner, i.e. against individual change requests, rather than against the whole code base.

Examples: code coverage, performance benchmarking, static analysis reports. Regular unit/regression/integration tests are slightly different; I will get back to them after I’ve illustrated my point.

This is especially the case when the code base is old and large, and many people are working on it concurrently.

  1. The statistics collected for it as a whole are changing too slowly to be actionable.
  2. Even if deviations from the norm are detected, tying them back to a cause requires scheduling a new “project”. That is, addressing any of the regressions feels like a major undertaking, while everybody is already busy with the next thing.
  3. They are not reported often enough. Teams are usually fine with nightly runs for the main branch, and the results are reported the next morning.

The same statistics, if scaled down to individual change requests, exhibit better properties.

  1. The results are proportionally amplified, meaning that any regression becomes obvious.
  2. They are tied to the specific person who contributed the code that caused them.
  3. They are faster to calculate and report. In the ideal case, they can be reported as annotations to the change request itself, even before it gets merged to the main branch.

Let me illustrate how global and per-change reports differ using three examples: test code coverage, performance benchmarking, and static analysis reports. All of these examples are taken from real projects I have worked on.

Test coverage

The test coverage reported for the whole project, computed as the number of lines covered divided by the total number of lines in the project, is an utterly useless number. The only exception would be if it were equal to 100%, which in practice is neither reachable nor usually desirable.

Striving to maximize such a metric has limited value to the project. Summing over everything means that important code and forgotten, disused legacy subsystems are treated equally.

A better approach is to drill down into individual subsystems with low coverage to see how (un)important they are. Based on this more focused analysis, a number of decisions may be made. The two most common ones are: 1) to delete old uncovered and unused code, and 2) to spend more time testing insufficiently covered but important code. Notice that neither decision can be made from the global metric alone; both require drilling down to narrower scopes.

And which code is more important than code that has just been updated? Providing incremental code coverage for change requests has proved to be very useful for catching silly but realistic mistakes and quality problems.
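
To make this concrete, here is a minimal sketch of how per-change coverage could be computed. It assumes that the set of lines touched by the change (e.g., parsed from the diff) and the per-file coverage data (e.g., exported from a coverage tool) are already available as plain dictionaries; the function and the data shapes are illustrative, not taken from any particular tool.

    # Hypothetical helper: coverage restricted to the lines a change touches.
    # changed_lines:    file path -> set of line numbers modified by the change request
    # covered_lines:    file path -> set of line numbers executed by the test suite
    # measurable_lines: file path -> set of executable (coverable) line numbers
    def incremental_coverage(changed_lines, covered_lines, measurable_lines):
        touched = 0   # changed lines that are executable at all
        covered = 0   # changed lines that the tests actually executed
        for path, lines in changed_lines.items():
            relevant = lines & measurable_lines.get(path, set())
            touched += len(relevant)
            covered += len(relevant & covered_lines.get(path, set()))
        if touched == 0:
            return None  # nothing measurable was changed (docs, comments, configs)
        return covered / touched

    # Example: 8 of the 10 executable changed lines are covered -> 0.8
    ratio = incremental_coverage(
        {"src/parser.py": {10, 11, 12, 13, 14, 15, 16, 17, 18, 19}},
        {"src/parser.py": set(range(1, 18))},
        {"src/parser.py": set(range(1, 200))},
    )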

A typical situation that may raise an alarm during a code review is that the test coverage for the touched code is unexpectedly low. In my practice, this can happen for several reasons.

  1. Changes to production code were not accompanied by relevant tests. Usually this is already obvious from the diff alone. Incremental test coverage provides the hard evidence backing up the reviewer’s decision to reject such a change on the grounds of being poorly tested. While a lack of tests is uncommon for well-aligned, highly professional teams, if there are external contributors unfamiliar with good development practices, it does not hurt to ensure that their contributions are well covered.
  2. The tests are present in the diff, but they are accidentally disabled, never enabled, or not discoverable by the test runner. This has happened several times in my experience: misnamed or mistagged files, a forgotten debugging early exit, etc. A human looking at the code does not always catch such an omission. Whatever the reason, the fact will be visible in the incremental test coverage report.

In rare cases, low incremental coverage for a given change request is expected by both its author and the reviewer. Such changes may still be merged. However, each such case should be justified by its business value exceeding the drawbacks. That is a decision made by humans operating on real data for just that change, not on a global metric mixing everything together.

Benchmarking

I would not claim that benchmarking your project as a whole, done on a few realistic representations of customer use cases, is totally useless. In my experience, it does allow catching blatant performance degradations caused by careless changes. A forgotten debugging print left on a critical path can usually be recognized as a sudden dip in the total results. Acting on it, however, still requires drilling down to module, method, and basic-block profiles.

Unfortunately, for everything else, trends in the project’s performance are usually lost in the measurement noise. This noise is caused by the numerous non-deterministic factors of our computers that we cannot control: varying processor frequencies, contents of caches, network fluctuations, and so on. No performance measurement experiment is free from them, no matter whether it is a macro- or a microbenchmark.

As a result, smaller-scale performance degradations usually go unaddressed until long after their introduction. At that point it is usually very hard to track the negative effects back to the change(s) that caused them.

Associating smaller microbenchmarks with individual subsystems of the code base gives better locality to the data they produce.

What microbenchmarks do allow you to see are the amplified effects of your change on specific subsystems. Not every degradation requires immediate action, however, and not every incompletely understood performance regression necessarily blocks the change request from merging. That is because not every subsystem is equally performance-critical to the quality of the project.

The runtime of big benchmarks based on customer use cases is not always under our control. A use case might dictate that a large chunk of data has to be processed, making the run time unacceptably long.

The runtime of more focused benchmarks is under better control of the programmers. Because specific, known subsystems are targeted, the volume of data they process during an experiment can be made large enough to obtain results with sufficiently narrow confidence intervals (i.e., with most of the measurement noise averaged out), yet small enough to keep the feedback loop fast.
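
Here is a minimal sketch of what such a focused microbenchmark could look like in Python. The subsystem under test (parse_records) and the repetition counts are made up for illustration; the point is that repeated measurements let us report a mean together with a confidence interval, so a per-change regression stands out from the noise.

    import statistics
    import timeit

    def parse_records(n):
        # Stand-in for the subsystem under test.
        return sum(int(x) for x in ("42 " * n).split())

    def benchmark(repeats=30, inner_loops=200):
        # Each sample is the total time of inner_loops calls.
        samples = [
            timeit.timeit(lambda: parse_records(1000), number=inner_loops)
            for _ in range(repeats)
        ]
        mean = statistics.mean(samples)
        # Rough 95% interval, assuming the sample means are approximately normal.
        half_width = 1.96 * statistics.stdev(samples) / len(samples) ** 0.5
        print(f"{mean:.4f} s ± {half_width:.4f} s per {inner_loops} calls")
        return mean, half_width

    if __name__ == "__main__":
        benchmark()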

Static analysis

Here I refer to static analysis in its widest meaning, as the automatic partial interpretation of a program. A common property of any static analysis tool is a comparatively high chance of false positives: reported issues that turn out not to be real problems after a closer examination. Humans are very good at ignoring problems that are not immediately pressing for resolution. Any long list of problems will soon be ignored, no matter that fresh, truly important issues may be appearing in that list every day.

Because of that, it is critical to carefully ration such feedback. If it strongly correlates with the code a human has recently been working on, the chances that the issues will be addressed are much higher. Conversely, if it is only infrequently reported as a long list of problems spread all over the place, nobody on the team will take it seriously.

In my project, a certain static analysis tool was originally introduced under corporate compliance pressure, without ensuring that the developers understood how to extract value from it. In fact, it was added with seemingly wrong goals in mind. As a result, the scan was slow (and nobody was trying to improve it), it was only done once per day, and it did not gate the release when the results worsened. Unsurprisingly, this static analysis was essentially ignored by everyone.

Then the scanning process was sped up to the point where it could be run frequently, and every change request could be annotated with its results. As a tool for annotating change requests, the static analysis finally started to make a difference by catching issues not detected by the compiler or the tests.
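
The core of the per-change annotation can be sketched as a simple filter: keep only the findings that fall on lines modified by the change request. The finding format below is hypothetical; a real setup would parse the specific report format of whatever analyzer is in use.

    # Keep only the findings that land on lines touched by the change.
    # findings:      iterable of (path, line, message) tuples (hypothetical format)
    # changed_lines: file path -> set of line numbers modified by the change
    def findings_for_change(findings, changed_lines):
        return [
            (path, line, message)
            for path, line, message in findings
            if line in changed_lines.get(path, set())
        ]

    report = findings_for_change(
        findings=[("src/io.c", 120, "possible null dereference"),
                  ("src/legacy.c", 7, "unused variable")],
        changed_lines={"src/io.c": {118, 119, 120, 121}},
    )
    # Only the src/io.c finding remains; the old noise in src/legacy.c is filtered out.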

We still run the old-style nightly static scanning job for the whole project. Nobody looks at it, but now for a different reason: why bother, if the per-change scans report the same information earlier?

Regular test suites

Note how all the metrics above can be approximated or represented by continuous values: test coverage as a percentage, performance as a floating-point number, static analysis as the density of reported errors. For regular test suites, the situation is more binary.

In a healthy software project, the test suite either passes at any given change request or it does not. The outcome is defined as a logical AND of the results of all the test cases constituting the suite. There should be no difference between applying the test suite to the full code base and to a smaller change request: the boolean outcome must be the same in both cases. The run time may sometimes be optimized in incremental runs by skipping subsets of tests that can be proven to be unrelated to the change, i.e., whose outcome is provably unaffected by it.
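
Such incremental test selection can be sketched as a filter over a dependency map. The map itself is hypothetical here; in practice it could be derived from build-system metadata or from previously recorded per-test coverage, and the selection is only sound if the map is complete.

    # Run only the tests whose declared dependencies intersect the changed files.
    # test_dependencies: test name -> set of source files it exercises
    def select_tests(changed_files, test_dependencies):
        changed = set(changed_files)
        return sorted(
            test for test, deps in test_dependencies.items()
            if deps & changed
        )

    to_run = select_tests(
        changed_files=["src/parser.py"],
        test_dependencies={
            "test_parser": {"src/parser.py", "src/tokens.py"},
            "test_network": {"src/net.py"},
        },
    )
    # to_run == ["test_parser"]; test_network is unaffected by the change and skipped.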

In less healthy projects, the outcomes of test suites are, regrettably, continuous values. For them, it is rarely the case that absolutely all test cases pass. Such projects resort to a weak definition of success: the tests are “good enough”. Usually this means that a certain high percentage of tests happens to pass at a given moment in time.


Written by Grigory Rechistov in Uncategorized on 06.03.2023. Tags: tests, benchmarking, static analysis, incremental analysis.


Copyright © 2023 Grigory Rechistov