Incremental metrics provide the most value to development

Updated 2023-07-06

Recently I realized that there is a common pattern for many testing-like activities performed during daily development cycles. Many of them provide the biggest benefit if their results are reported in an incremental manner, i.e. against individual change requests, rather than against the whole code base.

Examples of such processes: code coverage, performance benchmarking, and static analysis reports. However, regular unit/regression/integration tests are slightly different in the nature of the result they return. I will get back to the tests after I have illustrated my point with the other processes.

I am considering a case where the code base is old and large, and many people work on it concurrently. Small projects won't see as much difference between incremental and global results, because for them the two kinds of reports are likely to be of about the same size.

For a bigger project, the following applies to global metrics.

  1. A statistic calculated over the whole code base usually changes too slowly to be actionable.
  2. Even if a deviation from the norm is detected, tracking it back to its cause requires scheduling a new development/debugging story. In other words, investigating it feels like a major undertaking. It is hard to quickly address any of the reported regressions, because everybody is already busy with the next thing in the backlog.
  3. They are not reported often enough. Usually teams are content with nightly runs for the main branch, with results reported the next morning. This is not fast enough.

If the same statistical metrics are instead calculated for individual change requests, then they exhibit better properties.

  1. The results are proportionally amplified. Without the influence of the rest of the codebase, a regression is more visible.
  2. Reports are tied to the specific person who has recently contributed the matching code change.
  3. Numbers are faster to calculate and report. In the ideal case, they can be used as annotations to the same change request even before it gets merged to the main branch.

Let me illustrate this by comparing how global and per-change reports differ in three areas: test code coverage, performance benchmarking, and static analysis. All these examples are taken from real projects I have worked on.

Test coverage

The test coverage reported for the whole project, as the number of covered lines divided by the total number of lines in the project, is an utterly useless number. It is very hard to act on it. The only exception would be if it were equal to 100% and there were an attempt to never let it drop below that.

In practice, 100% code coverage is neither reachable nor usually desirable. Striving to maximize such a metric has limited value to the project. Summing over everything means that important code and forgotten, disused legacy subsystems are treated equally.

A better approach is to drill down into individual subsystems with low coverage to see how (un)important they are. Based on this more focused analysis, a number of decisions may be made. The two solutions most often taken to the problem of low coverage are: 1) to delete old uncovered and unused code, and 2) to spend more time testing insufficiently covered but important code. Notice that neither decision can be made based on the global metric alone; both require drilling down to narrower scopes.

What code is more important than code that has just been updated? If a programmer changes it, there is a need for that change, i.e., a customer actually runs this code and wants different behavior. Because of that, reporting test coverage for recently modified code provides more insight into how good or bad the actively used parts of the project are.
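
To make the idea concrete, here is a minimal Python sketch of what coverage restricted to a change boils down to: intersect the lines executed by the tests with the lines touched by the diff. Both inputs are assumed to come from existing tooling (a coverage run and a parsed diff), and the file names below are hypothetical.

    # Minimal sketch: coverage computed only over lines touched by a change.
    # `covered` maps a file to the set of line numbers executed by the tests;
    # `changed` maps a file to the set of lines added/modified by the diff.
    # Both are assumed to be produced by existing coverage and diff tooling.

    def incremental_coverage(covered, changed):
        """Fraction of changed lines that are exercised by the tests."""
        total = hit = 0
        for path, changed_lines in changed.items():
            covered_lines = covered.get(path, set())
            total += len(changed_lines)
            hit += len(changed_lines & covered_lines)
        return hit / total if total else 1.0  # an empty change is trivially covered

    # Hypothetical change request touching two files.
    covered = {"parser.py": {10, 11, 12, 40}, "report.py": {5, 6}}
    changed = {"parser.py": {11, 12, 13}, "report.py": {5, 6, 7, 8}}
    print(f"incremental coverage: {incremental_coverage(covered, changed):.0%}")

A change with only about half of its new lines exercised by tests stands out immediately, whereas the same lines would barely move the global percentage.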

Providing incremental code coverage for change requests proved to be very useful for catching silly but realistic mistakes and code quality problems.

A typical situation that raises an alarm during code review is that the calculated test coverage for the code touched in the change is unexpectedly low. In my practice, this had a few typical causes.

  1. Changes to production code were not accompanied by relevant tests. Usually this is already obvious from looking at the diff alone, but incremental test coverage provides the hard evidence backing up the reviewer's decision to reject such a change on the grounds of it being poorly tested. While a lack of tests is uncommon in well-aligned, highly professional teams, if there are external contributors unfamiliar with better development practices, it does not hurt to automatically prove that their contributions are well tested.
  2. The tests are in fact present in the diff, but they are accidentally disabled, left unenabled, or not discoverable by the testing framework. This happened several times in my experience because of misnamed or mistagged files and forgotten early return statements left after debugging. A human looking at the code does not always catch such an omission.

No matter the reason the test coverage is missing, the gap will be visible in the incremental test coverage report.

In rare cases, low incremental coverage for a given change request is expected by both its author and the reviewer. Such changes may still be merged. However, each such case should be justified by its business value exceeding the drawbacks. The decision is made by humans operating with real data for just that change, not a global metric that mixes everything together.

Benchmarking

I would not say that benchmarking your project as a whole, done on a few realistic representations of customer use cases, is totally useless. In my experience, it does allow catching blatant performance degradations caused by careless changes. A forgotten debugging print left on a critical path can usually be recognized as a sudden dip in the total results. Taking action, however, still requires drilling down to module, method, and basic-block profiles.

Unfortunately, for everything else, the project's performance trend is usually lost in measurement noise. This noise is caused by the numerous non-deterministic factors of our computers that we cannot control: varying processor frequencies, contents of caches, network fluctuations, and so on. No performance measurement experiment is free from them, no matter whether it is a macro- or microbenchmark.

As such, smaller performance degradations usually go untreated until a long time has passed since their introduction. At that point it is very hard to track the negative effects back to the changes that caused them.

Associating smaller microbenchmarks with subsystems of the code base gives better locality to the data they produce.

What microbenchmarks do allow you to see are the amplified effects of your change on specific subsystems. Not every degradation requires immediate action, however. Nor will every not-fully-understood performance regression necessarily block the change request from merging. That is because not every subsystem is equally performance-critical to the quality of the project. This is similar to tolerating missing code coverage for subsystems that are not as important.

The runtime of big benchmarks based on customer use cases is not always under our control. A use case might dictate that a large chunk of data has to be processed, making the run time unacceptably long.

The runtime of more focused benchmarks is under better control of programmers. Because specific, known subsystems are targeted, the volume of data processed during each experiment can be made big enough to get results with narrow confidence intervals (i.e., with most of the measurement noise averaged out), yet still small enough to keep the feedback loop fast.
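
As an illustration, here is a minimal Python sketch of such a focused benchmark loop: repeat a short, subsystem-specific workload many times and report the mean with a confidence interval, so that a real regression can be told apart from noise. The workload function is only a stand-in for a real subsystem entry point.

    # Minimal sketch of a focused microbenchmark with a confidence interval.
    # The workload is a placeholder; a real benchmark would call into a
    # specific subsystem with a fixed, repeatable input.
    import statistics
    import timeit

    def workload():
        sorted(range(10_000, 0, -1))

    def measure(repeats=30, number=100):
        # Each sample is the wall-clock time of `number` back-to-back calls.
        samples = timeit.repeat(workload, repeat=repeats, number=number)
        mean = statistics.mean(samples)
        # Approximate 95% confidence interval half-width (normal approximation).
        half_width = 1.96 * statistics.stdev(samples) / len(samples) ** 0.5
        return mean, half_width

    mean, half_width = measure()
    print(f"batch of 100 calls: {mean * 1e3:.2f} ms ± {half_width * 1e3:.2f} ms")

Increasing the number of repeats or the batch size narrows the interval at the cost of a longer run; the point is that both knobs stay in the programmer's hands.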

In summary, a set of many frequently executed small benchmarks for subsystems makes it possible to see the performance effects of change requests in a more actionable way than a few rarely executed giant benchmarks for the whole system.

Static analysis

Here I refer to static analysis in its widest meaning, as automatic partial interpretation of a program. A common property of any static analysis tool is a non-zero chance of false positives: reported issues that turn out to be unimportant after closer examination. Humans are very good at ignoring problems that do not press for immediate resolution. Any long list of problems will soon be ignored if it is demonstrated repeatedly. It would not matter that fresh, truly important issues could be appearing in this list every day.

Because of that, it is critical to carefully ration feedback from such tools. If the feedback strongly correlates with code a human has recently been working on, the chances that the issues found in it will be addressed are much higher. Conversely, if static analysis issues are only infrequently reported, as a long list of problems spread all over the code base, nobody in the team will take them seriously.
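
A simple way to achieve that correlation is to intersect the analyzer's findings with the lines touched by the change request. The sketch below assumes the findings have already been parsed into (file, line, message) tuples from whatever analyzer is in use; the file names and messages are invented for illustration.

    # Minimal sketch: report only the findings that land on changed lines.
    # `findings` is assumed to be parsed from an analyzer's report;
    # `changed` maps files to the line numbers touched by the change request.

    def findings_for_change(findings, changed):
        return [
            (path, line, message)
            for path, line, message in findings
            if line in changed.get(path, set())
        ]

    findings = [
        ("engine.c", 120, "possible null dereference"),
        ("engine.c", 802, "unused variable 'tmp'"),
        ("ui.c", 33, "implicit conversion loses precision"),
    ]
    changed = {"engine.c": {118, 119, 120, 121}}

    for path, line, message in findings_for_change(findings, changed):
        print(f"{path}:{line}: {message}")  # only the issue on a touched line is shown

Instead of hundreds of legacy warnings, the author of the change sees at most a handful of findings about the code they just wrote.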

In my project, a certain static analysis tool was originally introduced because of corporate compliance pressure. The scanning was done without ensuring that the team's developers understood how to extract value from it. In fact, the analysis job was added with seemingly wrong goals in mind. As a result, the scan process was slow (and nobody was trying to improve it), it was only done once per day, and it did not gate the release when its results worsened. Unsurprisingly, this static analysis was essentially ignored by everyone.

At some point the scanning process was sped up to the point where it could be run frequently. We switched to an approach in which every change request is annotated with static analysis results. As a tool for annotating change requests, static analysis finally started to make a difference, catching issues not detected by the compiler or tests, issues to which the programmers could react.

We still run the old-style nightly static scanning job for the whole project. Nobody looks at it, but now for a different reason: why bother, if the per-change scans report the same information earlier and in a more convenient way?

Regular test suites

Note how all the metrics above can be approximated or represented by continuous values: test coverage as a percentage, performance as a floating-point number, static analysis as a density of reported issues.

In contrast, the situation with regular test suites is more binary/boolean.

In a healthy software project, the full test suite either passes for any given change request or it does not. The outcome is defined as the logical AND of the results of all test cases constituting the suite. There should be no difference between applying the test suite to the full code base and to a smaller change request: the boolean outcome must be the same both times. The run time may sometimes be optimized for incremental runs by skipping subsets of tests that can be proven to be unrelated to the change, i.e., whose outcomes are provably unaffected by it.

In less healthy projects, the outcomes of test suites are regrettably continuous values. For them, it rarely happens that absolutely all test cases pass. Such projects resort to a weak definition of success: the tests being "good enough". Usually, this means that a certain high percentage of tests happen to pass at a given moment in time.
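
The difference between the two definitions of success fits in a few lines. In this sketch (with hypothetical test names and results), the healthy project's outcome is a logical AND over all test cases, while the less healthy one falls back to a pass rate.

    # Two notions of "the suite is green", with hypothetical test results.
    results = {"test_parse": True, "test_report": True, "test_legacy_export": False}

    # Healthy project: the suite outcome is the logical AND of all test cases.
    suite_passed = all(results.values())

    # Less healthy project: success means a "good enough" pass rate.
    pass_rate = sum(results.values()) / len(results)

    print(f"suite passed: {suite_passed}, pass rate: {pass_rate:.0%}")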


Written by Grigory Rechistov in Uncategorized on 06.03.2023. Tags: tests, benchmarking, static analysis, incremental analysis

