The baseline approach to reporting static analysis results

Static analysis is great because it lets you catch many types of issues without running the code. It is not perfect, however. When you consider it in the scope of a big, evolving project, there are dangers not always pointed out by its adopters. The main danger is that the new process can be experienced as detrimental to the team’s productivity, leading to its abandonment.

Today, I want to cover a typical situation that arises when a static analysis tool is regularly applied to an evolving codebase. I will describe a solution that is not new and has been discussed already [1], [2].

I will focus on a common mistake that is described in the literature and which I have seen myself. I’ll list the reasons why being able to tell old and new issues apart is so important, and why it is not as easy to do as one might think. Finally, I will present a strategy for establishing and maintaining a baseline.

Let’s say you discover a class of problems that can be programmatically detected while analyzing the source code. So you create a tool that does such an analysis. Alternatively, you stumble upon a ready-made tool for the problem you would like to avoid in your project.

The tool generates issues, warnings, diagnostics etc. pointing out specific places in the code base where a problematic code snippet was encountered.

Examples

When I say “static analysis”, I mean a very wide range of programs that read other programs. Just a few examples:

  • Compiler with additionally enabled warning options (-W<rule>, /W<number> etc.)
  • New version of compiler with extended list of diagnostics.
  • New dedicated static analysis tool, such as Clang-tidy, Cppcheck, Coverity, all types of linters, security scanners etc.
  • Code formatting checker.
  • Spellchecker.
  • Intellectual property, license, copyright etc. scanner meant to process either the source code or binary artifacts for legal issues.
  • Any home-grown script or program that checks the source code for any type of criterion and reports its violations. An example would be a program that checks that all C/C++ headers are self-sufficient.

Some of these programs may produce additional artifacts (such as object code), but we are only interested in the human-interpreted output they give about the input source code.

To make the analysis automatic and regular, you find a way to automatically apply this tool to all files in your codebase. As a result, it produces a list of warnings.

The warnings will be many, and you will have to live with them for a while

The first run of any new tool paints the following picture: there are hundreds or even thousands of issues lurking in the codebase! This should be no surprise, because all this code has been written with no regard to the classes of problems the new tool is designed to look for.

An obviously bad response would be to let all these new warnings keep being reported at every scan. Doing so will add to the noise the developers have to deal with, without any incentive for improving the situation. If you have ever worked on a project where a build generates numerous warnings every time it is run, and nobody plans to do anything about that, you will easily recognize the situation. Sending the output to /dev/null might be a better option in such a case.

The new knowledge you’ve just obtained about your code should be turned into actions that improve the situation. But how? The first instinct is often this: let’s fix all these issues, and keep them at zero after that! Both parts of this plan are, however, problematic in practice.

Getting to zero is harder than it seems

One may be tempted to try to persuade everyone to stop whatever they have been working on, roll up their sleeves and fix all the new warnings. The management is unlikely to agree to that though, and for a good reason. Existing obligations are usually more important than a bunch of warnings from any tool.

Usually, only the most serious of the new issues will be addressed immediately. It has to be something that directly harms the business, e.g. a security issue. But there will always be a long tail of remaining problems, either of a lower impact or of not-yet-triaged severity.

You should be prepared for your codebase to carry a non-zero number of such issues for a while. A steady improvement plan must be devised.

Keeping it at zero is even harder

But let’s imagine that, one way or another, your team has actually brought the number of issues reported by the new scanner to zero. Good! Now every single new issue reported by it is immediately visible. The newly modified code is guaranteed to be clean from now on. You may even decide to add the requirement of having zero issues to the successful build/release criteria.

Then, one day, an update to the tool comes. A compiler version upgrade is a typical situation (even if you do not use any other tools). Suddenly, many new warnings pop up in your code. That is because the tool’s new version has better diagnostic rules. But that is bad news for you, because now you cannot release your product until you get back to the zero-issue state.

Another situation: you decide to include a big chunk of third-party code in your project. And of course your tool starts complaining about the issues in that inherited code. The total number of warnings in your project jumps quite far away from zero. So you have to fix them to get back to zero, which again blocks the releases for an unpredictable amount of time.

To sum up: drastic changes in the environment or big additions to the code base will cause sudden spikes in the number of warnings. If this possibility is not accounted for, a zero-warnings policy can interrupt the continuous delivery of your project.

The correct solution: establish the baseline

Both getting to absolute zero and then staying at it are very hard. An alternative is to establish a relative zero, and improve from there. It is a more achievable goal that gives almost all the benefits of a total zero of issues, but with additional flexibility.

Instead of making a futile effort to fight all the pre-existing issues here and now, let’s acknowledge their presence. The second step is to establish a baseline for the new tool: find a way to track and report the pre-existing issues and the new issues separately. It has to be done automatically, so that a human does not have to remember which issues are “old” and which are “new”.

Being relative gives flexibility

Let’s list the benefits that having a baseline to compare against gives to your team.

  1. It helps to track issues where they matter most: in recently modified code, and when it is most convenient: before the problems reach the main branch. Reporting issues that strongly correlate with changes made by a developer helps a lot with understanding them. It is especially effective if the results are presented at diff time or earlier, e.g. by automatically annotating pull requests during code review [3].
  2. Sudden spikes do not stop the releases. No requirement to fix everything here and now means you can plan to do it over a predictable period of time. The regular work does not have to stop to give way to code cleanup. You do not even have to allocate a dedicated engineer to the task of dealing with all the issues.
  3. Sudden changes in the environment can be managed. If a dependency change brings new issues with itself, they can be added to the baseline the same way it has been done at the beginning, and dealt with over time.
  4. It does not stop you from fixing the pre-existing issues. Because they are tracked, the team can schedule baseline cleanup activities, and thus bring the total number of warnings closer to zero. This activity can be paused if more pressing issues are discovered, and resumed later. You can prioritize more serious classes of problems to be fixed first, and be realistic about when you will return to the remaining low-impact warnings.

It seems that recognizing the value of this approach is even more important than the static analysis tool it is used for. That is because it deals with the greatest risk to the acceptance of static analysis as a method: people.

The biggest risk: team rejection of regular static analysis

It does not matter how good a static analysis tool is and how important the warnings it generates are. Not keeping the amount of diagnostic output shown to the developers under control will create resistance or indifference. It will diminish the value of static analysis [4].

In practice, one or more of the following will happen.

  1. New issues in added or modified code drown in the sheer volume of the pre-existing output. Humans are unable to see any new diagnostic hidden among numerous identical lines.
  2. A cargo cult of mindlessly signing off all the issues reported by the new tool arises. People treat them as noise.
  3. People become averse to the extra manual maintenance work accompanying every change they make. The tool gets disabled, abandoned, suppressed or ignored.
  4. Future efforts of introducing additional code quality monitoring tools are met with upfront resistance based on past negative experience.

Maintaining a high signal-to-noise ratio

A tool that analyzes a single file is good if it generates a high rate of true positives and a low rate of false positives. In other words, it has to catch actual issues well without distracting too much with irrelevant diagnostic information.

Note that both requirements are essential. A proof by exaggeration:

  1. A tool that unconditionally reports a warning for every single input statement will certainly hit all the issues.
  2. A tool that immediately exits and always reports nothing will definitely give zero incorrect diagnostics.

Neither of these hypothetical tools would be particularly useful in practice, despite the first one having 100% of true positives, and the second one having 0% of false positives.

When working with a repository of files changing over time, the picture changes. We need to account for the fact that people usually work with changesets that span parts of files. Naturally, humans expect any issues to be reported against these changesets, or in files easily traced back to the changes.

The goal you must achieve is that incremental modifications to the code base are associated with proportionally sized and highly correlated warning reports. In that light, even true positives reported for an irrelevant piece of code should be treated as false positives and minimized.

So not only do warnings from multiple files have to be combined into a single report, they then have to be filtered to hide irrelevant known warnings. Only the remaining entries are shown to a human.

Filtering the global report

After applying a tool to all files in the repository, you usually end up with a list of diagnostic entries close to this format:

(file-name, line, position, error-code, human-readable description)

For example, GCC and many other tools format their warnings by separating the entry fields with colons:

/home/ggg/backend.c:69:25: warning: ignoring return value of setegid declared with attribute warn_unused_result [-Wunused-result]

Each entry points to a specific position in one of the source files.
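Turning such raw tool output into tuples can be automated with a few lines of scripting. Below is a minimal sketch in Python, assuming GCC-style file:line:column: warning: message [-Wrule] output; it is an illustration, not a description of any particular tool’s interface.

import re
import sys

# Matches GCC-style "file:line:column: warning: message [-Wrule]" entries.
# The trailing "[-Wrule]" part is optional, since not every tool emits it.
WARNING_RE = re.compile(
    r"^(?P<file>[^:]+):(?P<line>\d+):(?P<col>\d+):\s*warning:\s*"
    r"(?P<msg>.*?)(?:\s*\[(?P<rule>[-\w+=]+)\])?$"
)

def parse_report(lines):
    # Yield (file, line, column, rule, message) tuples from raw tool output.
    for raw in lines:
        m = WARNING_RE.match(raw.strip())
        if m:
            yield (m.group("file"), int(m.group("line")), int(m.group("col")),
                   m.group("rule") or "", m.group("msg"))

if __name__ == "__main__":
    for entry in parse_report(sys.stdin):
        print(entry)

The later sketches in this post assume that warnings are represented as such (file, line, column, rule, message) tuples.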

The goal now is to automatically tell which of the warnings are “old” and ignore them, and which are “new” and report them. There are several strategies for where to store the information used to make that decision.

In-line suppression

An easy solution is to store an annotation that tells the tool to ignore a specific warning right in the code. Usually it is formatted as a comment before or inline with the code:

int result = 1 << shift; // tool:ignore-possible-overflow

Other syntactic means to express this exist, e.g. pragmas in GCC:

#pragma GCC diagnostic ignored "-Wformat"
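For a home-grown checker, honoring such comment annotations is easy to implement. A minimal sketch, assuming the tool:ignore-<rule> comment convention shown above and the warning tuples from the earlier sketch:

def is_suppressed(warning, source_lines):
    # True if the offending source line carries an inline suppression comment.
    file_name, line, _column, rule, _message = warning
    return f"tool:ignore-{rule}" in source_lines[line - 1]

def filter_inline_suppressed(warnings, sources):
    # 'sources' maps file names to their content split into lines.
    return [w for w in warnings if not is_suppressed(w, sources[w[0]])]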

There are however multiple reasons why this approach is not the best one.

  1. Comments are no longer free of semantics. Their contents affect the results of code translation. This approach mixes storage of metadata and data.
  2. If there are multiple tools used, or multiple warnings are issued to the same line, then annotating it with a comment may not be possible. At the very least, the code becomes overloaded with metadata.
  3. There is a risk of editing the line and fixing the problem, but forgetting to update the annotation. Or of moving the line around but forgetting to move the annotation with it.
  4. If you are already taking the time to edit the problematic code line, why not spend it on addressing the reported issue in the first place? There would be no need for an annotation if the problem had been fixed.

Despite these limitations, many static analysis tools offer this mechanism in some form.

Scoped suppression

Excluding bigger chunks of the code base from static analysis can also be used to acknowledge that the problems existing there are here to stay. A few examples from my practice:

  1. When upgrading a compiler to a new version, it may start reporting new warnings in old code. In certain situations, it may be deemed reasonable to tune the list of compiler options for just specific existing files with known issues. This can be done by suppressing some of the checks with compiler flags. E.g. GCC has a -Wno-<diagnostic> for every -W<diagnostic> option to turn it off.
  2. A big chunk of 3rd-party code is included as a source code library. Static analysis for these files may be tuned to have less strict checks for this library code, which is not yet up to your standards of coding. It makes sense to do so for e.g. code style checks.

A common theme here is that static analysis is consciously omitted for certain scopes of the code base (a minimal sketch of expressing such exclusions at the reporting level follows the list below). Examples of such scopes:

  1. The whole codebase. In other words, one or more static analysis rules get globally disabled.
  2. Whole files and folders. This is usually performed by manipulating build rules. Makefiles and similar dependency definition languages often allow tuning rules for individual files.
  3. Parts of files. The aforementioned C pragmas are usually used to annotate more than one line inside a file.
  4. Language-specific scopes inside files: classes, functions, namespaces etc. Some tools allow annotating these scopes to treat the code inside differently.
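When the exclusion cannot be expressed in the build system or in the tool’s own configuration, a similar effect can be approximated at the reporting level by dropping warnings from the suppressed scopes. A minimal sketch; the excluded paths are hypothetical, and warnings are the tuples from the earlier sketch:

# Hypothetical list of scopes consciously excluded from analysis.
# Every entry here should be treated as technical debt to be revisited.
EXCLUDED_PREFIXES = (
    "third_party/",   # vendored library code, not yet up to our standards
    "generated/",     # machine-generated sources
)

def in_excluded_scope(file_name):
    # True if the file falls into one of the suppressed scopes.
    return file_name.startswith(EXCLUDED_PREFIXES)

def filter_scoped(warnings):
    # Drop warnings that belong to excluded scopes.
    return [w for w in warnings if not in_excluded_scope(w[0])]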

While better than nothing, such an approach is admittedly very imprecise. New issues added inside the suppressed scopes will not be reported in a timely manner. Code inside the “relaxed” scopes tends to decrease in quality as time passes.

Presumably, excluding larger chunks would be OK for “read-only” code, e.g. third-party code that you do not intend to modify directly. But in practice, how you classify the pieces your product is built from does not matter for its final quality. E.g. an ignored security issue in a third-party library will make your product vulnerable the same way an omission in your own code would.

All scopes with decreased static analysis monitoring should be treated as technical debt, and regularly reassessed.

Counting the violations

The idea of a baseline can be interpreted slightly differently: “We do not allow the number of issues in our code base to increase over time”. So it makes sense to count the total number of warnings reported for a scope (in the simplest case, the scope being the whole project). Every subsequent scan verifies that the number of issues has not increased. If it has in fact decreased, i.e. new changes contain fixes to past problems, a new baseline is established, and future scans are compared against the new threshold.
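A minimal sketch of such a check is shown below; the threshold file name and its handling are assumptions for illustration, not a description of any particular tool:

import sys
from pathlib import Path

# Hypothetical revision-controlled file holding the allowed warning count.
THRESHOLD_FILE = Path("warning_threshold.txt")

def check_ratchet(current_count):
    # Fail if the warning count grew; tighten the threshold if it shrank.
    allowed = int(THRESHOLD_FILE.read_text().strip())
    if current_count > allowed:
        print(f"FAIL: {current_count} warnings, only {allowed} allowed")
        return 1
    if current_count < allowed:
        # The ratchet clicks: record the new, lower baseline.
        THRESHOLD_FILE.write_text(f"{current_count}\n")
        print(f"Baseline tightened: {allowed} -> {current_count}")
    return 0

if __name__ == "__main__":
    sys.exit(check_ratchet(int(sys.argv[1])))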

This seemingly simple-to-implement approach has a few issues once you start thinking about it.

  1. The report is as imprecise as the chosen scope is big. For the whole project, figuring out where exactly new issues have been introduced will require additional effort from a human. For example, a report could look like this: “We had 1000 matches in the whole repository before, now we have 1001. Please go and fix the newly added match”. A human will then wonder where that new match is. Being able to provide a narrower context is very desirable (e.g. “File barbaz.c, function foobar() had ten issues, now there are eleven”).
  2. It creates a loophole to “trade in” warnings. If we count totals, then a person may (accidentally or on purpose) remove an old issue and add a new one simultaneously in a single commit. These two warnings may be very far away from each other and totally unrelated. The baseline test would not detect anything, because the sum is unchanged.
  3. The necessity to maintain the “ratchet mechanism”. The threshold value used for the comparison is stored somewhere in the repository, e.g. as a configuration value for the static analysis tool. If the actual number of warnings decreases from e.g. 1000 to 999, someone or something has to update the constant 1000 → 999 somewhere. If a human is responsible for that, it is unreliable and annoying to do regularly. If it is a machine, then scripts must be allowed to modify the repository’s source code. That is not ideal, because it risks overwriting a human’s edits to the same files or altering prepared commits. Revision-controlled files that change on their own are confusing. If the threshold number is not updated at all, it creates a window for new issues to silently creep in. E.g., with a threshold of 1000, I can decrease the actual count of issues to 990, but then other people can add up to ten new issues before any test reacts.

Database of exceptions

It seems that the most accurate way would be to track individual warnings outside of the codebase.

In this case, there is a store of known issues which is separate from the code it describes (i.e. it does not use inline annotations). An example would be a text file listing all the known warnings. It could also be an actual SQL database, an embedded SQLite storage, a JSON or XML file, or anything in between.

After each analysis pass, all reported issues are looked up in the database. Matched issues are dropped as known, and only the remaining warnings are shown as new ones.
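As a minimal sketch, the “database” can be just a text file with one known-issue signature per line; the file name and the signature format here are assumptions, and choosing what goes into the signature is exactly what the rest of this section is about:

from pathlib import Path

BASELINE_FILE = Path("known_issues.txt")  # hypothetical baseline storage

def load_baseline():
    # Read the set of known-issue signatures, one per line.
    return set(BASELINE_FILE.read_text().splitlines())

def signature(warning):
    # Build a lookup key from a (file, line, column, rule, message) tuple.
    file_name, line, column, rule, message = warning
    return f"{file_name}:{line}:{column}:{rule}:{message}"

def filter_new(warnings):
    # Return only the warnings whose signatures are not in the baseline.
    known = load_baseline()
    return [w for w in warnings if signature(w) not in known]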

A trivial example would be a spellchecker tool. It reports misspelled words as those not present in the main dictionary of the tool. But we are allowed to add new words, stored as our own persistent “user-defined” dictionary. The next run of the spellchecker will use this additional dictionary and won’t find as many misspelled words.

Many issues present in the earlier approaches are solved this way. No pollution of the codebase with metadata happens. The known issues are tracked individually, with the tightest scope possible. As with all the approaches above, the baseline has to be maintained by pruning stale entries for recently fixed issues.

This database approach is also much more involved. Its biggest strength and weakness are defined by the chosen principle of telling the old issues from the new ones. Doing it too naively will create a great deal of false positives and diminish the value of the tool. Doing it correctly usually takes some trial and error.

Let me illustrate. Suppose that every field of the tuple (file-name, line, position, error-code, human-readable description) is used to match new issues against the known ones. You have a small source file of four lines:

1 int main() {
2    int val = sketchy_function(...);
3    return val;
4 }

Suppose that line 2 generates a warning like this, which is included in the baseline:

(main.c, 2, 14, C1, "Sketchy function, do not use")

Now you edit main() by adding some code to its beginning:

1 int main() {
2    int pre_val = read_data(...);
3    int val = sketchy_function(...);
4    return val + pre_val;
5 }

The same static analysis tool will report the same issue as it did before. But its signature will be different, because the warning is now against line 3:

(main.c, 3, 14, C1, "Sketchy function, do not use")

The baseline comparison algorithm will compare this new entry against the database and conclude that this is a new warning! Now imagine that you have a long file with a few dozen known entries included in the baseline. By inserting or removing a single line at the beginning of the file, all these warnings will be reported as new, because none of them will match the entries saved in the database.

This is against the idea of the reports being correlated with the changes. A single-line change is generally not expected to produce dozens of lines of warnings, none of which are about the changed line. Let me tell you, seeing that pisses developers off.

This happens because the source code line number is a very volatile parameter. It is unsuitable for stably identifying a position in a changing code base.

The same mismatch would happen if I were to rename the file main.c to e.g. code.c. All the warnings present in it would appear as new ones to the baseline comparison algorithm. Note however that, while not uncommon, file renames happen much less often than line additions or removals. A programmer could reasonably accept that a renamed file counts as a new file, so that all the warnings present in it are treated as new ones.

What to do? Notice how the file-name part of the tuple is stable enough for identifying issues, while line numbers are volatile. We can ignore line numbers when comparing against the baseline. The key used to look up entries becomes shorter:

(main.c, 14, C1, "Sketchy function, do not use")

Doing that would cause another problem: if the same file has two identical issues on different lines, the baseline filtering will not tell them apart. In the example above, if I change the file like this:

1 int main() {
2    int val = sketchy_function(...);
3    int alt = sketchy_function(...);
4    return val + alt;
5 }

Two warnings will be issued by the tool for lines 2 and 3:

(main.c, 2, 14, C1, "Sketchy function, do not use")

(main.c, 3, 14, C1, "Sketchy function, do not use")

Both of them would match the same known-issue signature, and neither of them would be reported, despite line 3 being a new one and containing a new problem.

We could start counting the number of issues matched for each baseline entry. That brings back the problems of the counting approach, although at a much smaller scope, so it might still be a good enough addition to the algorithm.
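A minimal sketch of this per-signature counting, building on the lookup sketch above: line numbers are dropped from the key, and a warning is reported as new only when its key occurs more often than the baseline allows.

from collections import Counter

def line_agnostic_key(warning):
    # Signature without the volatile line number.
    file_name, _line, column, rule, message = warning
    return (file_name, column, rule, message)

def filter_with_counts(warnings, baseline_warnings):
    # Report a warning only if its key occurs more often than in the baseline.
    known_counts = Counter(line_agnostic_key(w) for w in baseline_warnings)
    new = []
    for w in warnings:
        key = line_agnostic_key(w)
        if known_counts[key] > 0:
            known_counts[key] -= 1   # consume one known occurrence
        else:
            new.append(w)
    return new

Note that when a duplicate is detected this way, the algorithm still cannot tell which of the identical warnings is the newly added one; it only knows that one of them is.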

If the baseline comparison could associate smaller scopes with warnings, that would improve the detection of known issues and the overall accuracy without generating too much noise. E.g. if each warning included the enclosing function name, we could count lines relative to the function’s start. This way, adding or removing code outside of that function could not affect the warnings localized inside it. The same goes for identifying individual basic blocks instead of lines. Including them in the baseline key would allow identifying the issues even more precisely.

Obtaining this information, however, makes the baseline comparison language-specific. It requires actually parsing the source code, which is not an easy task for many input languages. Even C is horribly complex in that regard.

Line numbers are nice because they are a universal concept — almost all documents are split into lines. Counting them requires very little parsing beyond searching for end-of-line sequences.

We should conclude that searching for a perfect solution is not reasonable. It is better to balance accuracy against completeness when filtering the known issues.

Taking the history into account

Can we find an additional source of information that could help us decide what is new and what is old? Yes, if you use a revision control system to store the code.

If we filter out warnings reported for lines unchanged since the last time, then the remaining warnings must be highly correlated to the modifications in the current changeset.

revgrep [5] is one such tool; it relies on a Git blame-like algorithm to filter out warnings on lines not changed since the specified revision.

Essentially, we let the revision control system match the old code against the new one using its standard diff algorithms.
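A minimal sketch of this idea (not how revgrep itself is implemented): parse the unified diff against a base revision and keep only the warnings that fall on added or modified lines. File paths are assumed to be repository-relative in both the diff and the warnings.

import re
import subprocess
from collections import defaultdict

# Extracts the start line and line count of the "new" side of a diff hunk.
HUNK_RE = re.compile(r"^@@ -\d+(?:,\d+)? \+(\d+)(?:,(\d+))? @@")

def changed_lines(base_rev="origin/main"):
    # Map file name -> set of line numbers added or modified since base_rev.
    diff = subprocess.run(
        ["git", "diff", "--unified=0", base_rev, "--"],
        capture_output=True, text=True, check=True,
    ).stdout
    changes, current_file = defaultdict(set), None
    for line in diff.splitlines():
        if line.startswith("+++ b/"):
            current_file = line[len("+++ b/"):]
        else:
            m = HUNK_RE.match(line)
            if m and current_file:
                start, count = int(m.group(1)), int(m.group(2) or "1")
                changes[current_file].update(range(start, start + count))
    return changes

def filter_unchanged(warnings, base_rev="origin/main"):
    # Keep only warnings located on lines changed since base_rev.
    changes = changed_lines(base_rev)
    return [w for w in warnings if w[1] in changes.get(w[0], set())]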

It is still not totally foolproof, one might argue. E.g. file renames will still cause all the issues in the renamed file to be reported as new. Sometimes a diff matches old and new lines differently from how a human would have correlated them. Whitespace changes, while all but invisible to humans, would also “revive” a warning that exists on the touched line.

Incremental analysis

The discussion so far dealt with filtering a full report. We assumed it was generated for the whole codebase each time. It is not always desirable to perform such a global scan. Depending on the size of the codebase and the speed of the tool, it may simply take too much time to be useful.

An incremental analysis [3] can be done, e.g., per pull request. Only files changed in that pull request are fed to the static analysis tool. The results may then be additionally filtered against the database of known issues, as discussed above.
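A minimal sketch of wiring this up; the analyzer name, the base revision and the file suffixes are placeholders:

import subprocess
import sys

def changed_files(base_rev="origin/main", suffixes=(".c", ".h")):
    # List still-existing files touched since base_rev that the tool should scan.
    out = subprocess.run(
        ["git", "diff", "--name-only", "--diff-filter=d", base_rev, "--"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [f for f in out.splitlines() if f.endswith(suffixes)]

if __name__ == "__main__":
    files = changed_files()
    if files:
        # "some-analyzer" stands in for the actual tool invocation.
        sys.exit(subprocess.run(["some-analyzer", *files]).returncode)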

The benefit of incremental analysis is that its time complexity becomes proportional to the size of the individual changeset, rather than to the size of the full codebase as in a global analysis.

It should be noted that incremental analysis is by its nature less accurate than global analysis. Suppose you want to report all functions defined but never used anywhere in the codebase. Such an analysis can only be done if the whole program is taken into account. An incremental analysis of a single-file change where a reference to a function has been removed is unable to tell whether it was the last reference to this function or not.

The ratchet mechanism

Regardless of the chosen approach, having a baseline of known issues used to filter regular static analysis reports gives you the best value-to-cost ratio from any static analysis tool:

  1. All new issues are reported clearly and in a timely manner.
  2. People are not distracted by the sheer amount of pre-existing warnings during their daily work.
  3. All the remaining issues in the code base are accounted for. They can be dealt with when time allows and as other projects are being worked on.
  4. Even if a big environmental change is brought onto the project, it is a matter of re-establishing the baseline and continuing business as usual. The team won’t suffer an unscheduled slippage of plans.

References

  1. Andrey Karpov. How to introduce a static code analyzer in a legacy project and not to discourage the team. https://pvs-studio.com/en/blog/posts/0743/
  2. Ivan Ponomarev. Introduce Static Analysis in the Process, Don’t Just Search for Bugs with It. https://habr.com/ru/post/440610/
  3. Dino Distefano et al. Scaling static analyses at Facebook. https://dl.acm.org/doi/10.1145/3338112
  4. Les Hatton. Safer C: Developing software for high-integrity and safety-critical systems. McGraw-Hill Book Company, 1995.
  5. revgrep is a CLI tool that filters static analysis output down to the lines changed since a given commit reference. https://github.com/bradleyfalzon/revgrep

Written by Grigory Rechistov on 02.02.2023. Tags: tests, static analysis, baseline, false positives.


Copyright © 2024 Grigory Rechistov