How to make a Churn-vs-Complexity chart

Michael Feathers’ idea about how to look for maintenance-problematic files in a repository: https://www.agileconnection.com/article/getting-empirical-about-refactoring. In short, the files that contain hard-to-understand code and are simultaneously modified frequently are the primary targets for refactoring. Simplifying them will give the biggest boost to the productivity of the team working with them.

There are cool enterprise solutions like SonarQube which can give this information to you. But I resorted to quick and dirty scripts.

Tools

For each source file, you will need a method to measure its churn (intensity of modification) and complexity (how hard it is for a human to understand). Note that instead of files you can use smaller units, e.g. classes or methods, but that would make the churn calculation more involved.

  1. Complexity analyzer for whole files. In the example below, I use pmccabe for C and C++ files. It can report accumulated cyclomatic complexity for the code. In the simplest case, lines of code is the metric to look at.
  2. Revision control history for individual files. For files under Git, counting the number of commits reported by git log is a good start; see the single-file example below.
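
For a single file, both measurements can be taken by hand. Here is a minimal sketch; the file path is made up, and which pmccabe column you pick depends on the metric you are after:

    # Complexity: per-file totals reported by pmccabe
    pmccabe -F src/example.c

    # Churn: number of commits that touched the file, following renames
    git log --follow --oneline -- src/example.c | wc -l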

Scripts

Here is a script that I used to apply the aforementioned tools in a loop to all files. Given that it is written in Bash, it is rather cryptic, so I had to use comments to explain what individual lines do.

    #!/bin/bash
    set -eu # Treat errors and unset variables as fatal

    # Path to the complexity reporter
    CCM="pmccabe"

    # List of all files to analyze, chosen by file extension
    FILES=`find . -name '*.[ch]' -or -name '*.cc' -or -name '*.cpp'`

    # Iterate over all files
    for F in $FILES; do
        # Collect raw output from complexity tool
        CCMRAW=`$CCM -F "$F" 2>/dev/null`

        # Parse the output into an array, splitting on whitespace
        read -r -a CCMARRAY <<< "$CCMRAW"
        # Choose specific metric, like cyclomatic complexity, or lines of code
        COMPLEXITY=${CCMARRAY[0]}

        # Get churn value from `git log`. 
        # This invocation follows all renames, and reports one line per commit.
        # Number of lines is counted by `wc -l`
        CHURN=`git log --all --follow -M -C --name-only --format='format:' -- "$F" | wc -l`

        # Print line that can be later imported to spreadsheet.
        echo $F ${COMPLEXITY} $CHURN
    done

Run the script in the top folder of your project. It writes its report to the standard output.

Save the output to a text file. Each of its lines has the format: file-name complexity churn. Note that the plotting script below expects to find this data in a file named churn-vs-complexity.csv.
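
Assuming the script above is saved as churn-vs-complexity.sh (the file name here is arbitrary), a run could look like this:

    # Run from the project root and capture the report
    # into the file that the Gnuplot script reads later
    ./churn-vs-complexity.sh > churn-vs-complexity.csv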

Plotting the data

You’ll then need software to visualize the data: given a set of X/Y points with textual labels, it should draw a scatter diagram with annotations.

I am sure Excel can do that, after some fiddling. But I used Gnuplot.

For example, Gnuplot can render an SVG file for you:

    set encoding utf8
    set terminal svg mouse standalone dynamic
    set output "output.svg"

    set ylabel "Churn"
    set xlabel "Complexity"

    plot 'churn-vs-complexity.csv' using 2:3:1 with labels left point pt 7
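
Assuming the Gnuplot commands above are saved to a file, e.g. churn-vs-complexity.gp (again, the name is arbitrary), the chart is rendered with:

    gnuplot churn-vs-complexity.gp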

Example picture

Below is a picture for the open-source variant of Intel XED.

[Figure: Churn versus Complexity plot]

As you can see, only a handful of files combine a high turnover of modifications with complex code. The majority of files are gathered in the bottom left corner.

What to do more

Feel free to modify the workflow above to suit your needs. Change the output format, use a different complexity tool (e.g. for Python it could be the mccabe package; a possible variant is sketched below), feed different file extensions to it, or collect churn from a different revision control system.
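
As a rough sketch of the Python variant, the complexity line of the script could be replaced with something like the following. It assumes the mccabe module is installed and that each of its output lines ends with a per-function complexity number; verify the actual output format before relying on it:

    # Hypothetical complexity measurement for a Python file:
    # sum the per-function complexities reported by the mccabe module
    COMPLEXITY=`python -m mccabe --min 1 "$F" | awk '{ sum += $NF } END { print sum+0 }'`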

Keep in mind that file renames may cause the churn to be under-reported. That is why I pass -M --follow to git.

Other things to try:

  1. Narrow down the scope from files to classes, or even methods/functions. It is relatively easy to do for the complexity tool, as many of them already report fine-grained information for such code units. But for the churn you’ll have to track the boundaries of individual classes or methods, which is not a trivial task. And how would you detect and treat renames in this case?
  2. Combine data for multiple types of files. E.g. if your project uses a mix of C, C++, Java and Python, it makes sense to visualize them all on the same plot.
  3. If your project has a custom language, you may need to write your own complexity measurement tool for it.
  4. Use change frequency as a metric for churn. That is, report “average changes per day” instead of “total changes since the start of history”. Or, limit the history depth to filter out less interesting historical development and focus on recent trends (see the sketch after this list).
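
A minimal sketch of the last idea: restrict the counted commits to a recent window by passing --since to git log. The one-year cutoff below is an arbitrary choice:

    # Count only the commits from the last year that touched the file
    CHURN=`git log --since="1 year ago" --follow -M -C --name-only --format='format:' -- "$F" | wc -l`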

Written by Grigory Rechistov in Uncategorized on 10.12.2021. Tags: code quality,

