Pitfalls of measuring software performance

For every measurement, there is at most one way to carry it out correctly (sometimes there is none), and innumerable ways to do it wrong.

I do not know a guaranteed method to find the right solution. Willingness to learn, to reevaluate results, to remeasure and to adjust is the only way forward I know.

I want to share a few pitfalls that I fell into (and probably will fall into again) while working on software performance and measuring said software.

I will give some examples of how I perform measurements and report results. This is in no way a recommendation on how you should do them, but rather an explanation of how some of the pitfalls can be avoided.

I hope this will help someone avoid seeing a number on their screen, writing it down and then presenting it as absolute truth, when in reality that number bears no meaning.

(Dilbert sketch on misleading benchmarks goes here.)

TL;DR

The thing you measure should be simplified in such a way that you can with reasonable confidence claim that you understand what it does and what you are measuring. This way, any numbers you obtain are more likely to faithfully describe that thing.

Defining terms

When working on software performance, we usually compare different versions of software applied to the same or different use cases, etc.

I will be using the following two qualifiers:

  1. Baseline — the “unmodified” variant of the software, e.g., an earlier version of it.
  2. Contender — the changed software that “competes” against the baseline variant.

Forgetting to ask the question that you want to answer with measurements

A measurement is one step of the scientific method cycle: after you have formed a hypothesis, you set out to make an experiment to collect evidence for proving or disproving that hypothesis.

It is better to explicitly write down this hypothesis, to avoid falling prey to the moving-the-goalposts fallacy, or to any other misunderstanding or miscommunication when presenting your results.

Examples

A few examples of such hypotheses about software that can sometimes be tested by performance measurements:

  1. A specific change in code meant to improve performance has a positive effect on the run time of this application applied to this input data, on this host.
  2. A specific refactoring has NO effect on the run time of this application on this input data, on this host.
  3. A specific change in code meant to improve performance on a fixed set of workloads has an effect on them (I will talk about how to present this data later).

Note the repetition of “specific”. The more specific such statements are, the smaller the probability is that you will miss your goal. They should be specific enough to allow for easy reproducibility. It should be possible to figure out assumptions under which the measurements were carried out.

For each experiment, write down what version of the software was used, what inputs were given to it, and what configuration of the host system carried out the calculations. Another person (a customer, a colleague, etc.) should be able to look at these notes and have enough information to decide how representative your experiment is for their use cases.

When it is impossible to capture the general picture, choose honest specificity.

Measuring only a single data point

This is just unacceptable. So many factors can negatively affect a single experiment, and you will be unaware of them.

Not only can your lonely datapoint be an order of magnitude off the mark, but you can also have measured a totally wrong thing and never notice it. It is a good thing that multiple repeated measurements give us different results: the observed variation forces us to think about the nature of the noise, to dig deeper into the experiment we’ve built, and to understand better what we are actually subjecting to measurement.

No matter which underlying statistical assumptions you have about your system, repeat your experiments at least 30 times to get multiple datapoints. You can reduce the number of repetitions in later runs if your data demonstrates that fewer runs keep comparable accuracy. If you are in a big hurry and must have some data, still collect at least three datapoints. From them, you can still gauge the validity of the results and calculate the most important statistics for your data: mean, median, standard deviation, etc.
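
To make this concrete, here is a minimal sketch in Python of collecting repeated datapoints and computing these statistics; run_workload() is a hypothetical stand-in for whatever you actually measure.

import statistics
import time

def run_workload():
    # Hypothetical stand-in for the code under test.
    sum(i * i for i in range(100_000))

def collect_samples(n_runs=30):
    # Run the workload n_runs times and return wall-clock durations in seconds.
    samples = []
    for _ in range(n_runs):
        t0 = time.perf_counter()
        run_workload()
        t1 = time.perf_counter()
        samples.append(t1 - t0)
    return samples

samples = collect_samples()
print("mean:  ", statistics.mean(samples))
print("median:", statistics.median(samples))
print("stdev: ", statistics.stdev(samples))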

For a badly designed experiment with wrong underlying assumptions, no amount of reruns will help to reduce variance in results.

Expecting only “yes” or “no” as possible answers from your measurements

It’s easy to look for a definitive Boolean answer to any yes-no question asked. When you get your measurement for the contender, it is typically compared against the baseline.

Reality is often more nuanced in this regard. There can be four, not two, outcomes when comparing baseline and contender:

  1. Yes, the contender is “better” (for the given metric) than baseline.
  2. No, the contender is “worse” (for the given metric) than baseline.
  3. There is no statistically significant difference between the baseline and the contender. The uncertainty intervals of the two measurements overlap, which means that they are indistinguishable in the current setting (a small sketch of such a check follows this list). We will return to this soon.
  4. The runs are incomparable. You discover that one of your (possibly implicit) assumptions about the system under test did not hold. That newly discovered factor makes comparing the numbers a moot point. It is time to change your experiment and start over. Maybe your change altered the behavior of the contender, e.g. introduced a bug that stops execution prematurely. Of course an interrupted scenario will be shorter than the full scenario; comparing them is pointless.
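
For outcomes 1 through 3, a minimal sketch of such a check might look as follows, assuming each result has already been summarized as a (low, high) uncertainty interval and that a smaller metric (e.g. run time) is better; outcome 4 cannot be detected numerically and requires revisiting your assumptions.

def compare(baseline, contender):
    # Classify a comparison of two (low, high) uncertainty intervals.
    b_low, b_high = baseline
    c_low, c_high = contender
    if c_high < b_low:
        return "contender is better"
    if c_low > b_high:
        return "contender is worse"
    return "no statistically significant difference (intervals overlap)"

print(compare(baseline=(34.40, 34.50), contender=(34.42, 34.47)))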

Most of the time, we expect our efforts to matter. In reality, when measured, the answer is often: “this makes no difference”.

Measuring important and non-important together

Suppose you have a big software system with many modules. You have changed one specific module and wonder what performance impact that will have on the system as a whole.

It is a mistake to start by benchmarking the whole application. Whatever performance effects your change has on it, existing or not, will be muted and watered down. There is a lot of other code executed before, after, and together with your modified module, and oftentimes you do not even know when, how often, or whether at all your module will be involved in those big scenarios. On top of that, such benchmarks also tend to be slow. You will need to rerun your benchmarks often, and if they are slow, you will be unwilling to do so, their feedback will come too late, and all the usual bad stuff follows.

If you know which module you have changed, measure that module in isolation. It will be much faster, and the results will correlate much more strongly with the change you’ve made.

Suppose your measurements show that there are no or very small effects on the module’s performance. When it is integrated into the rest of the big system, there will likely be no discernible effects either (barring rarer, unexpected crosscutting effects; you will still have to run the bigger benchmarks at some point to ensure this is not the case).

If your measurements actually indicate significant performance effects of the change on your module, only then is it time to start thinking about how this affects the big picture. How big is the percentage of time spent inside your module relative to the rest of the application? If it is low, then maybe you should not care even if your change makes the module two or more times slower. In the big picture, that won’t matter.
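
Under the assumption that the rest of the application is unaffected, Amdahl’s law gives a quick back-of-the-envelope estimate of the overall effect; a small sketch:

def overall_speedup(module_fraction, module_speedup):
    # Amdahl's law: overall speedup when a part taking module_fraction
    # of the total time becomes module_speedup times faster.
    return 1.0 / ((1.0 - module_fraction) + module_fraction / module_speedup)

# A module taking 5% of the total time, made 2x slower (speedup of 0.5):
print(overall_speedup(0.05, 0.5))  # ~0.95, i.e. only about a 5% overall slowdown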

Performance improvement work is different from performance preservation

It is a different story if you want to improve the performance of your big application and you do not know which module to modify. In this case you definitely should start with profiling the whole thing. Once you have drilled down to the specific module that you know is the bottleneck, you still need to measure it in isolation. You now have evidence that this code is important.

Do not dilute your results by including unimportant things into them.

Not making your runs reasonably fast

Benchmarking is a form of testing. Tests are useful when they are fast, because you will be willing to run them more often.

With excessively slow running benchmarks, you suffer from:

  1. The temptation (or necessity) to complete fewer runs than needed to collect enough information about the data distribution and variance.
  2. Mixing in non-important bits which reduce the representative power of your benchmark.
  3. Increased noise in performance from the host CPU varying its speed, periodic background jobs kicking in, etc.
  4. Reluctance to improve your testing fixtures, because changes made to them necessitate repeating your experiments.

To make your runs short, you will by necessity make them more focused. To do that, you will have to understand what is actually happening in the studied system, because you will have to keep the important parts while minimizing the unimportant ones.

Failing to cut down avoidable noise

Many modern general-purpose computing systems are surprisingly non-deterministic, as you will find if you dig deep enough into their documentation, including the errata documents.

This is especially true when it comes to their performance metrics. The pursuit of the highest possible peak or average performance means that the predictability of their performance characteristics has been sacrificed.

The list of factors is quite long.

  • CPUs can alter their frequency by a factor of ten or more, driven by external factors (such as temperature) and by hardware or software control decisions.
  • Network communication with the external world brings in the varying latency of network links.
  • Presence of caches on all abstraction levels results in unpredictable latency of most operations.
  • Multitasking and multi-user nature of modern OSes means that multiple processes compete for the same processor and input-output resources.

The host CPU, firmware and operating system are not your friends: they will throw all kinds of nonsense into the repeatability of your results.

You should measure and monitor what signal-to-noise level is achievable on your setup. Performance changes below that level will not be measurable, and therefore will not matter.

The Google Benchmark documentation contains specific advice [1] on how to reduce variance on a Linux host. Another chunk of solid advice comes from the Pyperf documentation [2]. These are a good start, but you need to adjust them to your case. Vary these and other factors on your benchmarking stand and learn their impact on the variance of your experiments.

Neglecting to visually evaluate your data sample before averaging it

Most of the statistical laws and math formulas for averaging are built on assumptions about the value distribution of the input data. For classical statistics, the main assumption is that the data is normally distributed, i.e., that it follows the Gaussian distribution with a defined mean and variance. Real-world data, especially data coming from computers, is almost never normally distributed [3].

This means that blindly calculating statistics will give you wrong results. The traditional arithmetic mean and standard deviation are quite sensitive to outliers in the data.

The easiest way to detect this situation is to simply look at, or better yet, graphically plot your data points. Are there obvious outliers? Negative or very small values that do not make sense? Do the datapoints cluster around more than one central point, indicating that your data distribution is multimodal?
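
For example, a quick histogram of the sample is often enough to spot outliers and multiple modes; a minimal sketch, assuming matplotlib is available and the datapoints are in a list called samples:

import matplotlib.pyplot as plt

def inspect(samples):
    # Eyeball the distribution of measured durations before averaging anything.
    plt.hist(samples, bins=30)
    plt.xlabel("duration, s")
    plt.ylabel("count")
    plt.show()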

It is not uncommon to have to clean the data of outliers, or to rerun a measurement because a data sample cannot be trusted. It is hard to detect and correct all of this automatically, and it does not scale if done manually. But it has to be done, lest your conclusions be totally off.

Learn about the more robust statistics that exist beyond the classical formulas, but more importantly, look for unexpected strangeness in your data.

Throw away your first datapoint

Let’s go back to the issue of pervasive caching in computer systems. Caches allow systems to operate faster on average, at the cost of unpredictability about which code path, with which latency, a particular transaction will take.

Ask yourself: which situation do you want to actually measure?

Often the answer is “the most typical one”, i.e. when the caches are hit. For that, they need to be filled with up-to-date values.

When an application starts fresh, its caches (and the underlying operating system’s) are usually empty. This means that its first runs are likely to be slower, because there will be more cache misses as the caches are populated. Subsequent runs starting from the shared cached state will hit those caches and generally finish faster.

You will often notice it if you take a moment and look at your data sample population. The earliest sample will likely turn out to be a slow outlier. Dropping that first sample from your data helps if your goal is to study the cached state.

Include a warm-up pass into your experiment. It should do everything a normal measurement would do, but its results are not saved. Its goal is to fill the caches in the underlying abstraction layers.
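
A minimal sketch of such a warm-up, with run_workload() again standing in for the measured code:

import time

def measure(run_workload, warmup_runs=3, measured_runs=30):
    # Warm-up runs fill the caches; their results are intentionally discarded.
    for _ in range(warmup_runs):
        run_workload()
    samples = []
    for _ in range(measured_runs):
        t0 = time.perf_counter()
        run_workload()
        t1 = time.perf_counter()
        samples.append(t1 - t0)
    return samples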

Play a bit with the amount of warm-up to see how it affects your follow-up measurements before settling on a specific procedure. The problem with caches is that they are meant to be transparent to program behavior, yet they affect performance.

Getting a better understanding of the structure of your workload (and of where and when caching happens in particular) is a good step towards making meaningful measurements.

Eating wrappers together with candy

Let’s say you want to measure latency of function fun(), but the function cannot be directly invoked in the benchmark without invoking some extra uninteresting code around it:

t0 = timer()
// implicit_prologue()
fun()
// implicit_epilogue()
t1 = timer()

duration = t1 - t0

There is some prologue that must run before it and an epilogue that follows right after fun(). This is a quite typical situation, because even the calls to the timer() facility form such a prologue/epilogue wrapper pair, and they are not instantaneous.

The value of duration will include any latency coming from implicit_prologue() and implicit_epilogue(). You are eating the wrappers together with your candy! It might not be a big deal if the duration of fun() is guaranteed to be much longer than that of the wrappers; but how can you know without measuring them in isolation?

One way to minimize the systematic error caused by the prologue/epilogue is to line up many calls to fun(). This works if a single pair of wrappers is shared by all of them.

t0 = timer()
// implicit_prologue()
fun()
fun()
fun()
fun()
... // repeat N times
fun()
// implicit_epilogue()
t1 = timer()

duration = (t1 - t0) / N

With a large N, the offset caused by non-instantaneous wrappers is proportionally reduced. It also helps when the resolution of timer() is too low to accurately measure the duration of a single fun() but is good enough for longer intervals.

An even better method to cancel out the latency of the wrappers is to measure the slope of the linear dependency between the duration and the number of repetitions. To do that, measure at least two durations for different repeat counts N1 and N2, and then deduce how much the duration changes when one more fun() is added to the sequence:

t0 = timer()
// implicit_prologue()
fun()
... // repeat N1 times
fun()
// implicit_epilogue()
t1 = timer()

duration1 = t1 - t0

t2 = timer()
// implicit_prologue()
fun()
... // repeat N2 times
fun()
// implicit_epilogue()
t3 = timer()

duration2 = t3 - t2

slope = (duration2 - duration1) / (N2 - N1)

The value of slope corresponds to a single invocation of fun() in a row of repeated calls. Of course, this is only correct if repeating the call does not create new effects, such as memoization, pipelined execution, overflow of any sort of data or instruction caches, etc.

If you measure duration for multiple values of N, you can plot the function duration(N) and check whether the original assumption about their linear relation indeed holds.
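
The same idea sketched in Python, assuming NumPy is available: fit a straight line to duration(N) with a least-squares fit. The slope then estimates a single fun() call, and the intercept estimates the combined overhead of the wrappers (here, the timer calls and the loop).

import time
import numpy as np

def fun():
    # Hypothetical function under test.
    sum(range(1000))

def duration_of(n_calls):
    t0 = time.perf_counter()
    for _ in range(n_calls):
        fun()
    t1 = time.perf_counter()
    return t1 - t0

ns = np.array([1000, 2000, 4000, 8000])
durations = np.array([duration_of(n) for n in ns])

# Fit duration(N) ~ slope * N + intercept; plotting durations against ns
# is a quick check that the relation is indeed linear.
slope, intercept = np.polyfit(ns, durations, 1)
print("per-call time:", slope, "s; overhead:", intercept, "s")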

Not automating your measurement runs

Compared to computers, humans suck when it comes to carrying out repetitive actions. As discussed above, you should be ready to repeat the experiment several times for both the baseline and contender cases. Invest a little bit of time to automate those runs. It will help you reduce the risk of human error when controlling the experiments, collecting datapoints and calculating derived values from them.

If you have built a reliable measurement stand, chances are that you will be reusing and expanding it in the future. It does not have to be anything sophisticated; the simplest linear script performing a fixed number of repeated runs and stashing the results somewhere safe is perfect for the task.
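
A sketch of such a script; the ./bench command and the results.csv location are hypothetical placeholders for your own benchmark and storage.

import csv
import subprocess
import time

RUNS = 30

with open("results.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["run", "duration_s"])
    for run in range(RUNS):
        t0 = time.perf_counter()
        subprocess.run(["./bench"], check=True)  # hypothetical benchmark command
        t1 = time.perf_counter()
        writer.writerow([run, t1 - t0])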

Automated experiments run a certain risk of missing a regression in the quality of input data, which will break the assumptions of your model. An automated program will gladly calculate and report averages from all kinds of garbage inputs (that is why it is vital to report confidence intervals alongside averages, see below). But the risk of you messing up the steps when doing everything manually is still higher.

Failure to report what you do not know alongside with what you know

In other words, not specifying confidence intervals for your reported averages. This is possibly the next biggest measurement sin after neglecting to confirm that your data fits the assumptions about its distribution.

By reporting only an average value, you inevitably throw away a lot of information from the originally collected data. There is no way to reflect everything from the original datapoints in just a single number. The simplest way to describe a big chunk of that remaining information is to report how much variation there is in the data.

Depending on how you model your domain and what distribution your datapoints have, different formulas can be used, based on the standard deviation, median absolute deviation, interquartile range, etc. It is up to you to decide the details; regardless of the chosen method, reporting confidence intervals for your values is your moral duty.
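
For example, under the (often questionable) normality assumption, a Student’s t-based confidence interval around the mean can be computed as in this sketch, assuming SciPy is available:

import statistics
from scipy import stats

def mean_with_ci(samples, confidence=0.95):
    # Return (mean, half_width) of a t-based confidence interval for the mean.
    n = len(samples)
    mean = statistics.mean(samples)
    sem = statistics.stdev(samples) / n ** 0.5  # standard error of the mean
    t_crit = stats.t.ppf((1 + confidence) / 2, df=n - 1)
    return mean, t_crit * sem

mean, half_width = mean_with_ci([34.41, 34.47, 34.50, 34.44, 34.46])
print(f"T1 = {mean:.2f} ± {half_width:.2f}")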

How can you write down your result for an experimental value T1? Let’s go from the worst format to better ones.

  1. T1 = 34.453934985. This is just bad. No confidence interval is specified. Many digits after the decimal point create a false impression of extremely high accuracy; they are at best distracting and at worst lying. Very rarely can something be measured to such high precision, and even less often is such precision needed.

  2. T1 = 34.45. Still bad. Non-significant digits are omitted, but the confidence interval is still missing. Sometimes it can be argued that the least significant digit indicates the last position that we trust to be accurate, and as such it implicitly characterizes the confidence interval. But this convention implicitly ties the measurement error to the chosen (decimal) base for writing down numbers. This is strange: how is the fact that we have ten fingers connected to the value you are measuring? Tying confidence intervals to the base of ten would only have inherent meaning if what we measured were numbers of human fingers. It is better to be explicit: untie the value from the arbitrarily chosen base, and demonstrate that you have in fact bothered to calculate the confidence interval.

  3. T1 = 34.45 ± 0.05. This should be good enough for many applications. Of course, you should document how you calculate both the average and the confidence interval (e.g., what p-value you use, etc.). Confidence intervals are often symmetrical, but do not hesitate to make them explicitly different if that fact is important to preserve in your reporting, e.g. T1 = 34.45 +0.04 −0.05.

  4. T1 = (34.01, 34.4, 34.45, 34.51, 34.59), a five-number summary (https://en.wikipedia.org/wiki/Five-number_summary). Probably overkill in many cases, but it allows you to describe the shape of your data sample.

Compare baseline against itself to validate your comparison

It is useful to run the same experiment twice on the same unmodified baseline version of your software to calculate and compare your averages, or to apply your data sample comparison method of choice to see what it reports.

You know for sure that there must be no statistically significant difference between the thing and itself; so both measurements should not be distinguishable from each other. Right?

Well, if the comparison still reports these samples as clearly different, you know that either you have failed to reduce the background noise, you have a large systematic error hidden somewhere, or something else in your method is not right. Start over and find the source of noise before comparing the baseline and the contender.
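
One way to script such a sanity check, assuming SciPy is available, is a non-parametric test such as Mann-Whitney U applied to the two baseline samples; a small p-value for two runs of the very same software is a red flag.

from scipy.stats import mannwhitneyu

def indistinguishable(baseline_run1, baseline_run2, alpha=0.05):
    # True if the two samples cannot be told apart at significance level alpha.
    _, p_value = mannwhitneyu(baseline_run1, baseline_run2)
    return p_value >= alpha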

Assuming linear nature of processes you measure

To quote [3]: “A common assumption in textbook stats is that errors have a Gaussian distribution. If your system is linear, and a large number of unknowns contribute independently to the error, the total error will be approximately Gaussian thanks to the central limit theorem. However, if your system is highly non-linear (like most computer systems), all bets are off.”

Looking at your data sample and thinking about its nature is often enough to confirm the suspicion that it cannot be reflecting a linear behavior.

For one thing, a normally distributed value with a positive central point still admits negative values as meaningful. But a communication latency is by definition a positive number! Maybe a log-normal or Pareto distribution will be a better fit for your data?

For systems with any form of caching present, the distribution of the sample data will be bimodal or multimodal: most values will be grouped around the latency of the fast code path that hits the cache, while another, smaller group of values will be grouped around the slower cache-miss case.

In such cases, I have found that the median is more robust to the existence of outliers than the mean, and the median absolute deviation makes much more sense than the standard deviation. Your mileage may vary. Look at your data.
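
A minimal sketch of these more robust statistics:

import statistics

def median_abs_deviation(samples):
    # Median of absolute deviations from the median.
    med = statistics.median(samples)
    return statistics.median(abs(x - med) for x in samples)

samples = [1.01, 1.02, 1.00, 1.03, 9.7]  # one slow cache-miss outlier
print("mean:  ", statistics.mean(samples))    # dragged up by the outlier
print("median:", statistics.median(samples))  # barely affected
print("MAD:   ", median_abs_deviation(samples))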

Stumbling upon a linear process is more of an exception than the rule here.

Adding up values that do not make sense together

The first step of finding many types of averages is summation of data samples.

To quote [3] again, “the average is calculated by adding up all the values, so it’s only meaningful if adding up delays is meaningful in the first place”.

While it is strictly wrong (a type violation?) to add values of inconsistent dimensions (e.g., adding milliseconds to page faults), even values measured in the same units do not always have meaning when clumped together. Let’s start with a few everyday examples and complement them with computer-related ones.

  1. Average body temperature in a hospital. The sample population here is all patients, admitted to the hospital for a variety of unrelated reasons. The reason each patient has a given temperature is independent of why the others are there, so their average temperature does not characterize anyone in particular.
  2. The average speed of a trip is not calculated as a sum of speeds sampled at points during the trip. A higher speed means that a given distance is covered in less time, i.e., that its share of the total travel time is smaller. A harmonic mean is likely what you need (see the sketch after this list).
  3. Imagine you have a set of software application workloads (SPEC benchmarks come to mind). Summing up their execution times only makes sense if they are executed serially by the same customer. This is not how those programs are used in practice: not all customers run all of them, and they are likely started independently of each other.
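
To illustrate the speed example above: driving two legs of equal length at 20 and 60 km/h gives an average trip speed of 30 km/h, which is the harmonic mean, not the arithmetic mean of 40.

import statistics

speeds = [20, 60]  # km/h over two legs of equal distance
print(statistics.harmonic_mean(speeds))  # 30.0, the true average trip speed
print(statistics.mean(speeds))           # 40.0, which overstates it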

The summation loses information about individual components. Be sure that you do want to lose that bit of information. Otherwise, it is better to treat unrelated values separately.

Dividing by multimodal populations

In division, quotient = dividend / divisor.

It is easy to get a quotient that has no sensible physical meaning if the divisor has a multimodal distribution.

Let’s take an example: gross national product (GNP) per capita. The trouble with meaning starts with the GNP itself: it mixes all sorts of economic output from different industries. Dividing it by the population count implies that every person contributes to that value at the same rate. This is a very unsubstantiated assumption: poor, middle-income and rich people have very different contribution patterns.

Summing values together into a dividend makes sense if the individual entities behind those numbers belong to the same mode. The same applies to the entities whose values go into the divisor: they all must have the same meaning in the chosen model. If the divisor clumps two or more modalities together, e.g., “thin” and “fat” entities, then the result of the division will not be meaningful: it will not associate the quotient with either of the modes.

Summary of steps for carrying out measurements

  1. Write down what question your measurements are meant to answer.
  2. Write down the compared scenarios being measured, including the versions of applications under test, the input data fed to them, and the host configurations, with enough detail to be able to retrace your steps.
  3. Automate your experiments to exclude as much of manual processing as reasonable.
  4. Run the baseline experiment as many times as you can afford; 30 is usually enough.
  5. Look at the datapoints. Do they look normally distributed? Any unexplained outliers?
  6. Run the baseline experiment again.
  7. Compare two baseline results. Can your comparison algorithm tell them apart? If yes, you have too much leftover noise affecting your system. Reduce the noise and start over, collect more datapoints, and/or use more advanced statistical tests.
  8. Run the contender experiment.

More practical advice follows. Not everything below should be done unconditionally, but you should measure the effects of different sources of noise on your experiments. Nor is this list exhaustive. And some of the advice should be ignored when you do want to include a certain factor in the results.

  1. Do not run on a busy system. It should not have multiple users logged in; background tasks should be reduced to a minimum.
  2. Disable features for opportunistic performance, such as Intel Turbo Boost.
  3. Fix the frequency of your CPU. Usually the stable sustained frequency will be noticeably lower than what the marketing data for that CPU promises.
  4. Be mindful of whether you want to include I/O in the measurement.
  5. Pin your threads to logical CPUs to prevent both migration costs and jumps in timer values (see the sketch after this list).
  6. Verify assumptions about the host before and after the measurements. Was the system quiesced all the way through? How many page faults were served for your program? Were there any messages about thermal throttling in the kernel logs? Did any counters indicating unexpected concurrent activity increase? And so on. Be ready to throw a data sample away if it has been contaminated by any such unaccounted-for factors.
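
A sketch of how some of these steps can be scripted from within a Python harness on Linux; os.sched_setaffinity and resource.getrusage are standard-library calls, while the choice of CPU index and of which counters to check is up to your setup.

import os
import resource

# Pin the current process to a single logical CPU (Linux only); index 2 is arbitrary.
os.sched_setaffinity(0, {2})

def major_page_faults():
    return resource.getrusage(resource.RUSAGE_SELF).ru_majflt

faults_before = major_page_faults()
# ... run the measured workload here ...
faults_after = major_page_faults()
if faults_after > faults_before:
    print("Warning: major page faults during the run; consider discarding this sample.")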

Conclusions

Nobody said it would be easy. But it can be fun.

References

  1. https://github.com/google/benchmark/blob/main/docs/reducing_variance.md
  2. https://pyperf.readthedocs.io/en/latest/system.html
  3. https://theartofmachinery.com/2021/12/01/textbook_stats_and_tech.html

Written by Grigory Rechistov on 02.12.2025. Tags: statistics, performance, measurement, experiment.


Copyright © 2025 Grigory Rechistov