Pitfalls of measuring software performance

For every measurement, there is at most one way to carry it out correctly (sometimes there is none), and innumerable ways to do it wrong.

I do not know a guaranteed method to find the right solution. Willingness to learn, to reevaluate results, to remeasure and to adjust is the only way forward I know.

I want to share a few pitfalls that I fell into (and probably will fall into again) while working on software performance and measuring said software.

I will give some examples of how I perform measurements and report results. This is in no way a recommendation on how you should do them, but rather an explanation of how some of the pitfalls can be avoided.

I hope this will help someone avoid seeing a number on their screen, writing it down and then presenting it as absolute truth, when in reality that number bears no meaning.

(Dilbert sketch on misleading benchmarks goes here.)

TL;DR

The thing you measure should be simplified in such a way that you can with reasonable confidence claim that you understand what it does and what you are measuring. This way, any numbers you obtain are more likely to faithfully describe that thing.

Defining terms

When working on software performance, we usually compare different versions of software applied to the same or different use cases, etc.

I will be using the following two qualifiers:

  1. Baseline — the “unmodified” variant of the software, e.g., an earlier version of it.
  2. Contender — the changed software that “competes” against the baseline variant.

Forgetting to ask the question that you want to answer with measurements

A measurement is one step of the scientific method cycle: after you have formed a hypothesis, you set out to make an experiment to collect evidence for proving or disproving that hypothesis.

It is better to explicitly write down this hypothesis, to avoid falling prey to the moving-the-goalposts fallacy, or to any other misunderstanding or miscommunication when presenting your results.

Examples

A few examples of such hypotheses about software that can sometimes be tested by performance measurements:

  1. A specific change in code meant to improve performance has a positive effect on the run time of this application applied to this input data, on this host.
  2. A specific refactoring has NO effect on the run time of this application on this input data, on this host.
  3. A specific change in code meant to improve performance on a fixed set of workloads has an effect on them (I will talk about how to present this data later).

Note the repetition of “specific”. The more specific such statements are, the smaller the probability is that you will miss your goal. They should be specific enough to allow for easy reproducibility. It should be possible to figure out assumptions under which the measurements were carried out.

For each experiment, write down what version of the software was used, what inputs were given to it, and what configuration of the host system carried out the calculations. Another person (a customer, a colleague, etc.) should be able to look at these notes and have enough information to decide how representative your experiment is for their use cases.

When it is impossible to capture the general picture, choose honest specificity.

Measuring only a single data point

This is just unacceptable. So many factors can negatively affect a single experiment, and you will be unaware of them.

Not only can your lonely datapoint be an order of magnitude off the mark, but you can also have measured a totally wrong thing and never notice it. It is a good thing that multiple repeated measurements give us different results: the observed variation forces us to think about the nature of the noise, to dig deeper into the experiment we’ve built, and to understand better what we are actually subjecting to measurement.

No matter which underlying statistical assumptions you have about your system, repeat your experiments at least 30 times to get multiple datapoints. You can reduce the number of repetitions in later runs if your data demonstrates that fewer runs keep comparable accuracy. If you are in a big hurry and must have some data, still collect at least three datapoints. From them, you can still gauge the validity of the results and calculate the most important statistics for your data: mean, median, standard deviation, etc.
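
To make this concrete, here is a minimal sketch in Python of collecting repeated datapoints and computing these statistics; run_workload() is a hypothetical stand-in for whatever you actually measure.

import statistics
import time

def run_workload():
    # Hypothetical stand-in for the code under test.
    sum(i * i for i in range(100_000))

def collect_samples(n_runs=30):
    # Run the workload n_runs times and return wall-clock durations in seconds.
    samples = []
    for _ in range(n_runs):
        t0 = time.perf_counter()
        run_workload()
        t1 = time.perf_counter()
        samples.append(t1 - t0)
    return samples

samples = collect_samples()
print("mean:  ", statistics.mean(samples))
print("median:", statistics.median(samples))
print("stdev: ", statistics.stdev(samples))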

For a badly designed experiment with wrong underlying assumptions, no amount of reruns will help to reduce variance in results.

Expecting only “yes” or “no” as possible answers from your measurements

It’s easy to look for a definitive Boolean answer to any yes-no question asked. When you get your measurement for the contender, it is typically compared against the baseline.

Reality is often more nuanced in this regard. There can be four, not two, outcomes when comparing baseline and contender:

  1. Yes, the contender is “better” (for the given metric) than baseline.
  2. No, the contender is “worse” (for the given metric) than baseline.
  3. There is no statistically significant difference between the baseline and the contender. The uncertainty intervals of the two measurements overlap, which means that they are indistinguishable in the current setting (a small sketch of such a check follows this list). We will return to this soon.
  4. The runs are incomparable. You discover that one of your (possibly implicit) assumptions about the system under test did not hold. That newly discovered factor makes comparing the numbers a moot point. It is time to change your experiment and start over. Maybe your change altered the behavior of the contender, e.g. introduced a bug that stops execution prematurely. Of course an interrupted scenario will be shorter than the full scenario; comparing them is pointless.
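
For outcomes 1 through 3, a minimal sketch of such a check might look as follows, assuming each result has already been summarized as a (low, high) uncertainty interval and that a smaller metric (e.g. run time) is better; outcome 4 cannot be detected numerically and requires revisiting your assumptions.

def compare(baseline, contender):
    # Classify a comparison of two (low, high) uncertainty intervals.
    b_low, b_high = baseline
    c_low, c_high = contender
    if c_high < b_low:
        return "contender is better"
    if c_low > b_high:
        return "contender is worse"
    return "no statistically significant difference (intervals overlap)"

print(compare(baseline=(34.40, 34.50), contender=(34.42, 34.47)))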

Most of the time, we expect our efforts to matter. In reality, when measured, the answer is often: “this makes no difference”.

Measuring important and non-important together

Suppose you have a big software system with many modules. You have changed one specific module and wonder what performance impact that will have on the system as a whole.

It is a mistake to start by benchmarking the whole application. Whatever performance effects your change has on it, existing or not, will be muted and watered down. There is a lot of other code executed before, after, and together with your modified module, and oftentimes you do not even know when, how often, or whether at all your module will be involved in those big scenarios. On top of that, such benchmarks also tend to be slow. You will need to rerun your benchmarks often, and if they are slow, you will be unwilling to do so, their feedback will come too late, and all the usual bad stuff follows.

If you know which module you have changed, measure that module in isolation. It will be much faster, and the results will correlate much more strongly with the change you’ve made.

Suppose your measurements show that there are no or very small effects on the module’s performance. When it is integrated into the rest of the big system, there will likely be no discernible effects either (barring rarer, unexpected crosscutting effects; you will still have to run the bigger benchmarks at some point to ensure this is not the case).

If your measurements actually indicate significant performance effects of the change on your module, only then is it time to start thinking about how this affects the big picture. How big is the percentage of time spent inside your module relative to the rest of the application? If it is low, then maybe you should not care even if your change makes the module two or more times slower. In the big picture, that won’t matter.
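
Under the assumption that the rest of the application is unaffected, Amdahl’s law gives a quick back-of-the-envelope estimate of the overall effect; a small sketch:

def overall_speedup(module_fraction, module_speedup):
    # Amdahl's law: overall speedup when a part taking module_fraction
    # of the total time becomes module_speedup times faster.
    return 1.0 / ((1.0 - module_fraction) + module_fraction / module_speedup)

# A module taking 5% of the total time, made 2x slower (speedup of 0.5):
print(overall_speedup(0.05, 0.5))  # ~0.95, i.e. only about a 5% overall slowdown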

Performance improvement work is different from performance preservation

It is a different story if you want to improve the performance of your big application and you do not know which module to modify. In this case you definitely should start with profiling the whole thing. Once you have drilled down to the specific module that you know is the bottleneck, you still need to measure it in isolation. You now have evidence that this code is important.

Do not dilute your results by including unimportant things into them.

Not making your runs reasonably fast

Benchmarking is a form of testing. Tests are useful when they are fast, because you will be willing to run them more often.

With excessively slow running benchmarks, you suffer from:

  1. The temptation (or necessity) to complete fewer runs than needed to collect enough information about the data distribution and variance.
  2. Mixing in non-important bits which reduce the representative power of your benchmark.
  3. Increased noise in performance from the host CPU varying its speed, periodic background jobs kicking in, etc.
  4. Reluctance to improve your testing fixtures, because changes made to them necessitate repeating your experiments.

To make your runs short, you will by necessity make them more focused. To do that, you will have to understand what is actually happening in the studied system, because you will have to keep the important parts while minimizing the unimportant ones.

Failing to cut down avoidable noise

Many modern general-purpose computing systems are surprisingly non-deterministic, as you will find if you dig deep enough into their documentation, including the errata documents.

This is especially true when it comes to their performance metrics. The pursuit of the highest possible peak or average performance means that the predictability of their performance characteristics has been sacrificed.

The list of factors is quite long.

  • CPUs can alter their frequency by a factor of ten or more, driven by external factors (such as temperature) and by hardware or software control decisions.
  • Network communication with the external world brings in the varying latency of network links.
  • Presence of caches on all abstraction levels results in unpredictable latency of most operations.
  • Multitasking and multi-user nature of modern OSes means that multiple processes compete for the same processor and input-output resources.

The host CPU, firmware and operating system are not your friends: they will throw all kinds of nonsense into the repeatability of your results.

You should measure and monitor what signal-to-noise level is achievable on your setup. Performance changes below that level will not be measurable, and therefore will not matter.

The Google Benchmark documentation contains specific advice [1] on how to reduce variance on a Linux host. Another chunk of solid advice comes from the Pyperf documentation [2]. These are a good start, but you need to adjust them to your case. Vary these and other factors on your benchmarking stand and learn their impact on the variance of your experiments.

Neglecting to visually evaluate your data sample before averaging it

Most of the statistical laws and math formulas for averaging are built on assumptions about the value distribution of the input data. For classical statistics, the main assumption is that the data is normally distributed, i.e., that it follows the Gaussian distribution with a defined mean and variance. Real-world data, especially data coming from computers, is almost never normally distributed [3].

This means that blindly calculating statistics will give you wrong results. The traditional arithmetic mean and standard deviation are quite sensitive to outliers in the data.

The easiest way to detect this situation is to simply look at, or better yet, graphically plot your data points. Are there obvious outliers? Negative or very small values that do not make sense? Do the datapoints cluster around more than one central point, indicating that your data distribution is multimodal?
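
For example, a quick histogram of the sample is often enough to spot outliers and multiple modes; a minimal sketch, assuming matplotlib is available and the datapoints are in a list called samples:

import matplotlib.pyplot as plt

def inspect(samples):
    # Eyeball the distribution of measured durations before averaging anything.
    plt.hist(samples, bins=30)
    plt.xlabel("duration, s")
    plt.ylabel("count")
    plt.show()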

It is not uncommon to have to clean the data of outliers, or to rerun a measurement because a data sample cannot be trusted. It is hard to detect and correct all of this automatically, and it does not scale if done manually. But it has to be done, lest your conclusions be totally off.

Learn about the more robust statistics that exist beyond the classical formulas, but more importantly, look for unexpected strangeness in your data.

Throw away your first datapoint

Let’s go back to the issue of pervasive caching in computer systems. Caches allow systems to operate faster on average, at the cost of unpredictability about which code path, with which latency, a particular transaction will take.

Ask yourself: which situation do you want to actually measure?

Often the answer is “the most typical one”, i.e. when the caches are hit. For that, they need to be filled with up-to-date values.

When an application starts fresh, its caches (and the underlying operating system’s) are usually empty. This means that its first runs are likely to be slower, because there will be more cache misses as the caches are populated. Subsequent runs starting from the shared cached state will hit those caches and generally finish faster.

You will often notice it if you take a moment and look at your data sample population. The earliest sample will likely turn out to be a slow outlier. Dropping that first sample from your data helps if your goal is to study the cached state.

Include a warm-up pass into your experiment. It should do everything a normal measurement would do, but its results are not saved. Its goal is to fill the caches in the underlying abstraction layers.
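
A minimal sketch of such a warm-up, with run_workload() again standing in for the measured code:

import time

def measure(run_workload, warmup_runs=3, measured_runs=30):
    # Warm-up runs fill the caches; their results are intentionally discarded.
    for _ in range(warmup_runs):
        run_workload()
    samples = []
    for _ in range(measured_runs):
        t0 = time.perf_counter()
        run_workload()
        t1 = time.perf_counter()
        samples.append(t1 - t0)
    return samples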

Play a bit with the amount of warm-up to see how it affects your follow-up measurements before settling on a specific procedure. The problem with caches is that they are meant to be transparent to program behavior, yet they affect performance.

Getting a better understanding of the structure of your workload (and of where and when caching happens in particular) is a good step towards making meaningful measurements.

Eating wrappers together with candy

Let’s say you want to measure latency of function fun(), but the function cannot be directly invoked in the benchmark without invoking some extra uninteresting code around it:

t0 = timer()
// implicit_prologue()
fun()
// implicit_epilogue()
t1 = timer()

duration = t1 - t0

There is some prologue that must run before it and an epilogue that follows right after fun(). This is a quite typical situation, because even the calls to the timer() facility form such a prologue/epilogue wrapper pair, and they are not instantaneous.

The value of duration will include any latency coming from implicit_prologue() and implicit_epilogue(). You are eating the wrappers together with your candy! It might not be a big deal if the duration of fun() is guaranteed to be much longer than that of the wrappers; but how can you know without measuring them in isolation?

One way to minimize the systematic error caused by the prologue/epilogue is to line up many calls to fun(). This works if a single pair of wrappers is shared by all of them.

t0 = timer()
// implicit_prologue()
fun()
fun()
fun()
fun()
... // repeat N times
fun()
// implicit_epilogue()
t1 = timer()

duration = (t1 - t0) / N

With a large N, the offset caused by non-instantaneous wrappers is proportionally reduced. It also helps when the resolution of timer() is too low to accurately measure the duration of a single fun() but is good enough for longer intervals.

An even better method to cancel out the latency of the wrappers is to measure the slope of the linear dependency between the duration and the number of repetitions. To do that, measure at least two durations for different repeat counts N1 and N2, and then deduce how much the duration changes when one more fun() is added to the sequence:

t0 = timer()
// implicit_prologue()
fun()
... // repeat N1 times
fun()
// implicit_epilogue()
t1 = timer()

duration1 = t1 - t0

t2 = timer()
// implicit_prologue()
fun()
... // repeat N2 times
fun()
// implicit_epilogue()
t3 = timer()

duration2 = t3 - t2

slope = (duration2 - duration1) / (N2 - N1)

The value of slope corresponds to a single invocation of fun() in a row of repeated calls. Of course, this is only correct if repeating the call does not create new effects, such as memoization, pipelined execution, overflow of any sort of data or instruction caches, etc.

If you measure duration for multiple values of N, you can plot the function duration(N) and check whether the original assumption about their linear relation indeed holds.
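
The same idea sketched in Python, assuming NumPy is available: fit a straight line to duration(N) with a least-squares fit. The slope then estimates a single fun() call, and the intercept estimates the combined overhead of the wrappers (here, the timer calls and the loop).

import time
import numpy as np

def fun():
    # Hypothetical function under test.
    sum(range(1000))

def duration_of(n_calls):
    t0 = time.perf_counter()
    for _ in range(n_calls):
        fun()
    t1 = time.perf_counter()
    return t1 - t0

ns = np.array([1000, 2000, 4000, 8000])
durations = np.array([duration_of(n) for n in ns])

# Fit duration(N) ~ slope * N + intercept; plotting durations against ns
# is a quick check that the relation is indeed linear.
slope, intercept = np.polyfit(ns, durations, 1)
print("per-call time:", slope, "s; overhead:", intercept, "s")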

Not automating your measurement runs

Compared to computers, humans suck when it comes to carrying out repetitive actions. As discussed above, you should be ready to repeat the experiment several times for both the baseline and contender cases. Invest a little bit of time to automate those runs. It will help you reduce the risk of human error when controlling the experiments, collecting datapoints and calculating derived values from them.

If you have built a reliable measurement stand, chances are that you will be reusing and expanding it in the future. It does not have to be anything sophisticated; the simplest linear script performing a fixed number of repeated runs and stashing the results somewhere safe is perfect for the task.
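
A sketch of such a script; the ./bench command and the results.csv location are hypothetical placeholders for your own benchmark and storage.

import csv
import subprocess
import time

RUNS = 30

with open("results.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["run", "duration_s"])
    for run in range(RUNS):
        t0 = time.perf_counter()
        subprocess.run(["./bench"], check=True)  # hypothetical benchmark command
        t1 = time.perf_counter()
        writer.writerow([run, t1 - t0])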

Automated experiments run a certain risk of missing a regression in the quality of input data, which will break the assumptions of your model. An automated program will gladly calculate and report averages from all kinds of garbage inputs (that is why it is vital to report confidence intervals alongside averages, see below). But the risk of you messing up the steps when doing everything manually is still higher.

Failure to report what you do not know alongside with what you know

In other words, not specifying confidence intervals for your reported averages. This is possibly the next biggest measurement sin after neglecting to confirm that your data fits the assumptions about its distribution.

By reporting only an average value, you inevitably throw away a lot of information from the originally collected data. There is no way to reflect everything from the original datapoints in just a single number. The simplest way to describe a big chunk of that remaining information is to report how much variation there is in the data.

Depending on how you model your domain and what distribution your datapoints have, different formulas can be used, based on the standard deviation, median absolute deviation, interquartile range, etc. It is up to you to decide the details; regardless of the chosen method, reporting confidence intervals for your values is your moral duty.
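
For example, under the (often questionable) normality assumption, a Student’s t-based confidence interval around the mean can be computed as in this sketch, assuming SciPy is available:

import statistics
from scipy import stats

def mean_with_ci(samples, confidence=0.95):
    # Return (mean, half_width) of a t-based confidence interval for the mean.
    n = len(samples)
    mean = statistics.mean(samples)
    sem = statistics.stdev(samples) / n ** 0.5  # standard error of the mean
    t_crit = stats.t.ppf((1 + confidence) / 2, df=n - 1)
    return mean, t_crit * sem

mean, half_width = mean_with_ci([34.41, 34.47, 34.50, 34.44, 34.46])
print(f"T1 = {mean:.2f} ± {half_width:.2f}")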

How can you write down your result for an experimental value T1? Let’s go from the worst format to better ones.

  1. T1 = 34.453934985. This is just bad. No confidence interval is specified. Many digits after the decimal point create a false impression of extremely high accuracy; they are at best distracting and at worst lying. Very rarely can something be measured to such high precision, and even less often is such precision needed.

  2. T1 = 34.45. Still bad. Non-significant digits are omitted, but the confidence interval is still missing. Sometimes it can be argued that the least significant digit indicates the last position that we trust to be accurate, and as such it implicitly characterizes the confidence interval. But this convention implicitly ties the measurement error to the chosen (decimal) base for writing down numbers. This is strange: how is the fact that we have ten fingers connected to the value you are measuring? Tying confidence intervals to the base of ten would only have inherent meaning if what we measured were numbers of human fingers. It is better to be explicit: untie the value from the arbitrarily chosen base, and demonstrate that you have in fact bothered to calculate the confidence interval.

  3. T1 = 34.45 ± 0.05. This should be good enough for many applications. Of course, you should document how you calculate both the average and the confidence interval (e.g., what p-value you use, etc.). Confidence intervals are often symmetrical, but do not hesitate to make them explicitly different if that fact is important to preserve in your reporting, e.g. T1 = 34.45 +0.04 −0.05.

  4. T1 = (34.01, 34.4, 34.45, 34.51, 34.59), a five-number summary (https://en.wikipedia.org/wiki/Five-number_summary). Probably overkill in many cases, but it allows you to describe the shape of your data sample.

Compare baseline against itself to validate your comparison

It is useful to run the same experiment twice on the same unmodified baseline version of your software to calculate and compare your averages, or to apply your data sample comparison method of choice to see what it reports.

You know for sure that there must be no statistically significant difference between the thing and itself; so both measurements should not be distinguishable from each other. Right?

Well, if the comparison still reports these samples as clearly different, you know that either you have failed to reduce the background noise, you have a large systematic error hidden somewhere, or something else in your method is not right. Start over and find the source of noise before comparing the baseline and the contender.
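
One way to script such a sanity check, assuming SciPy is available, is a non-parametric test such as Mann-Whitney U applied to the two baseline samples; a small p-value for two runs of the very same software is a red flag.

from scipy.stats import mannwhitneyu

def indistinguishable(baseline_run1, baseline_run2, alpha=0.05):
    # True if the two samples cannot be told apart at significance level alpha.
    _, p_value = mannwhitneyu(baseline_run1, baseline_run2)
    return p_value >= alpha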

Assuming linear nature of processes you measure

To quote [3]: “A common assumption in textbook stats is that errors have a Gaussian distribution. If your system is linear, and a large number of unknowns contribute independently to the error, the total error will be approximately Gaussian thanks to the central limit theorem. However, if your system is highly non-linear (like most computer systems), all bets are off.”

Looking at your data sample and thinking about its nature is often enough to confirm the suspicion that it cannot be reflecting a linear behavior.

For one thing, a normally distributed value with a positive central point still admits negative values as meaningful. But a communication latency is by definition a positive number! Maybe a log-normal or Pareto distribution will be a better fit for your data?

For systems with any form of caching present, the distribution of the sample data will be bimodal or multimodal: most values will be grouped around the latency of the fast code path that hits the cache, while another, smaller group of values will be grouped around the slower cache-miss case.

In such cases, I have found that the median is more robust to the existence of outliers than the mean, and the median absolute deviation makes much more sense than the standard deviation. Your mileage may vary. Look at your data.
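
A minimal sketch of these more robust statistics:

import statistics

def median_abs_deviation(samples):
    # Median of absolute deviations from the median.
    med = statistics.median(samples)
    return statistics.median(abs(x - med) for x in samples)

samples = [1.01, 1.02, 1.00, 1.03, 9.7]  # one slow cache-miss outlier
print("mean:  ", statistics.mean(samples))    # dragged up by the outlier
print("median:", statistics.median(samples))  # barely affected
print("MAD:   ", median_abs_deviation(samples))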

Stumbling upon a linear process is more of an exception than the rule here.

Adding up values that do not make sense together

The first step of finding many types of averages is summation of data samples.

To quote [3] again, “the average is calculated by adding up all the values, so it’s only meaningful if adding up delays is meaningful in the first place”.

While it is strictly wrong (a type violation?) to add values of inconsistent dimensions (e.g., adding milliseconds to page faults), even values measured in the same units do not always have meaning when clumped together. Let’s start with a few everyday examples and complement them with computer-related ones.

  1. Average body temperature in a hospital. The sample population here is all patients, admitted to the hospital for a variety of unrelated reasons. The reason each patient has a given temperature is independent of why the others are there, so their average temperature does not characterize anyone in particular.
  2. The average speed of a trip is not calculated as a sum of speeds sampled at points during the trip. A higher speed means that a given distance is covered in less time, i.e., that its share of the total travel time is smaller. A harmonic mean is likely what you need (see the sketch after this list).
  3. Imagine you have a set of software application workloads (SPEC benchmarks come to mind). Summing up their execution times only makes sense if they are executed serially by the same customer. This is not how those programs are used in practice: not all customers run all of them, and they are likely started independently of each other.
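
To illustrate the speed example above: driving two legs of equal length at 20 and 60 km/h gives an average trip speed of 30 km/h, which is the harmonic mean, not the arithmetic mean of 40.

import statistics

speeds = [20, 60]  # km/h over two legs of equal distance
print(statistics.harmonic_mean(speeds))  # 30.0, the true average trip speed
print(statistics.mean(speeds))           # 40.0, which overstates it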

The summation loses information about individual components. Be sure that you do want to lose that bit of information. Otherwise, it is better to treat unrelated values separately.

Dividing by multimodal populations

In division, quotient = dividend / divisor.

It is easy to get a quotient that has no sensible physical meaning if the divisor has a multimodal distribution.

Let’s take an example: gross national product (GNP) per capita. The trouble with meaning starts with the GNP itself: it mixes all sorts of economic output from different industries. Dividing it by the population count implies that every person contributes to that value at the same rate. This is a very unsubstantiated assumption: poor, middle-income and rich people have very different contribution patterns.

Summing values together into a dividend makes sense if the individual entities behind those numbers belong to the same mode. The same applies to the entities whose values go into the divisor: they all must have the same meaning in the chosen model. If the divisor clumps two or more modalities together, e.g., “thin” and “fat” entities, then the result of the division will not be meaningful: it will not associate the quotient with either of the modes.

Summary of steps for carrying out measurements

  1. Write down what question your measurements are meant to answer.
  2. Write down the compared scenarios being measured, including the versions of applications under test, the input data fed to them, and the host configurations, with enough detail to be able to retrace your steps.
  3. Automate your experiments to exclude as much of manual processing as reasonable.
  4. Run the baseline experiment as many times as you can afford; 30 is usually enough.
  5. Look at the datapoints. Do they look normally distributed? Any unexplained outliers?
  6. Run the baseline experiment again.
  7. Compare two baseline results. Can your comparison algorithm tell them apart? If yes, you have too much leftover noise affecting your system. Reduce the noise and start over, collect more datapoints, and/or use more advanced statistical tests.
  8. Run the contender experiment.

More practical advice follows. Not everything below should be done unconditionally, but you should measure the effects of different sources of noise on your experiments. Nor is this list exhaustive. And some of the advice should be ignored when you do want to include a certain factor in the results.

  1. Do not run on a busy system. It should not have multiple users logged in; background tasks should be reduced to a minimum.
  2. Disable features for opportunistic performance, such as Intel Turbo Boost.
  3. Fix the frequency of your CPU. Usually the stable sustained frequency will be noticeably lower than what the marketing data for that CPU promises.
  4. Be mindful of whether you want to include I/O in the measurement.
  5. Pin your threads to logical CPUs to prevent both migration costs and jumps in timer values (see the sketch after this list).
  6. Verify assumptions about the host before and after the measurements. Was the system quiesced all the way through? How many page faults were served for your program? Were there any messages about thermal throttling in the kernel logs? Did any counters indicating unexpected concurrent activity increase? And so on. Be ready to throw a data sample away if it has been contaminated by any such unaccounted-for factors.
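
A sketch of how some of these steps can be scripted from within a Python harness on Linux; os.sched_setaffinity and resource.getrusage are standard-library calls, while the choice of CPU index and of which counters to check is up to your setup.

import os
import resource

# Pin the current process to a single logical CPU (Linux only); index 2 is arbitrary.
os.sched_setaffinity(0, {2})

def major_page_faults():
    return resource.getrusage(resource.RUSAGE_SELF).ru_majflt

faults_before = major_page_faults()
# ... run the measured workload here ...
faults_after = major_page_faults()
if faults_after > faults_before:
    print("Warning: major page faults during the run; consider discarding this sample.")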

Conclusions

Nobody said it would be easy. But it can be fun.

References

  1. https://github.com/google/benchmark/blob/main/docs/reducing_variance.md
  2. https://pyperf.readthedocs.io/en/latest/system.html
  3. https://theartofmachinery.com/2021/12/01/textbook_stats_and_tech.html

Written by Grigory Rechistov on 02.12.2025. Tags: statistics, performance, measurement, experiment.


Copyright © 2025 Grigory Rechistov