Atakua's Diary

Elusive Simulation Performance

If it is fast and ugly, they will use it and curse you; if it is slow, they will not use it. — David Cheriton

Introduction

How hard can it be to make a program work fast? The basic principle of software optimisation is a well known cycle: write it — measure it — profile it — rewrite hot spots — remeasure. Many man-months of software development are being invested into optimizing existing software by going over this cycle over and over again [2].

To squeeze the most possible performance from availalbe CPU cycles is a must for many businesses that rely on software: scientific high performance computing, high frequency trading, database software, and many more. Software simulation is no exception. As it is used to write, study and debug target software for systems before actual silicon is available, speed is essential — more runs mean more bugs can be fixed in pre-silicone stage. Products can be shipped earlier and be of better quality.

Approaches to optimization of simulation are not widely described in literature. However, the very nature of simulation algorithms and structure of data they process dictates completely different mindset to be used, compared to what one needs when looking at, say, a scientific physical simulation.

In this article, I will talk about fundamental limitations of what humans and machines can and cannot do about simulation performance. It will be shown that even defining what has to be optimized is tricky in the case of software simulation. I will try to show that, instead of focusing on traditional approaches such as improving raw bandwidth metrics and attempting to parallelize program as much as possible, one needs to focus on how well target software lends itself to virtualization and what can be done about it both from simulator and target perspective.

Code is Also Data

Classic targets for performance optimization come from the world of high performance computing (HPC). A lot of reasearch and development has been done on algorithms, software and hardware to make sure these there are no inefficient data processing, wasted CPU cycles, or false data dependencies.

The oldest benchmarks created for computer systems, such as High Performance Linpack, are focused on stressing computer hardware/software combinations by solving mathematical problems defined on huge matrixes. These benchmarks are widely critisized as unrepresentative of real world applications.

It is especially true when one wants to understand performance problems of functional simulators. The core difference, lays in the nature of workloads — actual data that programs have to proccess.

In the case of HPC problems, workload is static and rigid. A matrix is just a rectangle array of numbers. Algorithmic relations between all components are known in advance, before running a program. Respective software can be optimized to reflect data dependencies in an optimal way, e.g., by putting hottest data to caches, processing independent flows in parallel etc. Once applied, chosen optimizations will work equally well to all possible combinations of input data.

The picture is different for a simulation, because its workload is a program by itself. It is unknown what kinds of data dependencies will be present in the next workload. Are there frequently used data that should be cached? Are target threads really independent, so that they can be run in parallel? One can never tell before actually running it. When put like this, a simulator developer is tasked to apply unique optimizations for every possible input program.

Measuring Performance the Right Way

OK, maybe the task of speeding up a simulator is hard. Yet, it is not necessary to always solve the most general problem. Even having partial solutions to most common cases is often good enough.

To define success, one has to be able to measure it. Surely there are universal performance metrics applicable to simulators? Unfortunately, it is not so. I will show that common definitions of “speed” are flawed when applied to area of software simulation.

MIPS

HPC benchmarks often report values in FLOPS — floating point operations per second. A similarly-spirited metric exists for functional simulators — MIPS, average number of millions of simulated instructions executed per one real second. It is a very simple and intuitive metric. The problem is — it does not reflect what actually happens inside an execution-driven simulator.

Why FLOPS metric is good for HPC problems? Because it is known in advance how many elementary operations it would take to finish computation. For many mathematical problems it can be estimated from its algorithmical complexity. Any optimization that increases FLOPS would decrease waiting time until the program prints out its results.

For workloads executed inside a simulator, it is almost never the case. Certain algorithms are not meant to ever be finished (strictly speaking, they could not even be called algorithms — they are processes). Other algorithms even perform better when there are planned periods of idleness.

Let me give specific examples.

An operating system process scheduler task is to pass control between individual programs. Instead of consuming as much computational resources with highest MIPS achievable, it should do its work fast, yield control and return to its starting point. If it mattered how many MIPS it should demonstrate, the scheduler should have consisted of an endless loop without any data being processed in it.
A pair of parallel processes working with a resource guarded by a single lock. If one of them discovers that the resource is locked by a concurent thread, its best strategy is to become idle waiting for a notification about the resource becoming unlocked. An alternative — a busy loop continuosly polling for a lock status — will definitely increase observed MIPS, but will not decrease the time until completion.

There are many examples when performance of a cyclic process does not correlate with how many elementary operations are executed in a second. Any client-server system (web server, database server) is such an example.

Note that for trace-driven simulations, compared to execution-driven above, the size of input trace is fixed and known in advance. Because of the static nature of workload, MIPS metric is useful for them. However, for any process that has loops with unbounded numbers of iterations, performance cannot be estimated from the tempo of execution of individual elementary operations.

Virtual Time

The first caveat that a functional simulator user should learn is: “Do not trust durations of virtual time it gives you”. By design, a functional simulator is meant to relfect correct behavior of target programs, not how long they would take to finish.

To the same extent simulated (virtual) time is dangerous metric for performance. Let me illustrate that first with an extreme case, and then with real-world example.

Suppose there is a situation in a target system that caused all its processors to halt and all pending interrupts to cease. A good simulator will detect it as a period of idleness, and, instead of burning cycles in a loop (with high MIPS!), it should advance its virtual time until the next point when something useful is planned to happen. But nothing will ever happen inside this simulation — all processors are “dead”. Technically, the virtual time should become equal to plus infinity — the end of times has been reached. Was the simulation fast? Yes. Was it useful? No.

The example above is an extreme case. It is easily discoverable by a human by its symptoms. In reality, things are much more subtle and hard to debug. I once observed a situation inside a multi-core target system. There was a bug in MWAIT instruction implementation. Target processors could not correctly communicate though shared memory. A target operating system reacted to this by putting cores to sleep. In its turn, simulator detected idleness and advanced virtual time until the point when next message had to be served, only for it to be lost because of the bug. Interestingly, sometimes a message did make it through from one core to another, waking it up for a short duration.

The simulation looked like it was doing something useful. Judging from virtual time, it was doing it really fast. Similarly, MIPS were sky-high. In reality, target was barely useful.

This is the worst scenario to debug: a correctness problem looking like a performance problem. One can work on improving simulation speed to no avail only to realize that the target program or simulation engine was not correct from the start.

Notion of Computation Progress

Both MIPS and virtual time are popular metrics to judge about simulation performance. But we saw that neither number of simulated instructions nor number of simulated seconds strictly correlate with what we wanted them to be — a notion of progress.

It would be great to have a universal definition of progress: a function that monotonically increases when “useful” work is done by a program. It has to be applicable to an arbitrary algorithm or process. Having this metric, a simulator may be programmed to calculate and display it. Then programmers may be tasked with rewriting the simulator to improve it.

So all we need is for a simulator, which is a program, to analyse target system, which is another program, and draw certain conclusions about its state.

Unfortunately, this task falls under Rice theorem that states that semantic properties of programs are undecidable. Solving it is essentially equivalent to solving the Halting Problem. It has been shown in 1936 by Alan Turing that it is unsolvable.

Sad news. But can something be done about it at least for simple cases? Surely yes. For example, Simics [1] hypersimulation algorithm is capable of optimizing pieces of workloads that can be automatically proven to lead to no progress. Typically such piece of a code will attempt to busy wait on certain combination of values in shared memory.

The reasoning is simple: if nothing happens during execution of a code block (no state changes, no messages sent etc.), then the virtual time and simulated state can be advanced to the point when something new happens (another processor is scheduled for simulation, a simulated interrupt arrives etc)

The example below (written in C for simplicity, though a simulator will have to deal with machine code of course) can be optimized by switching to simulation of another CPU, which in turn may change value of flag to be non-zero.

~~~~~ int flag = 0; / flag is located in regular memory / while (flag != 1) { ; / busy wait / } ~~~~~

The next case however, looks very similar but is much harder to prove to be safe to optimize:

~~~~~ volatile int flag = 0; / platform device-backed register / while (flag != 1) { ; / busy wait / } ~~~~~

Here flag is volatile, meaning it is stored not in regular memory, but on top of a device register. A read or write to it may have arbitrary effects like sending a network packet, writing a symbol to display. It cannot be safely omitted from simulation and thus can not be optimized away.

Parallelization — not even Remotely Silver Bullet for Simulation

It is sometimes believed that, in order to speed up any program, it has to be parallelized. Sometimes it is even considered to be the first and only remedy to dicovered performance issues. In reality, this is not the case even for HPC applications. Certainly it is not for simulators.

Any algorithm has limited amount of parallelism inherently present in it. Roughly speaking, it is defined by how many independent pieces of data can be processed at any given moment of time. It does not make sense to have more parallel threads in a program to only be idle or blocked waiting for each other.

For applications with regular data accessing patterns, finding ways to parallelize is a static task. Even in HPC area, it is not an easy feat: parallel programming is a notoriously hard for humans. Neither it is easy for machines. Universal principles of automatic rewriting of programs so that they expose more parallelism is a long-lusted Holy Grail of autoparallelizing compilers since 1990’s, yet to be achieved.

In the case of simulation, the dynamic nature of its workloads does not help parallelization. One cannot expect a simulator to magically understand whether two pieces of code can be safely executed in parallel, without actually executing them and analysing results.

Let me reiterate that doing many things in parallel does not always mean doing useful things. An example: a thread busy waiting for another thread to release a semaphore just wastes CPU cycles.

Besides that, there is a number of problems specific to hardware simulation that puts a limit on what can/should be executed in parallel.

Target memory consistency protocols restrict what/when shared memory accesses can be done, and in what order. Maintaining proper order of memory accesses is especially important when host/target systems expect different protocols.
Simulated peripheral device models (such as hard disks and network cards) are typically accessed serially, meaning they become serialization bottlenecks for workloads that expect input-output (I/O) traffic.
Simulation determinism can be destroyed if parallel synchronization is done in a non-deterministic manner.

There are types of workloads that generally lend themselves to parallelization better than other. Below are typical workloads of a functional simulator with explanation whether they can benefit from parallelization of a simulator.

Booting system through EFI/BIOS firmware — not well. Target code is mostly single-threaded, and there is heavy I/O.
Booting operating systems — it depends, but generally not. There is still a lot of device probing taking place. As disk caches are mostly empty, and a lot of processes start up, a lot of page faults lead to heavy disk I/O.
Compute intensive workloads, such as HPC benchmarks — yes. As long as a program is well-written and does not demonstrate significant I/O activity, it might be effectively run in parallel inside a simulator. This is where Simics Multicore Accelerator feature is able to kick in.
Multiple independent target systems connected with high-latency links — yes. Targets are mostly independent, and messages between them can be aggregated to larger blocks which are delivered in a more regular manner. This is when Simics Multimachine Accelerator is applicable.

Before trying to parallelize a simulation workload, other known techniques for speeding it up should be attempted, as they are typically easier, simpler and provide better results. I want to support this statement with a real-world example.

Recently, we were approached by customers that wanted to split their simulation in two halves and then run them in parallel, as they were concerned about poor performance in single-thread. Their target system consisted of two tightly coupled parts, sending a lot of traffic over a supposed place of a split. Splitting target at that point would have required a lot of engineering effort. However, there would probably have been no speed up whatsoever obtained if such attempt was made. Fortunately, we did not have to finish it, because a better way was found.

After a long discussion and several debugging sessions performed by several support engineers, it has been discovered that the target configuration contained a bug that limited performance of a part of the target. Fixing it provided enough speed up for the single-threaded simulation so that no further parallelization was necessary.

Simulation-specific Aspects

It also can be observed that correctness problems of target software are often misinterpreted as performance problems of a simulator.

As an example, there was a report of our simulator being very slow on a certain workload within a range of virtual time. Upon closer inspection it was discovered that target software repeatedly attempted to access a memory section where no device was mapped. In this situation, the simulator was configured to silently return value 0xff for any byte read from that region. But it also hindered the most efficient mode of simulation — VMP.

After making sure the target algorithm indeed does everything right, we can finally turn our attention to low-level problems of simulating individual basic operations. The common case here is that operations that are cheap on real hardware become costly inside a simulator. This is no surprise, given that what is done in real hardware is performed in software inside simulation. What may come as surprise is that certain operations considered to be very fast in hardware are much more slower when simulated. I will give several examples of seemingly innocuous instructions to obtain noticeable latency when run inside a virtual machine, such as Simics VMP mode.

PAUSE instruction is basically a hint on real hardware. However, inside a simulation, it causes a costly VM-exit, followed up by VM-entry.
RDTSC is an instruction used to read one of system timers. Similarly to PAUSE, it is meant to be fast on real hardware, but causes VM-exits when it is placed inside a virtual machine.
VMREAD/VMWRITE are used to control state of a virtual machine based on Intel® VT-x hardware extensions. They are typically encountered in tight groups, and as such cause an avalanche of costly VM-exits.

For every problematic instruction, a specific solution has to be invented. Sometimes new architectural extensions present on future host hardware do help with reducing handling costs, but often there is no simple way to compensate for them.

Conclusions

Now it should be clear that one’s view on software simulation should not be simplified to simulator crunching a set of data. Rather, it should be regarded as a cooperation between a target program and a simulation engine.

How fast this tandem works depends on both sides of the equation. It is unreasonable to expect huge and universally applicalble speed gains from only changing the simulator. It is just as important to understand what target software is up to in every particular case.

When you see a simulation taking too long to finish, do not immediately focus on underlying execution engine. Instead, think — is target computation doing the right thing?

References

Daniel Aarno and Jakob Engblom. Software and System Development using Virtual Platforms, 1st Edition. 2014
Brendan Gregg. Systems Performance: Enterprise and the Cloud. 2013