Account for varying CPU frequency more robustly #138

Open
cfallin opened this issue Jun 3, 2021 · 17 comments

Comments

@cfallin
Member

cfallin commented Jun 3, 2021

Most modern CPUs scale their clock frequency according to demand, and this CPU frequency scaling is always a headache when running benchmarks. There are two main dimensions in which this variance could cause trouble:

  • Varying frequency across time: if the CPU load of benchmarking causes the CPU to ramp up its frequency, then different benchmark runs could observe different results based on different CPU frequency.
  • Varying frequency across space: if different CPU cores are running at different frequencies, then benchmark runs might intermittently experience very different performance if they are not pinned to specific cores.

I've been seeing some puzzling results lately and I suspect at least part of the trouble has to do with the above. I've set my CPU cores to the Linux kernel's performance governor, but even then, on my 12-core Ryzen CPU, I see clock speeds between 3.6GHz and 4.2GHz, likely due to best-effort frequency boost (which is regulated by thermal bounds and so unpredictable).

Note that measuring only cycles does not completely remove the effects of clock speed, because parts of performance are pinned to other clocks -- e.g., memory latency depends on the DDR clock, not the core clock, and L3 cache latency depends on the uncore clock.

The best ways I know to avoid noise from varying CPU performance are:

  • Have longer benchmarks. Some of the benchmarks in this suite are only a few milliseconds long; this is not enough time to reach a steady state.
  • Interleave benchmark runs appropriately. Right now, it looks like the top-level runner does a batch of runs with one engine, then a batch of runs with another. If the runs for different engines/configurations were interleaved at the innermost loop, then system effects that vary over time would at least impact all configurations roughly equally.
  • Pin to a particular CPU core. For single-threaded benchmarks, this is probably the most robust way to have accurate A/B comparisons: if cores have slightly different clock frequencies, just pick one of them. Even better would be to do many runs and average across them all, but in high-core-count systems, removing this noise would take a lot of runs (hundreds of processes); with a few (5-10) process starts, it's entirely possible for the variance in mean core speed to be significant.
  • Observe the CPU governor settings when on a known platform (Linux: the /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor text file, which will usually contain ondemand; we want performance) and warn if scaling is turned on. (A minimal sketch of such a check follows this list.)
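
Here's roughly what that governor check could look like; this is just an illustrative sketch against Linux's sysfs layout, not existing sightglass code:

```rust
use std::fs;

/// Warn if any online CPU is not using the `performance` governor.
fn warn_on_frequency_scaling() -> std::io::Result<()> {
    for entry in fs::read_dir("/sys/devices/system/cpu")? {
        let path = entry?.path();
        let name = path.file_name().and_then(|n| n.to_str()).unwrap_or("");
        // Only look at per-core directories like `cpu0`, `cpu1`, ...
        if !name.starts_with("cpu") || !name[3..].chars().all(|c| c.is_ascii_digit()) {
            continue;
        }
        if let Ok(governor) = fs::read_to_string(path.join("cpufreq/scaling_governor")) {
            let governor = governor.trim();
            if governor != "performance" {
                eprintln!(
                    "warning: {name} uses the `{governor}` governor; benchmark results may be noisy"
                );
            }
        }
    }
    Ok(())
}

fn main() {
    if let Err(e) = warn_on_frequency_scaling() {
        eprintln!("could not check CPU governors: {e}");
    }
}
```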

Thoughts? Other ideas?

@cfallin
Member Author

cfallin commented Jun 3, 2021

Ah, one other simple thing we could do: add "instruction count" to our metrics, along with cycles and wallclock time. While there isn't necessarily a simple relationship between instruction count and performance (IPC can vary widely), it is at least a deterministic (or very nearly deterministic) measure for single-threaded benchmarks.
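
For what it's worth, wiring up an instruction counter on Linux is cheap; here's a hedged sketch using the perf-event crate (an assumption for illustration, not what sightglass currently uses):

```rust
use perf_event::Builder;

fn main() -> std::io::Result<()> {
    // By default the perf-event Builder counts retired instructions for the
    // calling thread (assumed from the crate's documented defaults).
    let mut counter = Builder::new().build()?;

    counter.enable()?;
    let total: u64 = (1..=1_000_000u64).sum(); // stand-in for the benchmarked work
    counter.disable()?;

    println!("{} instructions retired (sum = {total})", counter.read()?);
    Ok(())
}
```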

@fitzgen
Member

fitzgen commented Jun 3, 2021

Note that measuring only cycles does not completely remove the effects of clock speed, because parts of performance are pinned to other clocks -- e.g., memory latency depends on the DDR clock, not the core clock, and L3 cache latency depends on the uncore clock.

Yeah, the hope was that we could largely avoid CPU-scaling bias by measuring cycles instead of wall time. This is more of a mitigation than a solution, though, and apparently it isn't enough.

Have longer benchmarks. Some of the benchmarks in this suite are only a few milliseconds long; this is not enough time to reach a steady state.

I think we should remove all of the shootout benchmarks. They were useful for verifying that we had similar numbers to the old sightglass, which also used these benchmarks. But they are tiny microbenchmarks that don't reflect real-world programs, nor are they snippets of code on which we found Wasmtime/Cranelift lacking. I don't think they are useful anymore.

I would focus on just the markdown and bz2 benchmarks, for the moment.

Interleave benchmark runs appropriately. Right now, it looks like the top-level runner does a batch of runs with one engine, then a batch of runs with another. If the runs for different engines/configurations were interleaved at the innermost loop, then system effects that vary over time would at least impact all configurations roughly equally.

Yes, we should do this. It will require some refactoring of our parent-child subprocess interactions and a protocol between them instead of just "spawn and wait for completion".
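
Roughly the loop structure I have in mind, with hypothetical stand-ins for the real engine handles and the child-process protocol:

```rust
use std::collections::HashMap;

// Hypothetical stand-in for spawning a child process (or messaging an
// already-running one) and getting back a single measurement.
fn run_once(engine: &str, benchmark: &str) -> u64 {
    let _ = (engine, benchmark);
    0
}

fn main() {
    let engines = ["main.so", "feature.so"];
    let benchmarks = ["markdown", "bz2"];
    let iterations = 10;

    let mut samples: HashMap<(&str, &str), Vec<u64>> = HashMap::new();

    // Interleave at the innermost loop: every pass takes one sample from
    // every engine/benchmark pair, so time-varying system noise (thermal
    // throttling, background load) hits all configurations roughly equally.
    for _ in 0..iterations {
        for &benchmark in &benchmarks {
            for &engine in &engines {
                samples
                    .entry((engine, benchmark))
                    .or_default()
                    .push(run_once(engine, benchmark));
            }
        }
    }

    for ((engine, benchmark), values) in &samples {
        println!("{engine} / {benchmark}: {} samples", values.len());
    }
}
```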

Pin to a particular CPU core. For single-threaded benchmarks, this is probably the most robust way to have accurate A/B comparisons: if cores have slightly different clock frequencies, just pick one of them. Even better would be to do many runs and average across them all, but in high-core-count systems, removing this noise would take a lot of runs (hundreds of processes); with a few (5-10) process starts, it's entirely possible for the variance in mean core speed to be significant.

None of our benchmarks are single-threaded: we want to measure how well parallel compilation and such are helping (or not). I don't think we should invest time here.

Similar for measuring instruction count instead of cycles or wall time. I'd prefer to identify and mitigate sources of bias instead, so we can still measure the "actual" thing we care about (wall time) rather than something only loosely correlated with it.

Observe the CPU governor settings when on a known platform (Linux: the /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor text file, which will usually contain ondemand; we want performance) and warn if scaling is turned on

This would be great to support.

@fitzgen
Member

fitzgen commented Jun 3, 2021

Interleave benchmark runs appropriately. Right now, it looks like the top-level runner does a batch of runs with one engine, then a batch of runs with another. If the runs for different engines/configurations were interleaved at the innermost loop, then system effects that vary over time would at least impact all configurations roughly equally.

Filed a dedicated issue for this: #139

@fitzgen
Member

fitzgen commented Jun 3, 2021

Observe the CPU governor settings when on a known platform (Linux: the /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor text file, which will usually contain ondemand; we want performance) and warn if scaling is turned on

Filed a dedicated issue for this: #140

@cfallin
Member Author

cfallin commented Jun 3, 2021

Thanks for filing the issues! And sorry for dumping so many ideas at once :-)

None of our benchmarks are single-threaded: we want to measure how well parallel compilation and such are helping (or not). I don't think we should invest time here.

The execution phase is single-threaded, no?

Part of the issue is that the sampling isn't uniformly getting a mix of cores: depending on what else is going on in the background on the system, one or another core may be busy for a period of time, and the scheduler's affinity heuristics will make core assignments somewhat "sticky" for the main thread that invokes the Wasm in a given process. So one set of runs may land on a core running at, say, 3.6GHz while another set lands on a core at 4.2GHz; this introduces bias that is hard to get rid of, even if we bump the iteration count. (Or at least, that's part of what seems to be going on in my case.)

One way around this is to go completely in the other direction, and ensure we sample on all cores. Two particular changes might help: (i) for a given compilation, do a bunch of instantiations and executions, so we can take more samples relatively cheaply; (ii) explicitly bounce between cores. Think of this like another dimension of randomization (akin to ASLR)...

Thoughts?

@cfallin
Member Author

cfallin commented Jun 3, 2021

(For the core-bouncing, there seem to be a few crates that manipulate a thread's CPU affinity; we could e.g. spawn a thread for each known core, or just migrate a single thread.)
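
For example, with the core_affinity crate (one such crate), the single-thread-migration version could look like the sketch below; `take_sample` is a hypothetical placeholder for invoking the Wasm under test:

```rust
// Pin the current thread to each core in turn and take one sample per core,
// so no single core's clock speed dominates the results.
fn take_sample() -> u64 {
    (1..=100_000u64).sum()
}

fn main() {
    let core_ids = core_affinity::get_core_ids().expect("could not enumerate cores");
    let mut samples = Vec::new();

    for core_id in core_ids {
        // set_for_current returns false if pinning failed; skip that core.
        if core_affinity::set_for_current(core_id) {
            samples.push(take_sample());
        }
    }

    println!("collected {} samples across cores", samples.len());
}
```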

@fitzgen
Member

fitzgen commented Jun 3, 2021

One way around this is to go completely in the other direction, and ensure we sample on all cores. Two particular changes might help: (i) for a given compilation, do a bunch of instantiations and executions, so we can take more samples relatively cheaply; (ii) explicitly bounce between cores. Think of this like another dimension of randomization (akin to ASLR)...

At the libwasmtime_bench_api.so level, we can do (i) today. We haven't implemented support in sightglass-cli yet, however: #122

I think (ii) would also be great to have.

Are CPU governors generally scaling individual CPUs separately? I know big.LITTLE exists, but my understanding was that it was effectively discontinued, so AFAIK the only way we would generally see cores at different clock speeds would be CPU governors scaling individual CPUs separately.

@cfallin
Member Author

cfallin commented Jun 3, 2021

Ah, and one more thought: have we considered any statistical analysis that would look for multi-modal distributions (and warn, at least)? If we see that e.g. half of all runs of a benchmark take 0.3s and half take 0.5s, and the distribution looks like the sum of two Gaussians, it may be better to warn the user "please check settings X, Y, Z; you seem to be alternating between two different configurations randomly" than to just present a mean of 0.4s with a wide variance; the latter makes more sense only if we have a single Gaussian with truly random noise.
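
As a sketch of one possible heuristic (not necessarily the right statistical tool), Sarle's bimodality coefficient is cheap to compute and would flag the 0.3s/0.5s case above:

```rust
// Sarle's bimodality coefficient: values well above ~0.555 (the value for a
// uniform distribution) suggest the samples may come from more than one mode,
// e.g. two clock-frequency regimes. Just an illustrative heuristic, not
// something sightglass implements today.
fn bimodality_coefficient(samples: &[f64]) -> f64 {
    let n = samples.len() as f64;
    let mean = samples.iter().sum::<f64>() / n;
    let moment = |p: i32| samples.iter().map(|x| (x - mean).powi(p)).sum::<f64>() / n;
    let (m2, m3, m4) = (moment(2), moment(3), moment(4));
    let skewness = m3 / m2.powf(1.5);
    let excess_kurtosis = m4 / (m2 * m2) - 3.0;
    let correction = 3.0 * (n - 1.0).powi(2) / ((n - 2.0) * (n - 3.0));
    (skewness * skewness + 1.0) / (excess_kurtosis + correction)
}

fn main() {
    // Half the runs near 0.3s, half near 0.5s: clearly bimodal.
    let samples: Vec<f64> = (0..50)
        .map(|i| if i % 2 == 0 { 0.30 } else { 0.50 })
        .collect();
    let bc = bimodality_coefficient(&samples);
    if bc > 0.555 {
        println!("warning: samples look multi-modal (BC = {bc:.3}); check CPU scaling settings");
    }
}
```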

@fitzgen
Member

fitzgen commented Jun 3, 2021

We have #91 and talk a little bit about this in the RFC too. The idea is that we should be able to test whether samples are independent of process or iteration.

We don't have anything on file to explicitly check for multi-modal distributions / non-normal distributions, but that would be good to do as well.

@cfallin
Member Author

cfallin commented Jun 3, 2021

Are CPU governors generally scaling individual CPUs separately?

Yup! Here's the output from my desktop just now:

cfallin@xap:~$ grep MHz /proc/cpuinfo | sort -n
cpu MHz		: 1860.954
cpu MHz		: 1864.221
cpu MHz		: 1864.426
cpu MHz		: 1865.451
cpu MHz		: 1901.634
cpu MHz		: 2190.570
cpu MHz		: 2193.786
cpu MHz		: 2194.961
cpu MHz		: 2195.008
cpu MHz		: 2195.191
cpu MHz		: 2195.557
cpu MHz		: 2196.633
cpu MHz		: 2196.637
cpu MHz		: 2197.652
cpu MHz		: 2198.483
cpu MHz		: 2198.532
cpu MHz		: 2198.601
cpu MHz		: 2199.305
cpu MHz		: 2199.702
cpu MHz		: 2200.719
cpu MHz		: 2228.336
cpu MHz		: 2391.912
cpu MHz		: 2799.478
cpu MHz		: 2799.987

That's with ondemand; if I switch every core to performance, then:

cfallin@xap:~$ grep MHz /proc/cpuinfo | sort -n
cpu MHz		: 3541.040
cpu MHz		: 3544.273
cpu MHz		: 3547.992
cpu MHz		: 3549.782
cpu MHz		: 3550.331
cpu MHz		: 3551.234
cpu MHz		: 3561.568
cpu MHz		: 3564.328
cpu MHz		: 3565.276
cpu MHz		: 3568.220
cpu MHz		: 3568.678
cpu MHz		: 3578.373
cpu MHz		: 3592.133
cpu MHz		: 3592.622
cpu MHz		: 3593.216
cpu MHz		: 3594.283
cpu MHz		: 3597.494
cpu MHz		: 3599.688
cpu MHz		: 3623.135
cpu MHz		: 3634.126
cpu MHz		: 3782.142
cpu MHz		: 3899.312
cpu MHz		: 3933.426
cpu MHz		: 3976.012

Note that most are around 3.5GHz, but if you land on one of the last six, you're in for a (slightly faster) wild ride.

@fitzgen
Member

fitzgen commented Jun 3, 2021

Have longer benchmarks. Some of the benchmarks in this suite are only a few milliseconds long; this is not enough time to reach a steady state.

I think we should remove all of the shootout benchmarks. They were useful for verifying that we had similar numbers to the old sightglass, which also used these benchmarks. But they are tiny microbenchmarks that don't reflect real-world programs, nor are they snippets of code on which we found Wasmtime/Cranelift lacking. I don't think they are useful anymore.

cc @abrown: we talked about removing these before, after we had more benchmark programs, and after we verified that our results are roughly in the same range as old sightglass. AFAIK, those things are pretty much done (we can always add more benchmark programs, but we have a couple solid C and Rust programs). Do you feel okay about removing the shootout benchmarks at this point?

@abrown
Collaborator

abrown commented Jun 4, 2021

(Sorry, coming in a bit late to the conversation.) @cfallin, I always sort of took the precision crate measurements with a grain of salt: I didn't really trust the "measure the CPU frequency, then execute" model, but it seemed OK for micro-benchmarks like shootout and it was a mechanism already in use in the project. I added --measure perf-counters thinking that perf would be more reliable. It includes instructions retired, which you may find useful, and it also includes cycles -- now, I haven't looked into this, but if any project has correctly implemented cycle measurements given CPU frequency scaling, I would think it would be perf. You might also find the cache accesses and misses helpful. Can you try out --measure perf-counters?

@fitzgen, re: removing the shootout benchmarks, I'm conflicted. Your comment that they don't reflect real-world programs is accurate, but I'm pretty sure @jlb6740 is still using them occasionally, and in terms of analyzability, it is a lot easier to figure out what is impacting performance in small programs. And they're already all set up. What about adding a "manifest" feature to the CLI, so that when no benchmark is specified, only the benchmarks in the manifest are run?

@cfallin
Member Author

cfallin commented Jun 4, 2021

Re: perf counters (thanks @abrown), and also instruction counts as discussed here:

Similar for measuring instruction count instead of cycles or wall time. I'd prefer to identify and mitigate sources of bias instead, so we can still measure the "actual" thing we care about (wall time) rather than something only loosely correlated with it.

It may merit a separate issue to discuss and work out any implications, but I think I should offer my perspective as a benchmark user trying to tune things: instruction counts are extraordinarily useful too, and I don't think we should shy away from them so strongly, or at least, I'd like to offer their merits for (re)consideration :-)

Specifically, the two useful properties are determinism and monotonicity. Determinism completely sidesteps all of these measurement-bias issues: in principle, it should be possible to measure an exact instruction count for a single-threaded benchmark, and get the same instruction count every time. This saves a ton of headache. Monotonicity, i.e. that a decrease in instruction count should yield some decrease in runtime, is what allows it to be a fine-grained feedback signal while tweaking heuristics and the like. In other words, the slope (IPC) is variable but gradient-descent on instruction count should reduce runtime too. This was fantastically useful especially while bringing up the new backends last year. FWIW this has always been my experience in any performance-related research too: clean metrics from deterministic models allow for more visibility and much more effective iteration, and "end-user metric" measurements are useful mostly to report the final results.

The swings I'm trying to measure with regalloc2 are so far coarse-grained enough that I've been able to use wallclock time, but I guess I just want to speak up for this (instruction-counting) usage pattern and make sure it remains supported too!

@fitzgen
Member

fitzgen commented Jun 4, 2021

@abrown

it is a lot easier to figure out what is impacting performance in small programs.

This is undoubtedly true, but my response would be... does it matter? As we already discussed, they don't reflect real programs that we actually care about.

@cfallin

but I guess I just want to speak up for this (instruction-counting) usage pattern and make sure it remains supported too!

Definitely we shouldn't remove support for instruction counting! I still think our default should be cycles.

@fitzgen
Member

fitzgen commented Jun 4, 2021

We don't have anything on file to explicitly check for multi-modal distributions / non-normal distributions, but that would be good to do as well.

Filed #142 for this.

@jlb6740
Collaborator

jlb6740 commented Jun 9, 2021

@cfallin Hey guys, just checking out this issue and starting to read through. When you set the scaling governor to ondemand, are you also setting scaling_min_freq and scaling_max_freq? Typically, when doing analysis where I need a baseline run, I set the governor to "userspace" if available, but more importantly set the scaling frequencies to something like 1GHz. I also pin to a core, as you say, and do runs first with hyperthreading turned off, then with it turned on.
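
For reference, a rough sketch of pinning the frequency that way programmatically (Linux only, needs root; the 1GHz value and the sysfs writes are just an illustration of the settings described above):

```rust
use std::fs;

/// Clamp every core's scaling bounds to a single frequency (in kHz).
fn pin_frequency_khz(khz: u32) -> std::io::Result<()> {
    for entry in fs::read_dir("/sys/devices/system/cpu")? {
        let path = entry?.path();
        let name = path.file_name().and_then(|n| n.to_str()).unwrap_or("");
        if !name.starts_with("cpu") || !name[3..].chars().all(|c| c.is_ascii_digit()) {
            continue;
        }
        let cpufreq = path.join("cpufreq");
        // `userspace` may not be available on every system; treat the governor
        // write as best-effort and fall back to just clamping the bounds.
        let _ = fs::write(cpufreq.join("scaling_governor"), "userspace");
        fs::write(cpufreq.join("scaling_min_freq"), khz.to_string())?;
        fs::write(cpufreq.join("scaling_max_freq"), khz.to_string())?;
    }
    Ok(())
}

fn main() {
    // 1 GHz, expressed in kHz as cpufreq expects.
    if let Err(e) = pin_frequency_khz(1_000_000) {
        eprintln!("failed to pin CPU frequency: {e}");
    }
}
```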

@cfallin
Member Author

cfallin commented Jun 9, 2021

@jlb6740 thanks for the suggestions! I was running tests with the performance governor actually; I didn't know about userspace but that combined with explicit frequency settings would make even more sense. I'll add a note about this in #140.
