Massive performance regression between `nightly-2022-08-12` and `nightly-2022-08-13` #102952

Robbepop · 2022-10-12T09:21:43Z

Usually I develop on the stable channel and wanted to see how my project performs on the upcoming beta or nightly channels. I saw performance regressions of 30-80% across the board on native and Wasm targets. Those regressions were confirmed by our benchmarking CI, as can be seen in the link.

Was it the LLVM 15 Update?

I conducted a bisect and found that the change happened between nightly-2022-08-12 and nightly-2022-08-13.
After short research I saw that Rust updated from LLVM 14 to 15 in exactly this time period: #99464 Other merged commits in this time period were not as suspicious to me.

Past Regressions

Unfortunately, this is also not the first time we have seen such massive regressions ...
It is extremely hard to craft a minimal code snippet out of wasmi since it is a very heavily optimized codebase with lots of interdependencies.
Unfortunately the wasmi project is incredibly performance critical to us. Even a 10-15% performance regression is a disaster to us, let alone the 30-80% we just saw ...

Hint for Minimal Code Example

I have one major suspicion: Due to missing guaranteed tail calls in Rust we are heavily reliant on a non-guaranteed optimization for our loop-switch based interpreter hot path that pulls jumps to the match arms, which results in code very similar to what a threaded-code interpreter would produce. The code that depends on this particular optimization can be found here.
This suspicion is underlined by the fact that especially non call-intense workloads show most regressions in the linked benchmarks. This implies to me that the regressions have something to do with instruction dispatch.
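
For readers unfamiliar with the pattern, a loop-switch interpreter has roughly the following shape. This is only a minimal sketch with a hypothetical toy instruction set (`Instr` and `run` are made-up names), not wasmi's actual code. The optimization in question would duplicate the dispatch jump into each match arm instead of jumping back to the single `match` at the top of the loop; whether that happens is entirely up to the optimizer.

```rust
// Hypothetical toy instruction set, for illustration only.
#[derive(Clone, Copy)]
enum Instr {
    Push(i64),
    Add,
    Mul,
    Halt,
}

// Loop-switch dispatch: a single `loop` around a single `match`.
// Returns `None` on stack underflow or when running off the end of `code`.
fn run(code: &[Instr]) -> Option<i64> {
    let mut stack: Vec<i64> = Vec::new();
    let mut ip = 0usize;
    loop {
        match *code.get(ip)? {
            Instr::Push(value) => stack.push(value),
            Instr::Add => {
                let rhs = stack.pop()?;
                let lhs = stack.pop()?;
                stack.push(lhs.wrapping_add(rhs));
            }
            Instr::Mul => {
                let rhs = stack.pop()?;
                let lhs = stack.pop()?;
                stack.push(lhs.wrapping_mul(rhs));
            }
            Instr::Halt => return stack.pop(),
        }
        // Every non-halting arm falls through to here and then back to the
        // shared dispatch point at the top of the loop.
        ip += 1;
    }
}
```

With a real interpreter the `match` has over a hundred arms, which is where the codegen becomes sensitive to small changes.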

Potential Future Solutions

The Rust compiler could add a few benchmarks concerning those loop-switch optimizations to its set of benchmarks so that future LLVM updates won't invalidate those optimizations. I am not sure how viable this approach is to the Rust compiler developers though. Also this only works if we find all the fragile parts that cause these regressions.
Ideally, Rust would offer abstractions that allow developing efficient interpreters without relying on non-guaranteed Rust/LLVM optimizations: for example, guaranteed tail calls.

Reproduce

The current stable Rust channel is the following:

stable-x86_64-unknown-linux-gnu (default)
rustc 1.64.0 (a55dd71d5 2022-09-19)

In order to reproduce these benchmarks do the following:

git clone git@github.com:paritytech/wasmi.git
cd wasmi
git checkout 21e12da67a765c8c8b8a62595d2c9d21e1fa2ef6
rustup toolchain install nightly-2022-08-12
rustup toolchain install nightly-2022-08-13
git submodule update --init --recursive
cargo +stable bench --bench benches execute -- --save-baseline stable
cargo +nightly-2022-08-12 bench --bench benches execute -- --baseline stable
cargo +nightly-2022-08-13 bench --bench benches execute -- --baseline stable


the8472 · 2022-10-12T19:40:16Z

That commit doesn't compile

error: couldn't read crates/wasmi/benches/wasm/wasm_kernel/res/revcomp-input.txt: No such file or directory (os error 2)
  --> crates/wasmi/benches/benches.rs:18:30
   |
18 | const REVCOMP_INPUT: &[u8] = include_bytes!("wasm/wasm_kernel/res/revcomp-input.txt");
   |                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   |
   = note: this error originates in the macro `include_bytes` (in Nightly builds, run with -Z macro-backtrace for more info)

apiraino · 2022-10-13T12:25:57Z

Meanwhile I'll nominate this issue for T-compiler discussion, I think it's a wider topic that benefits from comments of the team. Wg-prioritization discussion on Zulip

@rustbot label i-compiler-nominated

Robbepop · 2022-10-13T13:33:06Z

@the8472 I am sorry, I forgot to mention that you need to initialize git submodules before running benchmarks.
Execute

git submodule update --init --recursive

before running the benchmarks. I will update my post above to include this information.

Robbepop · 2022-10-15T09:49:09Z

I don't know if this is related but today I removed 4 unnecessary instructions from the wasmi interpreter. My expectation was that the removal shouldn't change performance at all. However, there were massive regressions again. This time I took the cargo-asm tool to analyze the Executor::execute function I linked earlier in this thread just to see massive differences between the master branch and the PR branch:

master: https://gist.github.com/Robbepop/cde5a25f00b78259a11170a6614aca90
- Roughly 3.8k lines of assembly
master + nightly: https://gist.github.com/Robbepop/a498308cd53c7e75c55f8342786f667d
- Roughly 4.1k lines of assembly
PR: https://gist.github.com/Robbepop/9df0eb661c5fcab9aa8391221fda7196
- Roughly 5.9k lines of assembly

The diff between master on stable Rust and master on nightly Rust:
https://gist.github.com/Robbepop/88660d17ec1ede77562732bb68670c8c

So the culprit does indeed seem to be that Rust + LLVM easily fails to properly optimize this function using the threaded-code style branching technique when the stars are misaligned.

Reproduce

The PR that I created today can be found here: wasmi-labs/wasmi#518
In order to make cargo-asm able to display the function I had to insert the following code:

// wasmi/src/engine/executor.rs
pub trait Execute: Sized {
    fn execute_dummy(self) -> Result<CallOutcome, TrapCode>;
}

impl<'ctx, 'engine, 'func> Execute for Executor<'ctx, 'engine, 'func, ()> {
    fn execute_dummy(self) -> Result<CallOutcome, TrapCode> {
        self.execute()
    }
}

And also add

[profile.release]
lto = "fat"
codegen-units = 1

to the Cargo.toml of the workspace since cargo-asm does not yet support the --profile argument and we want to optimize with codegen-units=1 and lto="fat".

Install the cargo-asm tool via: cargo install cargo-show-asm.
Run the cargo-asm tool via: cargo asm -p wasmi --lib --release execute_dummy > execute_dummy.asm
We need to pipe it into a file since the output is quite large.

the8472 · 2022-10-15T11:28:46Z

This indeed does look like the difference between tail calls vs. dispatching from a loop with computed jumps.

hottest instruction in the fast version. note the jmp *%rax

hottest instruction in the slow version. note all the jumps back to 24c0 and the incoming jumps at 24c5:

apiraino · 2022-10-22T11:23:16Z

WG-prioritization assigning priority (Zulip discussion).

Discussed by T-compiler (notes), a smaller reproducible would probably help:

what tools LLVM has to minimize bug reports. Perhaps there's something that can be leveraged to extract just the IR from the interpreter loop function and dependencies?

As mentioned in the opening comment, a bisection seems to point at the LLVM15 upgrade (#99464), specifically between nightly-2022-08-12 and nightly-2022-08-13.

@rustbot label -I-prioritize +P-high E-needs-mcve -I-compiler-nominated

Robbepop · 2022-10-22T13:31:11Z

@apiraino Thanks a lot for the update and bringing this issue up at the T-compiler meeting!

I am working on an MCVE. However, as stated above, I am unsure whether an MCVE is actually that useful, since a fundamental problem is that the code in question is extremely sensitive to changes, as brought up in this comment where I demonstrated a significant performance regression by removing 4 variants from an enum.

I will update this thread once I am done with (or gave up on) the MCVE.

Summary of the underlying Problems

Please take what follows with a big grain of salt since I am no compiler or LLVM expert. 😇

The virtual machine wasmi that I am working on suffers from severe performance regressions since it relies on non-guaranteed optimizations done by Rust and LLVM for loop-switch constructs.
Users of wasmi critically depend on its performance and also on the reliability of its performance. To those users this virtual machine acts as the system. Therefore wasmi can be seen as "system level software" in this context.
Changing even small details about the loop-switch constructs easily changes the whole codegen, which might change the performance profile dramatically. In a comment above I demonstrated this effect by removing 4 variants of a large enum.
Even if we were to add performance regression tests to the Rust performance test suite, we would have no guarantee that those tests cover all of the potential pitfalls, since those optimizations and their heuristics are not guaranteed at any abstraction level. Therefore, any MCVE that might work today to prevent this regression might simply stop working tomorrow without warning.
Therefore, if we want to fix this problem fundamentally, we need a way to provide Rust users with more fine-grained control over how those loop-switch constructs are going to be interpreted by optimizers OR introduce new abstractions to Rust that provide us with a similar level of control over our control flow.

One Potential Solution

What follows now might be somewhat controversial ...

Given the summary above it seems to me that Rust lacks abstractions that allow us to take precise and explicit control over control flow constructs that are required to meet performance criteria for our system level software. Having proper guaranteed tail call abstractions in Rust is one possible solution to this problem and would completely eliminate this issue since it would provide us with sufficient amount of control over the control flow constructs.

In the past this was brought up by the author of the Wasm3 interpreter, which claims to be the fastest WebAssembly interpreter and is written in C, when asked why it was not written in Rust. They are using threaded-code dispatch.
Other users tried to emulate performance of Wasm3 using safe Rust abstractions. While this approach is really interesting it does not even come close to Wasm3 performance unfortunately and has other major downsides: https://github.com/Neopallium/s1vm
In fact guaranteed tail calls have been mentioned quite often in the context of Rust already: https://www.reddit.com/r/rust/comments/my6k5i/are_we_finally_about_to_gain_guaranteed_tail/

I think the general misunderstanding of guaranteed tail calls in Rust is that people think they are just more elegant than other solutions but in reality they provide us with more precise and explicit control over our control flow constructs than what is currently possible with Rust which results in more optimal code and optimizations while using safe Rust.
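
To make the control-flow argument concrete, here is a minimal sketch of the threaded-code style that guaranteed tail calls would make reliable. All names (`Op`, `Handler`, `next`, `op_push`, ...) are hypothetical and not taken from wasmi or Wasm3. Each handler ends by calling the next handler in tail position. Note that current stable Rust does not guarantee that these calls are compiled as jumps, so the native stack may grow with the number of executed instructions; that missing guarantee is exactly the gap described above.

```rust
// Hypothetical toy design: every instruction carries its own handler pointer.
type Handler = fn(&mut Vec<i64>, &[Op], usize) -> Option<i64>;

struct Op {
    handler: Handler,
    imm: i64, // immediate operand, only used by `op_push`
}

// Dispatches to the handler of the instruction at `ip`.
fn next(stack: &mut Vec<i64>, code: &[Op], ip: usize) -> Option<i64> {
    let op = code.get(ip)?;
    (op.handler)(stack, code, ip)
}

fn op_push(stack: &mut Vec<i64>, code: &[Op], ip: usize) -> Option<i64> {
    stack.push(code[ip].imm);
    // Call in tail position. Without guaranteed tail calls the compiler
    // is free to emit a regular call here instead of a jump.
    next(stack, code, ip + 1)
}

fn op_add(stack: &mut Vec<i64>, code: &[Op], ip: usize) -> Option<i64> {
    let rhs = stack.pop()?;
    let lhs = stack.pop()?;
    stack.push(lhs.wrapping_add(rhs));
    next(stack, code, ip + 1)
}

fn op_halt(stack: &mut Vec<i64>, _code: &[Op], _ip: usize) -> Option<i64> {
    stack.pop()
}

fn run(code: &[Op]) -> Option<i64> {
    let mut stack = Vec::new();
    next(&mut stack, code, 0)
}
```

In this style there is no central `match` for the optimizer to restructure: the dispatch is an indirect call at the end of every handler, so the shape of the control flow is chosen by the programmer rather than by heuristics.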

Robbepop · 2022-10-23T14:08:31Z

WIP MCVE: https://github.com/Robbepop/rustc-regression-102952/blob/main/src/lib.rs

This is not yet a real MCVE but we can use it as a starting point and it also has inferior performance on nightly compared to stable. I am not entirely certain that it covers the same issues as wasmi does.

Benchmark using

cargo test --release benchmark -- --nocapture

The outputs I receive are:

stable: time: 363.410688ms
nightly: time: 431.440033ms

Which demonstrates a moderate 18% slowdown. Since wasmi slowdown was up to 80% I bet that we can do better with the MCVE ... Also if we use the same profile as wasmi (codegen-units = 1, lto = "fat") we see that both stable and nightly have similar performance. So this MCVE is either not a fit or it needs improvement. Also it is not minimal atm.

Robbepop · 2022-10-24T08:17:48Z

One thing I just noticed when comparing master.asm and pr.asm as linked in this posting is the occurrence of .p2align throughout the files. On master.asm we have 4 of those in total whereas there are countless in the pr.asm. May this be part of the overall problem or at least be a signal to the root cause of the performance regression seen in wasmi-labs/wasmi#518?

The PR in question simply removes 4 variants of an enum.

It might very well be that this regression is not even connected to the initial regression I reported in this issue and demonstrates yet another performance pitfall that we should maybe have a separate issue for. I would be thankful if someone could answer this.

pacak · 2022-10-27T00:59:38Z

> to the Cargo.toml of the workspace since cargo-asm does not yet support the --profile argument and we want to optimize with codegen-units=1 and lto="fat".

It does now. Also feel free to make a ticket if something is missing.

Robbepop · 2022-10-27T06:26:12Z

> > to the Cargo.toml of the workspace since cargo-asm does not yet support the --profile argument and we want to optimize with codegen-units=1 and lto="fat".
>
> It does now. Also feel free to make a ticket if something is missing.

Hi @pacak Thanks for implementing the feature on cargo-asm. But look again who created the ticket back then: pacak/cargo-show-asm#63 👯

pacak · 2022-10-27T11:32:39Z

> But look again who created the ticket back then:

Right, that's mostly for other people who might be following the steps :)

Robbepop · 2023-02-13T15:41:30Z

Any news on this?

I have the strong feeling that I encountered this performance regression bug today again in this wasmi PR:
wasmi-labs/wasmi#676

The PR does not change the number of instructions but merely changes the implementation of a handful of the over a hundred instructions that are part of the big match expression. However, this unexpectedly leads to massive performance regressions of up to 200%. Note that benchmarks that do not execute the changed instructions are affected as well.

I used cargo-show-asm to display the differences between master branch and the PR:

edit: I was able to fix the performance regression. My idea was that it might be important for the optimizer that all match arms end in the same set of instructions. This is the commit: wasmi-labs/wasmi@325bdf1 Note that this commit doesn't change semantics, it simply moves a terminator instruction from a closure into the enclosing scope.

^ I will keep this as a reminder to myself for future regressions.
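
A rough sketch of the idea behind that fix, using a hypothetical toy VM (`Vm`, `binary`, `next_instr` are made-up names, and this is not the content of the linked commit): the shared helper only performs the operation, and the instruction-pointer update that terminates each handler is written out in the enclosing match arm, so all non-halting arms end in the same call.

```rust
#[derive(Clone, Copy)]
enum Instr {
    Push(i64),
    Add,
    Sub,
    Halt,
}

struct Vm<'a> {
    code: &'a [Instr],
    ip: usize,
    stack: Vec<i64>,
}

impl<'a> Vm<'a> {
    // The "terminator": advances to the next instruction.
    fn next_instr(&mut self) {
        self.ip += 1;
    }

    // Shared helper for binary operations. It deliberately does NOT
    // advance `ip`; that step stays in the enclosing match arm.
    fn binary(&mut self, f: impl FnOnce(i64, i64) -> i64) -> Option<()> {
        let rhs = self.stack.pop()?;
        let lhs = self.stack.pop()?;
        self.stack.push(f(lhs, rhs));
        Some(())
    }

    fn run(&mut self) -> Option<i64> {
        loop {
            match *self.code.get(self.ip)? {
                Instr::Push(value) => {
                    self.stack.push(value);
                    self.next_instr();
                }
                Instr::Add => {
                    self.binary(i64::wrapping_add)?;
                    self.next_instr();
                }
                Instr::Sub => {
                    self.binary(i64::wrapping_sub)?;
                    self.next_instr();
                }
                Instr::Halt => return self.stack.pop(),
            }
        }
    }
}
```

Whether this uniform arm structure helps is an observation from this one case and depends on optimizer heuristics, so it should be treated as a hint rather than a rule.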

workingjubilee · 2023-03-03T09:25:27Z

@Robbepop Regarding the "one potential solution" header you mentioned a while back. Rust for some time has reserved the word "become" for exactly this reason: it has been imagined that it will be used for explicit tail calls.

So I believe the thing you are talking about is not really controversial. It is mostly awaiting someone assembling a draft RFC, thinking through all the implications, presenting it to the community (and also especially T-lang), and implementing it.

Mind, the same person need not accomplish all of these steps, by far, not even "thinking through all the implications" alone. Indeed, if one were to make this happen, they would probably be best off starting by talking to the people working on MIR opts. If explicit tail calls are something that can be done before code hits LLVM, that simplifies a lot of things (though maybe it makes other things more complex, that sort of thing happens).

Robbepop · 2023-03-05T13:46:12Z

@workingjubilee Thank you for your response!

I think I see where you are heading with your reply ...
The guaranteed tail call proposal has a very long history in Rust. The first mention that I am aware of is this issue by Graydon Hoare himself in 2011: #217
It turned out back then that the ecosystem around Rust was not yet ready, especially LLVM.

Since then there have been many proposals to add tail calls to Rust, among others:

2014: guaranteed tail call elimination rfcs#271 (proposal)
2014: guaranteed tail call elimination rfcs#81 (proposed base implementation for rustc)
2014: guaranteed tail call elimination rfcs#271 (tracking issue for the PR)
2016: Explicit proper tail calls rfcs#1760 (proposal)
2016: https://internals.rust-lang.org/t/pre-rfc-explicit-proper-tail-calls/3797 (Pre-RFC)
2017: Tail Recursion Optimization #41694 (proposal)
2017: Proper tail calls rfcs#1888 (actual RFC)
2017: https://github.com/DemiMarie/rust/tree/explicit-tailcalls (another base implementation in rustc)
2019: Reviving tail-call elimination rfcs#2691 (newest proposal)

A small summary of the issues: rust-lang/rfcs#1888 (comment)

Following the endless discussions in all those threads, I never felt that this feature in particular received a lot of support from the core Rust team. The technical reasons given were usually open questions about borrows/drops, which were resolved by rust-lang/rfcs#2691 (comment) but never received a proper follow-up response, as well as a major open question about the calling ABI, which needs to be adjusted for tail calls from what I understood.
Furthermore, tail calls were frequently and incorrectly perceived as an "elegant" feature for language enthusiasts, oftentimes ignoring the fact that they solve niche problems that cannot be solved with any other language feature available in Rust. Therefore tail call proposals were usually handled as a very low priority feature request.

This gave me personally the feeling that there is missing support from the group of people from which a potential external contributor urgently needs support. Writing yet another pre-RFC, a third base implementation for rustc or another feature proposal issue didn't seem like a good idea to me concerning the history of this feature. What is needed is commitment and support by the language team in order for someone like me to step up.

I am very open to ideas.

pnkfelix · 2023-06-30T15:02:41Z

Discussed in the T-compiler P-high review

At this point I am interpreting this issue as a request for some form of proper tail call (PTC) support. In particular, I interpret the issue author's comments as saying that they are willing to adapt their code in order to get the tail-call-elimination guarantees they want.

I too want some kind of PTC (I make no secret of my Scheme background). But given the many issues that the project has, I also want to make sure we properly prioritize them. In this case, this issue strikes me as a P-medium feature request, not a P-high regression. Therefore I am downgrading it to P-medium.

@rustbot label: -P-high +P-medium

nikic · 2023-06-30T15:04:42Z

FYI there is some ongoing work for implementing tail call support, see #112788 for the tracking issue and rust-lang/rfcs#3407 for recent activity on the RFC.

Robbepop · 2023-06-30T15:44:24Z

@pnkfelix Indeed this issue can be closed once Rust tail calls have been merged as I expect it to fix the underlying issue given it provides the control and stack growth guarantees stated in the current MVP design.

Robbepop added C-bug Category: This is a bug. regression-untriaged Untriaged performance or correctness regression. labels Oct 12, 2022

rustbot added the I-prioritize Issue: Indicates that prioritization has been requested for this issue. label Oct 12, 2022

the8472 added the I-slow Issue: Problems and improvements with respect to performance of generated code. label Oct 12, 2022

apiraino added the T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. label Oct 13, 2022

rustbot added the I-compiler-nominated Nominated for discussion during a compiler team meeting. label Oct 13, 2022

Robbepop mentioned this issue Oct 23, 2022

Compare benchmarks between rustc stable and nightly channels wasmi-labs/wasmi#507

Closed

nikic added the A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. label Oct 23, 2022

workingjubilee added regression-from-stable-to-stable Performance or correctness regression from one stable version to another. and removed regression-untriaged Untriaged performance or correctness regression. labels Mar 3, 2023

Robbepop mentioned this issue Mar 7, 2023

Reviving tail-call elimination rust-lang/rfcs#2691

Open

Robbepop mentioned this issue May 24, 2023

Explicit Tail Calls rust-lang/rfcs#3407

Open

rustbot added P-medium Medium priority and removed P-high High priority labels Jun 30, 2023

workingjubilee mentioned this issue May 29, 2024

Inconsistent recursive fn call elimination #125698

Open

Robbepop mentioned this issue May 30, 2024

Performance regression since v0.32-beta.16 for debug builds with profile overwrites wasmi-labs/wasmi#1048

Closed
