
Introduce a new feature to use dynamic dispatch to select between ADX and portable implementations at runtime #174

Open · wants to merge 4 commits into master

Conversation

@andreacorbellini commented Jul 8, 2023

Summary

If blst is compiled with the new -D__BLST_DYNAMIC__ flag, then blst will use dynamic dispatch to run either optimized ADX code or portable code. Internal functions like mul_mont_384 that have an ADX-optimized counterpart (mulx_mont_384) are bound when blst is loaded by the dynamic linker.

This is an alternative to -D__BLST_PORTABLE__ with better performance characteristics.

How this is implemented

This code uses different implementations depending on the platform it's being compiled for (a sketch follows the list):

  • On GCC/clang with ELF as the target format: `ifunc`. This is the same approach used by glibc for functions like memcpy, which may have different implementations depending on the CPU capabilities. The way `ifunc` works is similar to how symbols from a dynamic library are loaded.
  • On GCC/clang without ELF: function pointers and `constructor`. This is not as efficient as `ifunc` because it generates call instructions with pointers as targets.
  • On MSVC: function pointers and `.CRT$XCU`. Same remarks as above.
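
To make this concrete, here is a minimal sketch of the first two mechanisms. It is illustrative only: the `mul_mont`/`mul_adx`/`mul_portable`/`detect_adx` names and the simplified one-limb signatures are assumptions, not the actual blst code.

```c
#include <stdint.h>
#include <cpuid.h>

typedef uint64_t limb_t;

/* Stand-ins for the two code paths (real bodies elided). */
static void mul_portable(limb_t *r, const limb_t *a, const limb_t *b)
{   r[0] = a[0] * b[0]; }

static void mul_adx(limb_t *r, const limb_t *a, const limb_t *b)
{   r[0] = a[0] * b[0]; /* real code would use MULX/ADCX/ADOX */ }

/* ADX support is reported in CPUID.(EAX=7,ECX=0):EBX, bit 19. */
static int detect_adx(void)
{
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return 0;
    return (ebx >> 19) & 1;
}

#if defined(__ELF__)
/* GCC/clang + ELF: the resolver runs once, when the dynamic linker
 * binds the symbol; afterwards every call goes through an ordinary
 * PLT/GOT indirection, with no per-call capability test. */
static void *mul_mont_resolver(void)
{
    return detect_adx() ? (void *)mul_adx : (void *)mul_portable;
}

void mul_mont(limb_t *r, const limb_t *a, const limb_t *b)
    __attribute__((ifunc("mul_mont_resolver")));
#else
/* GCC/clang without ELF: a function pointer filled in by a constructor
 * before main(); call sites become indirect calls through the pointer.
 * MSVC achieves the same by placing an initializer in .CRT$XCU. */
void (*mul_mont)(limb_t *, const limb_t *, const limb_t *);

__attribute__((constructor)) static void mul_mont_init(void)
{
    mul_mont = detect_adx() ? mul_adx : mul_portable;
}
#endif
```

With `ifunc`, the capability check is paid once at bind time; the fallback instead pays one indirect call per invocation, which is what "call instructions with pointers as targets" refers to above.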

Future improvements

This code dynamically binds low-level internal functions like mul_mont_384, but it may be considerably more efficient to apply the same strategy to global functions like blst_fp_inverse: that would result in fewer relocations and fewer calls through pointers.

Commits

  • Introduce a new feature to use dynamic dispatch to select between ADX and portable implementations at runtime: if blst is compiled with the new `-D__BLST_DYNAMIC__` flag, it will use dynamic dispatch to run either optimized ADX code or portable code. This is an alternative to `-D__BLST_PORTABLE__` with better performance characteristics. The implementation uses `ifunc` on GCC/clang with ELF, function pointers and `constructor` on GCC/clang without ELF, and function pointers and `.CRT$XCU` on MSVC.
  • Also make the `__blst_cpuid` and `__blst_platform_cap` symbols accessible to other C files.
  • This feature triggers compilation with `-D__ADX__ -D__BLST_DYNAMIC__`.
@dot-asm (Collaborator) commented Jul 9, 2023

But the portable build comes with a run-time switch to the ADX code path now[/already]...

@andreacorbellini (Author) commented Jul 10, 2023

> But the portable build comes with a run-time switch to the ADX code path now[/already]...

Yes, I'm aware. That's implemented using jumps (test+jne), while this PR uses indirect calls, which seem to perform better according to benchmarks. Especially benchmarks performed on a real-world application seem to show a consistent speed-up ranging from 6% to 14% when run on a Linux/ELF target (I'm not sure about other targets; maybe performance would be the same as the portable build, I can't tell right now).
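
(For illustration, a minimal self-contained sketch of the two dispatch styles under discussion; the names and the one-limb signature are hypothetical, not actual blst code:)

```c
#include <stdint.h>

typedef uint64_t limb_t;

/* Stand-ins for the ADX and portable code paths. */
static void mul_adx(limb_t *r, const limb_t *a, const limb_t *b)
{   r[0] = a[0] * b[0]; }

static void mul_portable(limb_t *r, const limb_t *a, const limb_t *b)
{   r[0] = a[0] * b[0]; }

/* Portable build today: a capability flag tested on every call;
 * the compiler emits test+jne and both targets remain direct calls. */
static int cap_adx;   /* assumed set once at start-up, e.g. from CPUID */

static void mul_switch(limb_t *r, const limb_t *a, const limb_t *b)
{
    if (cap_adx)
        mul_adx(r, a, b);
    else
        mul_portable(r, a, b);
}

/* This PR's style: the choice is made once at load time, and every
 * call site becomes a single indirect call through the pointer. */
static void (*mul_dynamic)(limb_t *, const limb_t *, const limb_t *) =
    mul_portable;

int main(void)
{
    limb_t r = 0, a = 2, b = 3;
    mul_switch(&r, &a, &b);    /* test+jne dispatch */
    mul_dynamic(&r, &a, &b);   /* indirect-call dispatch */
    return r == 6 ? 0 : 1;
}
```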

I would also argue that using indirect calls simplifies the code compared to conditional jumps; however, my PR adds a new feature rather than changing the existing portable feature, so this is a moot point.

Last but not least, if the technique used in this PR were applied to the top-level functions (blst_*), then I believe the performance difference between portable and force-adx would disappear (on Linux/ELF) when blst is used as a dynamic library, because the number of indirect calls would be essentially the same. Again, my PR does not do that, but I would like to hear your opinion on the approach before I invest time in a larger PR. I remember reading in a past GitHub issue some concerns regarding branch target injection; do those concerns still exist?

@dot-asm (Collaborator) commented Jul 10, 2023

> while this PR uses indirect calls, which seem to perform better according to benchmarks. Especially benchmarks performed on a real-world application seem to show a consistent speed-up ranging from 6x to 12x,

This arguably qualifies as an extraordinary claim, because, other things being equal, there is no mechanism for an indirect jump to deliver better performance, let alone that much better. It's either a misinterpretation of the results or an indication of an anomaly that you hit accidentally. The latter would mean that the suggested approach also runs the risk of a comparable penalty. Which, again, is essentially inconceivable. So how do I replicate the results? Provide concrete instructions.

> if the technique used in this PR were applied to the top-level functions (blst_*)

There is an established mechanism to achieve that as is: compile a pair of shared libraries and choose between the two at application start-up time.
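
(A sketch of that mechanism, assuming two builds of blst named `libblst-adx.so` and `libblst-portable.so`; the library names and the `detect_adx` helper are illustrative. POSIX `dlopen` shown; link with `-ldl`.)

```c
#include <cpuid.h>
#include <dlfcn.h>
#include <stdio.h>

/* ADX support is reported in CPUID.(EAX=7,ECX=0):EBX, bit 19. */
static int detect_adx(void)
{
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return 0;
    return (ebx >> 19) & 1;
}

int main(void)
{
    /* Pick one of the two builds once, at start-up. */
    const char *lib = detect_adx() ? "libblst-adx.so"
                                   : "libblst-portable.so";
    void *h = dlopen(lib, RTLD_NOW | RTLD_GLOBAL);
    if (h == NULL) {
        fprintf(stderr, "dlopen: %s\n", dlerror());
        return 1;
    }
    /* Resolve whichever entry points the application needs, e.g.: */
    void *fp = dlsym(h, "blst_fp_inverse");
    printf("blst_fp_inverse resolved at %p\n", fp);
    dlclose(h);
    return 0;
}
```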

@dot-asm (Collaborator) commented Jul 10, 2023

> I would like to hear your opinion on the approach before I invest time in a larger PR.

My personal [opinion] is that replicating established mechanisms, a.k.a. shared libraries, falls far beyond the focus[/scope] of this project.

@dot-asm (Collaborator) commented Jul 10, 2023

> So how do I replicate the results? Provide concrete instructions.

Just in case, this implies that you also share your results... [And the expectation is that switching from ADX to portable would result in a significant performance drop.]

@mratsim (Contributor) commented Jul 10, 2023

> > But the portable build comes with a run-time switch to the ADX code path now[/already]...
>
> Yes, I'm aware. That's implemented using jumps (test+jne), while this PR uses indirect calls, which seem to perform better according to benchmarks. Especially benchmarks performed on a real-world application seem to show a consistent speed-up ranging from 6% to 14% when run on a Linux/ELF target (I'm not sure about other targets; maybe performance would be the same as the portable build, I can't tell right now).

In my testing against BLST, having test+jumps has no measurable overhead even compared to hardcoded function calls (i.e. before the runtime selection).

The same holds against MCL's JIT with indirect calls (https://github.com/herumi/mcl/).

Tests were done on everything from low-level field multiplications to high-level BLS signature verification, including the extension-field and elliptic-curve arithmetic in between.

I've detailed the current hardware architecture, in particular regarding branch prediction, in this comment: #10 (comment). Note that the branches in this case are 100% predictable.

@dot-asm (Collaborator) commented Jul 10, 2023

> from 6% to 14%,

Oh, so the results were misinterpreted! I mean, originally it read 6x to 12x. Well, small percentages are more imaginable. As already mentioned, any particular approach is prone to hitting some anomalies[/quirks] in hardware, and the variations don't actually reflect inherent advantages of any particular approach, but rather the circumstantial nature of the performance of complex execution flows. Granted, 14% is arguably hard to tolerate, but it still doesn't speak in favour of indirect branches' superiority. So do still tell me how to reproduce it :-)

> Note that the branches in this case are 100% predictable.

Branch prediction logic is akin to a cache, in the sense that predictions can be evicted and, as a result, branches rendered non-predicted. As already implied in #10, "100%" applies in a controlled environment like a benchmark, but in real life it's more complicated. Heck, you can run into anomalies even in controlled environments; see https://github.com/supranational/blst/blob/master/src/asm/ct_inverse_mod_256-x86_64.pl#L746 for an even local example...

@dot-asm (Collaborator) commented Jul 10, 2023

"100%"

Just in case: it's not like branch predictability is the only factor that can affect performance. In other words, it's not a given that the variations in question are the direct[/sole] effect of interference with the branch prediction logic, so let's not hang on to it... [The remark was rather a reaction to "100%" itself. I mean, real life is never "100%." ;-)]

@dot-asm (Collaborator) commented Jul 10, 2023

> variations don't actually reflect inherent advantages of any particular approach

To clarify: with respect to performance. The thing is that there are other criteria that affect the choice. For example, in my book direct jumps are preferable because of the security implications of indirect ones...

@mratsim (Contributor) commented Jul 11, 2023

> The remark was rather a reaction to "100%" itself. I mean, real life is never "100%."

No worries, don't deal in absolutes ;)

> Heck, you can run into anomalies even in controlled environments; see https://github.com/supranational/blst/blob/master/src/asm/ct_inverse_mod_256-x86_64.pl#L746 for an even local example...

https://easyperf.net/blog/2018/01/18/Code_alignment_issues


and also https://lkml.org/lkml/2015/5/21/443

and, for anomalies in benchmarks in general and not just code alignment:

https://web.archive.org/web/20170929071306/https://www.cis.upenn.edu/~cis501/papers/producing-wrong-data.pdf

@dot-asm (Collaborator) commented Jul 11, 2023

> Code_alignment_issues

Come on! As if I don't know the stuff. My remark was actually more nuanced than that. Well, of course the nuance is not obvious, but that shouldn't grant the assumption that it was a trivial matter of the alignment of the subroutine's entry point ;-) The hint was the explicit reference to Coffee Lake, because it's a problem specifically on Coffee Lake. And 40% is totally disproportionate in the context: a loop with >16 iterations and strong data dependencies between iterations. Oh well...

> producing-wrong-data.pdf

This kind of aligns with what I'm talking about. One data point is just one data point; one can't use it to reach an overarching conclusion the way the OP suggests. The correct course of action is to reproduce, identify the root cause, and then see where it leads you...

@dot-asm (Collaborator) commented Jul 28, 2023

> The correct course of action is to reproduce, ...
