Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Linux/x64: Fix SIMD routines not linking due to illegal relocations #1367

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

teh-cmc
Copy link

@teh-cmc teh-cmc commented Oct 15, 2024

On linux/x64, if using any of the assembly SIMD routines that rely on the tables defined in tables.rs, rav1d will fail to link with the following errors:

mold: error: /home/cmc/dev/rerun-io/rerun/target/debug/deps/libre_rav1d-d1774017185e2712.rlib(mc_avx2.o):(.text): R_X86_64_PC32 relocation at offset 0x7562 against symbol `dav1d_mc_subpel_filters' can not be used; recompile with -fPIC
mold: error: /home/cmc/dev/rerun-io/rerun/target/debug/deps/libre_rav1d-d1774017185e2712.rlib(mc16_avx2.o):(.text): R_X86_64_PC32 relocation at offset 0xa65c against symbol `dav1d_resize_filter' can not be used; recompile with -fPIC
mold: error: /home/cmc/dev/rerun-io/rerun/target/debug/deps/libre_rav1d-d1774017185e2712.rlib(mc16_avx2.o):(.text): R_X86_64_PC32 relocation at offset 0xa687 against symbol `dav1d_resize_filter' can not be used; recompile with -fPIC
mold: error: /home/cmc/dev/rerun-io/rerun/target/debug/deps/libre_rav1d-d1774017185e2712.rlib(mc_avx2.o):(.text): R_X86_64_PC32 relocation at offset 0x7581 against symbol `dav1d_mc_subpel_filters' can not be used; recompile with -fPIC
mold: error: /home/cmc/dev/rerun-io/rerun/target/debug/deps/libre_rav1d-d1774017185e2712.rlib(mc16_avx2.o):(.text): R_X86_64_PC32 relocation at offset 0xa691 against symbol `dav1d_resize_filter' can not be used; recompile with -fPIC
# ... more of the same ...

Long story short, the reason this fails is because these tables are exposed publicly in the final ELF artifact, when they shouldn't be.

The C implementation avoids this problem by making use of compiler intrinsincs:

rav1d/src/tables.h

Lines 113 to 115 in c7d127e

EXTERN const int8_t dav1d_mc_subpel_filters[6][15][8];
EXTERN const int8_t dav1d_mc_warp_filter[193][8];
EXTERN const int8_t dav1d_resize_filter[64][8];

where EXTERN is defined as:
#if (defined(__ELF__) || defined(__MACH__) || (defined(_WIN32) && defined(__clang__))) && __has_attribute(visibility)
#define EXTERN extern __attribute__((visibility("hidden")))
#else
#define EXTERN extern
#endif

As far as I know, there doesn't exist any way of modifying the ELF visibility of Rust symbols directly from code.
Therefore, the best approach I have found is to modify the assembly itself in order to hide these extern references directly, and then the linker gets the hint.

For a long-winded, much more in depth explanation, see:

@thedataking
Copy link
Collaborator

Rav1d uses the assembly routines from dav1d. Have you tested if this is also an issue for dav1d? Is this mold specific?

We'd like to keep using the same assembly routines.

@kkysen
Copy link
Collaborator

kkysen commented Oct 18, 2024

I use mold, too, for development and have never seen this error. That's odd, what exactly are you doing differently?

@teh-cmc
Copy link
Author

teh-cmc commented Oct 19, 2024

Rav1d uses the assembly routines from dav1d. Have you tested if this is also an issue for dav1d?

This is not an issue with dav1d. dav1d's tables use compiler intrinsics to mark the generated ELF symbols as private.
rav1d's tables on the other hand are exposed publicly in the final binaries, and AFAIK Rust does not offer any way to control ELF symbol visibility.

dav1d:

readelf -s --wide src/libdav1d.a | rg dav1d_resize_filter | rg OBJ
     7: 0000000000001260   512 OBJECT  GLOBAL HIDDEN     4 dav1d_resize_filter
                                              ^^^^^^
                                              private

rav1d:

readelf --wide -s target/release/librav1d.a  | rg dav1d_resize_filter | rg OBJ
  5762: 0000000000000000   512 OBJECT  GLOBAL DEFAULT 5171 dav1d_resize_filter
                                              ^^^^^^^
                                              public

If there is in fact a way to do such a thing in Rust, then we could avoid modifying the assembly at all indeed.

See my original message in this thread, and the detailed explanation in rerun-io#3 (comment).

Is this mold specific?

No, this is not specific to any linker. AFAIU it all comes down to how position-independent executables, ASLR, 32bit relocations and load-time/run-time symbol interposition all interact in arcane, annoying ways on Linux/x64.

Here's stock ld failing in the same way:

  = note: /usr/bin/ld: /home/cmc/dev/rerun-io/rerun/target/debug/deps/libre_rav1d-679a6408c2a64e05.rlib(mc16_avx512.o): warning: relocation against `dav1d_mc_warp_filter' in read-only section `.text'
          /usr/bin/ld: /home/cmc/dev/rerun-io/rerun/target/debug/deps/libre_rav1d-679a6408c2a64e05.rlib(looprestoration_sse.o): relocation R_X86_64_PC32 against symbol `dav1d_sgr_x_by_x' can not be used when making a shared object; recompile with -fPIC
          /usr/bin/ld: final link failed: bad value

@teh-cmc
Copy link
Author

teh-cmc commented Oct 19, 2024

I use mold, too, for development and have never seen this error. That's odd, what exactly are you doing differently?

This isn't specific to mold, see my message above.

This will fail to link for any position-independent executable (i.e. all Linux/x64 Rust executables by default) that makes actual use of any of the problematic symbols (i.e. the SIMD-optimized routines that rely on the tables in tables.rs) such that they don't get optimized out at any stage.

It's pretty hard/annoying to minimally reproduce as you need to link to all the right symbols and generate the right 32bit relocations at the right place and time.

If you're on 64bit linux, you shoud be able to repro using:

git clone https://github.com/rerun-io/rerun.git --single-branch --branch cmc/no_simd_patch --depth 1
cd rerun
cargo r -p rerun-cli --no-default-features --features native_viewer,nasm

# grab a coffee...

<link failure>

@kkysen
Copy link
Collaborator

kkysen commented Oct 19, 2024

Sorry, I'm just confused why we have not seen this error before when building the dav1d CLI binary that depends on librav1d. These tables shouldn't be getting optimized out of that, and it compiles fine on x86_64-unknown-linux-gnu.

@kkysen
Copy link
Collaborator

kkysen commented Oct 21, 2024

Also, do you know if there's any effort to get hidden visibility or ELF symbol visibility as a whole into Rust? That would be the best solution if it existed.

I think your solution is quite good, and the asm changes are pretty unobtrusive, so upstreaming future asm changes should be pretty simple with this. But I'm still trying to understand why this is an issue for you and not for us when we build the dav1d CLI.

@teh-cmc
Copy link
Author

teh-cmc commented Oct 21, 2024

Also, do you know if there's any effort to get hidden visibility or ELF symbol visibility as a whole into Rust? That would be the best solution if it existed.

The problem is certainly known in multiple forms:

David Lattimore has a great blog post on the matter, which covers how finicky and hard to reason about all of this is.
It specifically covers shared objects, but applies broadly (and PIE are effectively just shared objects in the end).

As far as I know, the #[linkage = "internal"] attribute is supposed to be Rust's equivalent to C's __attribute__((visibility("hidden"))).
But it is unstable, and likely won't see the light of stable for many more years:

Most things related to these matters are very poorly documented, if at all. It is very possible that a solution exists out there somewhere, but I'm not aware of it.

But I'm still trying to understand why this is an issue for you and not for us when we build the dav1d CLI.

All of this is extremely finicky, filled with corner cases and weird interactions, even depends on the order you build things in. See blog post above for a taste.

I don't understand any of it enough to be able to create a minimal reproduction, the best I have is the Rerun codebase.

In this specific case, you need to hit the right order for nasm to declare the problematic public external references (which it will only do if the original table symbol itself is public -- ultimately the source of all our issues), and the right relocations, and end up with the right object files at the right time for the linker to decide that these things could be runtime-interposed and therefore illegal, etc.
It does not surprise me that years could go by without anybody hitting these kinds of issues.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants