avx512 features no longer detected in target images in v1.11 #56177

Closed
Vobarkun opened this issue Oct 15, 2024 · 19 comments
Labels
regression 1.11 Regression in the 1.11 release

Comments

@Vobarkun

I use julia on a heterogeneous compute cluster, consisting of many different nodes with different CPUs, but using a single shared file system. This has created problems before: In 1.10, precompilation is triggered each time a project is used on a different node. However, this was easily fixed by using separate projects. In 1.11 this no longer seems to be the case.

For example, let's say I create two projects, env1 and env2. I load env1 on my workstation (Intel Xeon W-2223), add a package (say JLD2) and precompile. Then I load env2 on a compute node (AMD EPYC 7302) and add the same package. Despite the different CPU, no precompilation is triggered. Then, when I try to run some code, julia crashes on an invalid instruction:

julia> using JLD2

julia> jldsave("test.jld2", a=rand(100))
Invalid instruction at 0x1552538da346: 0x62, 0xf2, 0xfd, 0x28, 0x7c, 0xc0, 0xc4, 0xc1, 0x7e, 0x7f, 0x44, 0x24, 0x10, 0x4d, 0x89

[1482091] signal 4 (2): Illegal instruction
in expression starting at REPL[3]:1
MmapIO at /home/sschult/.julia/packages/JLD2/3zWRM/src/io/mmapio.jl:14 [inlined]
MmapIO at /home/sschult/.julia/packages/JLD2/3zWRM/src/io/mmapio.jl:113
openfile at /home/sschult/.julia/packages/JLD2/3zWRM/src/JLD2.jl:146 [inlined]
openfile at /home/sschult/.julia/packages/JLD2/3zWRM/src/JLD2.jl:151
#jldopen#22 at /home/sschult/.julia/packages/JLD2/3zWRM/src/JLD2.jl:215
jldopen at /home/sschult/.julia/packages/JLD2/3zWRM/src/JLD2.jl:164 [inlined]
#jldopen#23 at /home/sschult/.julia/packages/JLD2/3zWRM/src/JLD2.jl:286 [inlined]
jldopen at /home/sschult/.julia/packages/JLD2/3zWRM/src/JLD2.jl:279 [inlined]
#jldsave#107 at /home/sschult/.julia/packages/JLD2/3zWRM/src/loadsave.jl:286
jldsave at /home/sschult/.julia/packages/JLD2/3zWRM/src/loadsave.jl:283 [inlined]
jldsave at /home/sschult/.julia/packages/JLD2/3zWRM/src/loadsave.jl:283
unknown function (ip: 0x15525cb97c66)
jl_apply at /cache/build/builder-amdci5-1/julialang/julia-master/src/julia.h:2157 [inlined]
do_call at /cache/build/builder-amdci5-1/julialang/julia-master/src/interpreter.c:126
eval_value at /cache/build/builder-amdci5-1/julialang/julia-master/src/interpreter.c:223
eval_stmt_value at /cache/build/builder-amdci5-1/julialang/julia-master/src/interpreter.c:174 [inlined]
eval_body at /cache/build/builder-amdci5-1/julialang/julia-master/src/interpreter.c:663
jl_interpret_toplevel_thunk at /cache/build/builder-amdci5-1/julialang/julia-master/src/interpreter.c:821
jl_toplevel_eval_flex at /cache/build/builder-amdci5-1/julialang/julia-master/src/toplevel.c:943
jl_toplevel_eval_flex at /cache/build/builder-amdci5-1/julialang/julia-master/src/toplevel.c:886
ijl_toplevel_eval_in at /cache/build/builder-amdci5-1/julialang/julia-master/src/toplevel.c:994
eval at ./boot.jl:430 [inlined]
eval_user_input at /cache/build/builder-amdci5-1/julialang/julia-master/usr/share/julia/stdlib/v1.11/REPL/src/REPL.jl:226
repl_backend_loop at /cache/build/builder-amdci5-1/julialang/julia-master/usr/share/julia/stdlib/v1.11/REPL/src/REPL.jl:323
#start_repl_backend#59 at /cache/build/builder-amdci5-1/julialang/julia-master/usr/share/julia/stdlib/v1.11/REPL/src/REPL.jl:308
start_repl_backend at /cache/build/builder-amdci5-1/julialang/julia-master/usr/share/julia/stdlib/v1.11/REPL/src/REPL.jl:305
#run_repl#72 at /cache/build/builder-amdci5-1/julialang/julia-master/usr/share/julia/stdlib/v1.11/REPL/src/REPL.jl:464
run_repl at /cache/build/builder-amdci5-1/julialang/julia-master/usr/share/julia/stdlib/v1.11/REPL/src/REPL.jl:450
jfptr_run_repl_10212 at /home/sschult/.julia/juliaup/julia-1.11.0+0.x64.linux.gnu/share/julia/compiled/v1.11/REPL/u0gqU_bFCI4.so (unknown line)
#1138 at ./client.jl:446
jfptr_YY.1138_14881 at /home/sschult/.julia/juliaup/julia-1.11.0+0.x64.linux.gnu/share/julia/compiled/v1.11/REPL/u0gqU_bFCI4.so (unknown line)
jl_apply at /cache/build/builder-amdci5-1/julialang/julia-master/src/julia.h:2157 [inlined]
jl_f__call_latest at /cache/build/builder-amdci5-1/julialang/julia-master/src/builtins.c:875
#invokelatest#2 at ./essentials.jl:1054 [inlined]
invokelatest at ./essentials.jl:1051 [inlined]
run_main_repl at ./client.jl:430
repl_main at ./client.jl:567 [inlined]
_start at ./client.jl:541
jfptr__start_72051.1 at /home/sschult/.julia/juliaup/julia-1.11.0+0.x64.linux.gnu/lib/julia/sys.so (unknown line)
jl_apply at /cache/build/builder-amdci5-1/julialang/julia-master/src/julia.h:2157 [inlined]
true_main at /cache/build/builder-amdci5-1/julialang/julia-master/src/jlapi.c:900
jl_repl_entrypoint at /cache/build/builder-amdci5-1/julialang/julia-master/src/jlapi.c:1059
main at /cache/build/builder-amdci5-1/julialang/julia-master/cli/loader_exe.c:58
__libc_start_call_main at /lib64/libc.so.6 (unknown line)
__libc_start_main at /lib64/libc.so.6 (unknown line)
unknown function (ip: 0x4010b8)
Allocations: 1547120 (Pool: 1547037; Big: 83); GC: 3
Illegal instruction (core dumped)

The instruction in question appears to be vpbroadcastq from the AVX-512 instruction set, which, indeed, the Intel Xeon W-2223 supports and the AMD EPYC 7302 does not.

  • This works without errors in 1.10.5.
  • This problem is not specific to JLD2. I have obtained similar errors when using, for example, CairoMakie.jl, CUDA.jl or Arrow.jl, which error on other instructions from the same set.
  • If I instead use env2 on another node with e.g. an Intel Xeon E5-2698 v3, which also does not support AVX-512, the behaviour is different: precompilation is triggered, and no error is thrown.
  • If I delete .julia/compiled/v1.11, and first load env2 on the compute node, everything works fine, until I use env1 on my workstation, after which the same error occurs on the compute node.
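A quick way to check which nodes advertise AVX-512 is to look at the CPU flags the kernel reports. The sketch below assumes Linux and a readable /proc/cpuinfo; `cpu_has_flag` is a hypothetical helper name, not part of Julia or any existing tool.

```shell
# Hypothetical helper: succeed if the CPU flags line lists the given flag.
# Assumes a Linux-style cpuinfo file; the optional second argument lets
# you point it at a saved copy instead of /proc/cpuinfo.
cpu_has_flag() {
    flag="$1"
    cpuinfo="${2:-/proc/cpuinfo}"
    # Take the first "flags" line, split it into one word per line, and
    # require an exact match (so "avx512f" does not match "avx512fp16").
    grep -m1 '^flags' "$cpuinfo" | tr ' \t' '\n\n' | grep -qxF -- "$flag"
}
```

Running `cpu_has_flag avx512f && echo yes || echo no` on each node would be expected to print `yes` on the workstation and `no` on the EPYC 7302 node, given the CPUs described above.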
@gbaraldi
Member

This code shouldn't have been allowed to load. @vchuravy

gbaraldi added the "regression 1.11" (Regression in the 1.11 release) label on Oct 15, 2024
@giordano
Contributor

For debugging this, before loading JLD2 do

ENV["JULIA_DEBUG"] = "loading"
using JLD2

You should see a line like

┌ Debug: Loading object cache file /depot/compiled/v1.11/JLD2/bla_blah.so for JLD2 [...]
└ @ Base loading.jl:1203

Then run (basically you need to replace the extension .so of the object cache file above with .ji):

Base.parse_image_targets(Base.parse_cache_header("/depot/compiled/v1.11/JLD2/bla_blah.ji")[7])

What do you get here?
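The path substitution described above can be scripted. This is a minimal sketch that assumes the Linux `.so` extension for the object cache file; `ji_path_for` and `image_targets_expr` are hypothetical helper names, and the Julia expression is the one quoted in this comment.

```shell
# Hypothetical helper: map an object cache path (.so) to its cache
# header file (.ji) by swapping the extension.
ji_path_for() {
    so_path="$1"
    printf '%s\n' "${so_path%.so}.ji"
}

# Build (but do not run) the Julia expression quoted above for that file.
image_targets_expr() {
    printf 'Base.parse_image_targets(Base.parse_cache_header("%s")[7])\n' \
        "$(ji_path_for "$1")"
}
```

With the path taken from the debug log, something like `julia -E "$(image_targets_expr /depot/compiled/v1.11/JLD2/bla_blah.so)"` should print the image targets recorded for that cache file.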

@Vobarkun
Author

julia> Base.parse_image_targets(Base.parse_cache_header("/home/sschult/.julia/compiled/v1.11/JLD2/O1EyT_NIQbS.ji")[7])
1-element Vector{Base.ImageTarget}:
 cascadelake; flags=0; features_en=(sse3, pclmul, ssse3, fma, cx16, sse4.1, sse4.2, movbe, popcnt, aes, xsave, avx, f16c, fsgsbase, bmi, avx2, bmi2, adx, clflushopt, clwb, sahf, lzcnt, prfchw, xsavec, xsaves)

@giordano
Contributor

Ok, to be clear, you should do that test in the situation where you get the "wrong" JLD2 with the incompatible code. This image doesn't seem to have the avx512 feature (assuming JLD2 is indeed the offending package here).

@Vobarkun
Author

That's exactly what I did. In fact, the same .so file is reported on both machines, and the output is exactly the same.

@Vobarkun
Author

In 1.10, I get

julia> Base.parse_image_targets(Base.parse_cache_header("/home/sschult/.julia/compiled/v1.10/JLD2/O1EyT_NQjXZ.ji")[7])
1-element Vector{Base.ImageTarget}:
 cascadelake; flags=0; features_en=(sse3, pclmul, ssse3, fma, cx16, sse4.1, sse4.2, movbe, popcnt, aes, xsave, avx, f16c, fsgsbase, bmi, avx2, bmi2, avx512f, avx512dq, adx, clflushopt, clwb, avx512cd, avx512bw, avx512vl, avx512vnni, sahf, lzcnt, prfchw, xsavec, xsaves)

which does include avx512, so I guess something causes this to be stored incorrectly in 1.11.
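Comparing two of these long feature lists by eye is error-prone. The following sketch extracts the `features_en=(...)` list from two printed `ImageTarget` lines and prints the features that appear only in the first; `features_of` and `missing_features` are hypothetical helper names, and the parsing only assumes the textual format shown above.

```shell
# Print the features_en=(...) entries of one ImageTarget line, one per line.
features_of() {
    printf '%s\n' "$1" \
        | sed -n 's/.*features_en=(\([^)]*\)).*/\1/p' \
        | tr ',' '\n' \
        | sed 's/^ *//; s/ *$//; /^$/d'
}

# Print the features present in the first line but absent from the second.
missing_features() {
    new_list=$(mktemp)
    features_of "$2" > "$new_list"
    # -x: whole-line match, -F: fixed strings, -v: keep non-matching lines
    features_of "$1" | grep -vxF -f "$new_list"
    rm -f "$new_list"
}
```

Applied to the 1.10 and 1.11 lines quoted above, `missing_features "$line_1_10" "$line_1_11"` lists avx512f, avx512dq, avx512cd, avx512bw, avx512vl and avx512vnni.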

@Vobarkun
Author

After some further testing, I found that the issue first appears in 1.11.0-rc4, with rc3 unaffected. Also, in rc3, Base.current_image_targets() correctly contains the avx512 features, whereas in rc4 it does not:

:~> julia +1.11.0-rc3 -e "println(Base.current_image_targets())"
Base.ImageTarget[cascadelake; flags=0; features_en=(sse3, pclmul, ssse3, fma, cx16, sse4.1, sse4.2, movbe, popcnt, aes, xsave, avx, f16c, fsgsbase, bmi, avx2, bmi2, avx512f, avx512dq, adx, clflushopt, clwb, avx512cd, avx512bw, avx512vl, avx512vnni, sahf, lzcnt, prfchw, xsavec, xsaves)]
:~> julia +1.11.0-rc4 -e "println(Base.current_image_targets())"
Base.ImageTarget[cascadelake; flags=0; features_en=(sse3, pclmul, ssse3, fma, cx16, sse4.1, sse4.2, movbe, popcnt, aes, xsave, avx, f16c, fsgsbase, bmi, avx2, bmi2, adx, clflushopt, clwb, sahf, lzcnt, prfchw, xsavec, xsaves)]

@giordano
Contributor

That's very interesting. Would you be able to run git bisect? This is the diff: v1.11.0-rc3...v1.11.0-rc4. There are only 35 commits between the two versions, but honestly at a quick glance I can't spot a change that would affect this.

@giordano
Contributor

giordano commented Oct 17, 2024

I don't have the time to git bisect right now, though I might be able to do it later, but I can confirm that the avx512 features are also gone on skylake-avx512:

$ julia -E 'Base.current_image_targets()'
Base.ImageTarget[skylake-avx512; flags=0; features_en=(sse3, pclmul, ssse3, fma, cx16, sse4.1, sse4.2, movbe, popcnt, aes, xsave, avx, f16c, fsgsbase, bmi, avx2, bmi2, adx, clflushopt, clwb, pku, sahf, lzcnt, prfchw, xsavec, xsaves)]

giordano changed the title from "Invalid instruction in v1.11" to "avx512 features no longer detected in target images in v1.11" on Oct 17, 2024
@Vobarkun
Author

From checking the automatic builds, 50c1ea8 appears to be the first commit affected.

@giordano
Contributor

It'd be very surprising if that was the commit causing this; I can't see it affecting feature detection: current_image_targets simply parses the output of the C function jl_reflect_clone_targets, which that commit doesn't touch:

julia/base/loading.jl

Lines 1735 to 1738 in 1f935af

function current_image_targets()
    targets = @ccall jl_reflect_clone_targets()::Vector{UInt8}
    return parse_image_targets(targets)
end

For the record, the issue seems to be solved on master (d36417b), it affects only v1.11, we need to find what fixed it (besides what caused it):

$ julia +1.10 -E 'Base.current_image_targets()'
Base.ImageTarget[znver3; flags=0; features_en=(sse3, pclmul, ssse3, fma, cx16, sse4.1, sse4.2, movbe, popcnt, aes, xsave, avx, f16c, fsgsbase, bmi, avx2, bmi2, avx512f, avx512dq, adx, avx512ifma, clflushopt, clwb, avx512cd, sha, avx512bw, avx512vl, avx512vbmi, pku, avx512vbmi2, shstk, gfni, vaes, vpclmulqdq, avx512vnni, avx512bitalg, avx512vpopcntdq, rdpid, sahf, lzcnt, sse4a, prfchw, mwaitx, xsavec, xsaves, clzero, wbnoinvd, avx512bf16)]
$ julia +1.11 -E 'Base.current_image_targets()'
Base.ImageTarget[znver4; flags=0; features_en=(sse3, pclmul, ssse3, fma, cx16, sse4.1, sse4.2, movbe, popcnt, aes, xsave, avx, f16c, fsgsbase, bmi, avx2, bmi2, adx, clflushopt, clwb, sha, pku, shstk, gfni, vaes, vpclmulqdq, rdpid, sahf, lzcnt, sse4a, prfchw, mwaitx, xsavec, xsaves, clzero, wbnoinvd)]
$ julia +nightly -E 'Base.current_image_targets()'
Base.ImageTarget[znver4; flags=0; features_en=(sse3, pclmul, ssse3, fma, cx16, sse4.1, sse4.2, movbe, popcnt, aes, xsave, avx, f16c, fsgsbase, bmi, avx2, bmi2, avx512f, avx512dq, adx, avx512ifma, clflushopt, clwb, avx512cd, sha, avx512bw, avx512vl, avx512vbmi, pku, avx512vbmi2, shstk, gfni, vaes, vpclmulqdq, avx512vnni, avx512bitalg, avx512vpopcntdq, rdpid, sahf, lzcnt, sse4a, prfchw, mwaitx, xsavec, xsaves, clzero, wbnoinvd, avx512bf16)]
$ julia +nightly -e 'using InteractiveUtils; versioninfo()'
Julia Version 1.12.0-DEV.1421
Commit d36417b8230 (2024-10-17 17:37 UTC)
Build Info:
  Official https://julialang.org release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 384 × AMD EPYC 9654 96-Core Processor
  WORD_SIZE: 64
  LLVM: libLLVM-18.1.7 (ORCJIT, znver4)
Threads: 1 default, 0 interactive, 1 GC (on 384 virtual cores)

@vtjnash
Member

vtjnash commented Oct 17, 2024

That makes it sound like it might be capturing some values from the build machines and not correctly getting those from JULIA_CPU_TARGET (aka #54093)?

@vtjnash
Member

vtjnash commented Oct 17, 2024

Although note that loading (specifically staticdata.c) is supposed to reject loading a pkgimage that requires more features than are present on the current machine, even if loading.jl makes a mistake, to prevent issues like this. So there are multiple levels of errors and failures here.

@giordano
Contributor

giordano commented Oct 17, 2024

> That makes it sound like it might be capturing some values from the build machines

That'd be znver2, not cascadelake, skylake-avx512 or znver4, and if you compare the features on 1.11.0(-rc4) in #56177 (comment), #56177 (comment) and #56177 (comment) (I used two different clusters), the sets are all different (my skylake-avx512 has pku in addition to what @Vobarkun has, and my znver4 has mwaitx, clzero, rdpid, sha, shstk, gfni, sse4a, wbnoinvd, vpclmulqdq, vaes in addition to my skylake-avx512).

> not correctly getting those from JULIA_CPU_TARGET

The current setting of JULIA_CPU_TARGET on x86_64 is actually more restrictive than the sets we showed above:

4-element Vector{Base.ImageTarget}:
 generic; flags=0; features_en=(cx16)
 sandybridge; flags=0; features_en=(sse3, pclmul, ssse3, cx16, sse4.1, sse4.2, popcnt, xsave, avx, sahf)
 haswell; flags=0; features_en=(sse3, pclmul, ssse3, fma, cx16, sse4.1, sse4.2, movbe, popcnt, xsave, avx, f16c, fsgsbase, bmi, avx2, bmi2, sahf, lzcnt)
 x86-64-v4; flags=32; features_en=()

x86-64-v4 is actually empty, so the largest set is haswell, which is smaller than all the sets we showed above.
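To see how the four targets listed above map onto a JULIA_CPU_TARGET string, the sketch below splits such a string on `;` (one entry per target) and keeps the part before the first `,` (the target name). The string is the one exported in the bisect script later in this thread; that it matches the setting of the official builds is an assumption, and `cpu_target_names` is a hypothetical helper name.

```shell
# Hypothetical helper: print the base name of each target in a
# JULIA_CPU_TARGET string (targets separated by ';', options by ',').
cpu_target_names() {
    printf '%s\n' "$1" | tr ';' '\n' | cut -d, -f1
}

cpu_target_names 'generic;sandybridge,-xsaveopt,clone_all;haswell,-rdrnd,base(1);x86-64-v4,-rdrnd,base(1)'
```

For that string the names come out as generic, sandybridge, haswell and x86-64-v4, matching the four ImageTarget entries above.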

@giordano
Contributor

giordano commented Oct 17, 2024

Ok, with

git bisect reset
git bisect start
git bisect good v1.11.0-rc3
git bisect bad v1.11.0-rc4
git bisect run ./bisect.sh

and the following bisect.sh script

#!/bin/bash

export JULIA_CPU_TARGET="generic;sandybridge,-xsaveopt,clone_all;haswell,-rdrnd,base(1);x86-64-v4,-rdrnd,base(1)"

make cleanall
make -j

./julia -E 'Base.current_image_targets()' | grep avx512f

I confirmed that #55729 is indeed the culprit on the v1.11 release branch:

50c1ea848579ddc99e3c3633b85669988c2c89f2 is the first bad commit
commit 50c1ea848579ddc99e3c3633b85669988c2c89f2
Author: Ian Butterworth <[email protected]>
Date:   Wed Sep 11 11:50:05 2024 -0400

    Precompile the `@time_imports` printing so it doesn't confuse reports (#55729)

    Makes functions for the report printing that can be precompiled into the
    sysimage.

    (cherry picked from commit 255162c7197e973d0427cc11d1e0117cdd76a1bf)

 base/loading.jl                | 94 +++++++++++++++++++++++++-----------------
 contrib/generate_precompile.jl |  9 ++++
 2 files changed, 66 insertions(+), 37 deletions(-)

Note that setting JULIA_CPU_TARGET during the build is necessary to replicate the bug: it doesn't trigger without that, nor when setting JULIA_PRECOMPILE=0.

Note that I did this on an avx512 machine, which rules out the build picking up bad properties of the build machine's CPU: the issue happens regardless of the build machine.

But this also doesn't reproduce on 255162c, the merge commit of #55729 on master, so yeah, there are multiple levels of errors here.
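One caveat with a bisect script of this shape: if a commit fails to build, the final `grep` also fails and `git bisect run` counts the commit as bad. `git bisect run` treats exit code 125 as "skip this commit", so a more defensive variant could separate the two cases. This is only a sketch, not the script used above, and `bisect_verdict` is a hypothetical helper name.

```shell
# Map a build status and the printed image targets to a git-bisect
# exit code: 125 = skip (build failed), 0 = good, 1 = bad.
bisect_verdict() {
    build_status="$1"
    targets="$2"
    if [ "$build_status" -ne 0 ]; then
        return 125   # the commit could not be tested at all
    fi
    case "$targets" in
        *avx512f*) return 0 ;;   # avx512 features still reported
        *)         return 1 ;;   # avx512 features missing
    esac
}
```

In the script, the last lines would then become roughly `make -j; status=$?` followed by `bisect_verdict "$status" "$(./julia -E 'Base.current_image_targets()')"`, so that the function's return value is the script's exit code.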

@giordano
Contributor

I can reproduce the issue on 255162c if I revert ad407a6, merge commit of #54471 on master, which wasn't backported to release-1.11. Sounds like backporting that PR should fix the issue.

@IanButterworth
Member

#55729 perhaps unwisely added precompiling rand(2,2) * rand(2,2). Could that be the critical change in that PR?

@giordano
Contributor

That's indeed the issue! It's the call to rand which breaks this; replacing those rand calls with ones solves the issue for me on v1.11.0-rc4. But I have no clue why this is happening.

giordano added a commit that referenced this issue Oct 18, 2024
This change by itself doesn't do anything significant on `master`, but
when backported to the v1.11 branch it'll address #56177. However it'd
be great if someone could tell _why_ this fixes that issue, because it
looks very unrelated.

---------

Co-authored-by: Ian Butterworth <[email protected]>
giordano added a commit that referenced this issue Oct 19, 2024
(cherry picked from commit f36f342)
@maleadt
Member

maleadt commented Oct 21, 2024

I guess this can be closed now that #56239 has been merged in the backports branch.

maleadt closed this as completed on Oct 21, 2024
KristofferC pushed a commit that referenced this issue Oct 21, 2024
6 participants