Atomic coherency test #11
base: main
Conversation
This overwrites the compute-shader-hello example to be a test of atomic coherency. My understanding is that with the barriers it should run with 0 failures, even in strategy 0. With strategy 1 (atomicOr) as a workaround, it seems to be working.
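For readers coming to this cold, here is a minimal sketch of what the two load strategies amount to for the data word. This is current WGSL syntax, and the buffer name and layout are illustrative rather than the exact code in the example:

```wgsl
// Illustrative only; DataBuf / data_buf are assumed names, not the example's.
struct DataBuf {
    data: array<atomic<u32>>,
}

@group(0) @binding(0) var<storage, read_write> data_buf: DataBuf;

// Strategy 0: a plain (relaxed) atomic load of the data word.
fn load_strategy0(ix: u32) -> u32 {
    return atomicLoad(&data_buf.data[ix]);
}

// Strategy 1: atomicOr with 0, a read-modify-write that returns the old
// value. It should be semantically equivalent to atomicLoad, but in
// practice it works around the failures observed here.
fn load_strategy1(ix: u32) -> u32 {
    return atomicOr(&data_buf.data[ix], 0u);
}
```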
Force-pushed from a70dd7f to 0594d07.
A bit of explanation: this is an attempt to make a clean, isolated test for some atomic coherence failures I've been observing. It's written in WGSL and uses WebGPU, but on the system I've been exploring (AMD 5700XT), I am seeing similar failures in a Vulkan-only context. One advantage to having the test case be in WGSL is that there is one concept of what the WGSL memory model should be, even if it isn't written down yet. By contrast, in Vulkan-land, the ideas are spread across Vulkan, SPIR-V, and (to a lesser extent) GLSL. Further, there is considerable diversity in how much the memory model is "turned on," in particular whether SPV 1.5 turns on the VulkanMemoryModel capability and whether the corresponding feature is enabled in device creation; the sense I get is that this is fairly rare, and it's far more common to be running in some kind of hybrid compatibility mode.

In any case, it's my feeling that the test in this PR should always pass. It fails on the 5700XT. I find that there are two workarounds, which I think shed some light on the issue.

The first is to use ordinary loads and stores rather than atomics, and to enable the coherent qualifier on the buffer. This is basically what the current piet-gpu code does, as those shaders are written in GLSL with an eye towards compatibility, i.e. not requiring an explicit Vulkan memory model feature. However, this is not consistent with the current approach for translating WGSL. It's possible that the recommended WGSL translation should change to be more like this, for better compatibility. One way of thinking about this is that there's no way (that I know of) to get the current WGSL translation output from GLSL, so making it resemble existing GLSL translation output might reduce the number of exciting and fun bugs (such as this one).

The second is to use atomicOr(, 0) instead of atomicLoad for the load of the data word. By my reading, the semantics of the two should be identical (with respect to both the yet-to-be-written WGSL memory model and the actual Vulkan one), but the former conveys more of an intent of "actually go to memory rather than cache". I can see doing that to unblock my work in getting a webgpu prefix sum implementation out there.

I've done a number of experiments to validate that this is downstream of wgpu's shader translation (naga). One is to use tint to generate spv. Another is to manually add OpMemberDecorate Coherent to the data buffer (which I now understand is not expected to make any difference for purely atomic operations; it affects the handling of ordinary loads and stores). Finally, I've done an experiment of fully enabling the Vulkan memory model (on Vk 1.2 device creation), and adding MakeAvailable+Release / MakeVisible+Acquire memory semantics flags to the relevant atomics (this would not be viable from WGSL sources, as only relaxed semantics are currently spec'ed). None of these make a difference.

I'd like to get some form of this into the webgpu test suite. That should hopefully have the effect of shaking out bugs in implementations.
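On the first workaround, note that WGSL itself has no coherent qualifier, so at the WGSL source level that variant just looks like plain non-atomic storage accesses; the Coherent decoration would come from the GLSL source or be added during translation to SPIR-V. A minimal sketch, with assumed names:

```wgsl
// Illustrative only; in the GLSL/SPIR-V form of this workaround the buffer
// itself carries the Coherent decoration, which has no WGSL surface syntax.
struct PlainBuf {
    data: array<u32>,
}

@group(0) @binding(0) var<storage, read_write> plain_buf: PlainBuf;

// Ordinary (non-atomic) store and load of the data word.
fn store_data(ix: u32, v: u32) {
    plain_buf.data[ix] = v;
}

fn load_data(ix: u32) -> u32 {
    return plain_buf.data[ix];
}
```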
Another fascinating datapoint: if I fully enable the Vulkan memory model and change the load of the data word to […]

Also note, I have to hand-edit the spv for this to pass validation. If I simply add these flags and use glslangValidator to compile, it produces output that's missing […]
This is probably the version of the test I want to promote.
Force-pushed from 8602338 to fc7dc5d.
The test is both simpler and more reliable in catching problems. There's no looping in the shader, so it's basically the classic "message passing" litmus test, run in parallel and with a simple permutation (multiplication by primes mod 64k) to mix up memory accesses. It also runs 1000 dispatches. This reliably catches thousands of failures on my 5700XT.
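To spell out the permutation trick (the primes below are placeholders, not necessarily the ones the test uses): multiplying an index by an odd prime is a bijection mod 64k, so every slot is still written and read exactly once, but neighboring invocations end up touching widely separated slots.

```wgsl
// Hypothetical permutation helpers; the specific primes are illustrative.
// Any odd multiplier is invertible mod 65536 (a power of two), so each of
// these maps [0, 64k) onto itself without collisions.
fn permute_flag_ix(i: u32) -> u32 {
    return (i * 419u) & 0xffffu;
}

fn permute_data_ix(i: u32) -> u32 {
    return (i * 4099u) & 0xffffu;
}
```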
Slightly cleaner setup of shader source.
The previous version of the test had a race between a nonatomic store and a nonatomic load, because the load was not conditional on the atomic flag. Adding an "if" guard makes the weak behavior observed on the AMD 5700 XT go away. If those memory accesses are changed to atomics, then it's no longer a race, so the observed reorderings seem to be actual failures of the GPU relative to the Vulkan memory model. A typical run has 64M measurements (64k per dispatch, 1000 iterations), of which about 5000 observe flag = 1 but data = 0.
I just pushed a new version of the test which uses atomics for both data and flag. The previous version was broken, as there was a race between the nonatomic load and store of the data field. I believe the new version unambiguously demonstrates failures with respect to the Vulkan memory model. Since all memory access is atomic, it doesn't implicate the availability/visibility issues discussed in gpuweb/gpuweb#2228, and I also believe both the tint and naga translations of this test into SPIR-V are correct.
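For concreteness, here is a rough sketch of the shape of such an all-atomic message-passing test, in current WGSL syntax. The buffer layout, bindings, primes, barrier placement, and the pairing of writer and reader invocations are all assumptions for illustration, not the actual shader in this PR.

```wgsl
// Illustrative sketch, not the shader from this PR.
struct Buf {
    vals: array<atomic<u32>>,
}

@group(0) @binding(0) var<storage, read_write> data_buf: Buf;
@group(0) @binding(1) var<storage, read_write> flag_buf: Buf;
@group(0) @binding(2) var<storage, read_write> fail_count: Buf;

// 64k slots per dispatch (e.g. 256 workgroups of 256 invocations).
const N: u32 = 65536u;

@compute @workgroup_size(256)
fn main(@builtin(global_invocation_id) gid: vec3<u32>) {
    let ix = gid.x;

    // Writer side: publish the data word, then raise the flag, both as
    // (relaxed) atomic stores, at permuted slots.
    atomicStore(&data_buf.vals[(ix * 4099u) % N], 1u);
    storageBarrier(); // stands in for "the barriers" in the PR description
    atomicStore(&flag_buf.vals[(ix * 419u) % N], 1u);

    // Reader side: observe another invocation's flag, then its data.
    // With atomics throughout, the unguarded data read is not a data race.
    let other = (N - 1u) - ix; // hypothetical writer/reader pairing
    let flag = atomicLoad(&flag_buf.vals[(other * 419u) % N]);
    storageBarrier();
    let data = atomicLoad(&data_buf.vals[(other * 4099u) % N]);

    // Seeing flag = 1 but data = 0 is the weak outcome being counted; per
    // the PR description, with the barriers this should never be observed.
    if (flag == 1u && data == 0u) {
        atomicAdd(&fail_count.vals[0], 1u);
    }
}
```

A harness along these lines would run 1000 dispatches and read back the failure count.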