Sort experiment #31

Draft
wants to merge 17 commits into
base: main

raphlinus
Contributor

This branch contains an experiment in sorting. It is not intended to be merged, but having a draft PR gives the branch a stable identifier.

The tip contains an implementation mostly adapted from FidelityFX sort, but with a version of warp-local multi-split (WLMS) inspired by Onesweep. In all cases, subgroup operations have been replaced by workgroup shared memory. There are numerous checkpoints, including a mostly-working version without the WLMS that is closer to the original FidelityFX. Note, however, that this checkpoint exhibits failures consistent with a missing barrier. The tip appears to pass correctness tests, but none of this has been carefully validated.
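For readers unfamiliar with warp-local multi-split, here is a minimal CPU sketch of the ranking step in Rust. It is not the shader code from this branch: the names `ballot` and `warp_multisplit` are hypothetical, and a plain slice stands in for the workgroup shared memory that replaces subgroup ballots. Each lane ends up with its rank among the lanes holding the same digit, plus the number of such lanes.

```rust
/// Emulated ballot: bit `i` of the result is set when lane `i`'s predicate is true.
/// (On a GPU with subgroup support this would be a single subgroup operation.)
fn ballot(preds: &[bool]) -> u32 {
    preds
        .iter()
        .enumerate()
        .fold(0u32, |mask, (lane, &p)| mask | ((p as u32) << lane))
}

/// For each lane, returns `(rank, count)`: the number of lower lanes with the
/// same digit, and the total number of lanes with that digit.
/// `digits` holds one digit per lane (at most 32 lanes); `bits` is the digit width.
fn warp_multisplit(digits: &[u32], bits: u32) -> Vec<(u32, u32)> {
    assert!(digits.len() <= 32);
    let n = digits.len();
    // Mask of lanes that actually exist in this emulated warp.
    let active = if n == 32 { u32::MAX } else { (1u32 << n) - 1 };

    // One ballot per digit bit.
    let ballots: Vec<u32> = (0..bits)
        .map(|b| {
            let preds: Vec<bool> = digits.iter().map(|d| (d >> b) & 1 == 1).collect();
            ballot(&preds)
        })
        .collect();

    digits
        .iter()
        .enumerate()
        .map(|(lane, &d)| {
            // Intersect the ballots (or their complements) to find matching lanes.
            let mut same = active;
            for (b, &bal) in ballots.iter().enumerate() {
                same &= if (d >> b) & 1 == 1 { bal } else { !bal };
            }
            let lower_lanes = (1u32 << lane) - 1;
            ((same & lower_lanes).count_ones(), same.count_ones())
        })
        .collect()
}
```

The rank gives a stable position within the warp for each key, and the count is what the warp adds to its histogram bin for that digit.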

Sort throughput is approximately 1 G elements/s on M1 Max.

Updates dependencies to latest published crates, including wgpu 0.18 and winit 0.28.
Have written reduce and scan but not tested
The shaders are mostly written, with some TODOs, but haven't been tested.
Starting to wire up sizes, buffer bindings, etc., in preparation for actually running the pipeline.
The count stage seems to be generating correct output. Next step is wiring up sums.
It seems like reduce works.
Fix some sizing issues, seems to get to top-level scan correctly now.
The prefix sum stages seem to be generating correct output.
This seems to be a working scatter, which means that the core of the algorithm is done. Also just starting to look at performance characteristics. That's why there's a simpler count stage: the huge shared array seemed to be a problem.
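As a rough illustration of how the count, prefix sum, and scatter stages fit together, here is a CPU reference sketch in Rust. It is not the shader code: the names `radix_pass` and `radix_sort` are hypothetical, 4 bits per pass is an assumption, and the single sequential loop over the histograms stands in for the reduce and scan stages of the GPU pipeline.

```rust
const BITS: u32 = 4; // assumed digit width per pass
const BINS: usize = 1 << BITS;

fn digit(key: u32, shift: u32) -> usize {
    ((key >> shift) as usize) & (BINS - 1)
}

/// One stable radix pass over `keys`, processed in blocks of `block` keys
/// (each block plays the role of a workgroup). `block` must be nonzero.
fn radix_pass(keys: &[u32], shift: u32, block: usize) -> Vec<u32> {
    // Count: one histogram per block.
    let hist: Vec<[u32; BINS]> = keys
        .chunks(block)
        .map(|chunk| {
            let mut h = [0u32; BINS];
            for &k in chunk {
                h[digit(k, shift)] += 1;
            }
            h
        })
        .collect();

    // Prefix sum: exclusive scan in digit-major order, so all keys with
    // digit 0 come first (block by block), then digit 1, and so on.
    let mut offsets = vec![[0u32; BINS]; hist.len()];
    let mut sum = 0u32;
    for d in 0..BINS {
        for b in 0..hist.len() {
            offsets[b][d] = sum;
            sum += hist[b][d];
        }
    }

    // Scatter: each block writes its keys starting at its own offsets.
    let mut out = vec![0u32; keys.len()];
    for (b, chunk) in keys.chunks(block).enumerate() {
        let mut off = offsets[b];
        for &k in chunk {
            let d = digit(k, shift);
            out[off[d] as usize] = k;
            off[d] += 1;
        }
    }
    out
}

/// Full sort of 32-bit keys: one pass per digit, least significant first.
fn radix_sort(keys: &[u32], block: usize) -> Vec<u32> {
    let mut v = keys.to_vec();
    for pass in 0..(32 / BITS) {
        v = radix_pass(&v, pass * BITS, block);
    }
    v
}
```

Each pass must be stable for the least-significant-digit ordering to work, which is why the scatter keeps keys in block order within each digit.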
The sort pipeline is wired up, and results are close to being sorted, but there are zero elements in the output.
This sorts up to 2^16, but fails at 2^17.
Multisplit appears to work in isolation, we'll see whether that holds up.
It works (and doesn't seem to have the same problem as the scatter from FidelityFX), but seems to be a bit slower. Perhaps that can be improved (subgroups would obviously help a lot), and it's also possible it would unlock going to 8 bits per pass.
Just iterate all the keys; it's faster. This also suggests a substantial fraction of all time is going into the ballot.
We can use either 16 or 32 for warp size. The former is faster (on M1 Max). 8 is also a possibility, but then the size of the histogram array would exceed the workgroup size, so threads would need to deal with multiple histogram values.
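The sizing constraint can be made concrete with a small calculation. The sketch below assumes one histogram per emulated warp, a workgroup of 256 threads, and 4 bits per pass; none of these figures are stated in this PR, and the function names are hypothetical.

```rust
/// Shared-memory histogram slots needed when every emulated warp in the
/// workgroup keeps its own histogram of `2^bits_per_pass` bins.
fn histogram_slots(workgroup_size: u32, warp_size: u32, bits_per_pass: u32) -> u32 {
    (workgroup_size / warp_size) * (1u32 << bits_per_pass)
}

/// How many histogram slots each thread has to handle (rounded up).
fn slots_per_thread(workgroup_size: u32, warp_size: u32, bits_per_pass: u32) -> u32 {
    let slots = histogram_slots(workgroup_size, warp_size, bits_per_pass);
    (slots + workgroup_size - 1) / workgroup_size
}
```

Under those assumptions, warp sizes of 32 and 16 need 128 and 256 slots, so each thread handles at most one; a warp size of 8 needs 512 slots, or two per thread.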

Quick experiments with ELEMENTS_PER_THREAD show no gains for values other than 4.