Sort experiment #31

raphlinus · 2024-01-20T01:48:59Z

This branch contains an experiment in sorting. It is not intended to be merged, but having a draft PR gives the branch a stable identifier.

The tip contains an implementation mostly adapted from FidelityFX sort, but with a version of warp-local multi-split (WLMS) inspired by Onesweep. In all cases, subgroup operations have been replaced by workgroup shared memory. There are numerous checkpoints, including a mostly-working version without WLMS that is closer to the original FidelityFX. Note, however, that this checkpoint exhibits failures consistent with a missing barrier. The tip appears to pass correctness tests, but none of this has been carefully validated.
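For orientation, here is a minimal CPU reference for the count / scan / scatter structure that a radix sort of this family follows. It is a sketch, not the shader code from this branch: the function name `radix_sort` and the choice of 4 bits per pass are assumptions made for illustration.

```rust
/// Bits consumed per pass. 4 is an assumption for this sketch.
const BITS_PER_PASS: u32 = 4;
const BINS: usize = 1 << BITS_PER_PASS;

/// Least-significant-digit radix sort on 32-bit keys (CPU reference).
fn radix_sort(keys: &mut Vec<u32>) {
    let mut scratch = vec![0u32; keys.len()];
    for pass in 0..(32 / BITS_PER_PASS) {
        let shift = pass * BITS_PER_PASS;

        // Count: histogram of the digit selected by this pass.
        let mut counts = [0usize; BINS];
        for &k in keys.iter() {
            counts[((k >> shift) as usize) & (BINS - 1)] += 1;
        }

        // Scan: an exclusive prefix sum turns counts into start offsets.
        let mut offsets = [0usize; BINS];
        let mut sum = 0;
        for d in 0..BINS {
            offsets[d] = sum;
            sum += counts[d];
        }

        // Scatter: place keys in input order, so each pass is stable.
        for &k in keys.iter() {
            let d = ((k >> shift) as usize) & (BINS - 1);
            scratch[offsets[d]] = k;
            offsets[d] += 1;
        }

        // The scattered output becomes the input of the next pass.
        std::mem::swap(keys, &mut scratch);
    }
}
```

A reference like this can serve as the oracle when checking GPU output for a given input size.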

Sort throughput is approximately 1G elements/s on M1 Max.

Updates dependencies to latest published crates, including wgpu 0.18 and winit 0.28.

WIP

Have written reduce and scan but not tested

The shaders are mostly written, with some TODOs, but haven't been tested.

Starting to wire up sizes, buffer bindings, etc., in preparation for actually running the pipeline.

The count stage seems to be generating correct output. Next step is wiring up sums.

It seems like reduce works.

Fix some sizing issues; it seems to get to the top-level scan correctly now.

The prefix sum stages seem to be generating correct output.
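The reduce, top-level scan, and per-block scan stages mentioned in these checkpoints follow a common hierarchical pattern, which can be sketched on the CPU as below. This is an illustration under assumptions: `hierarchical_exclusive_scan` and the `block` parameter are hypothetical names, and the real shaders operate on per-digit counts rather than a flat array.

```rust
/// Exclusive prefix sum computed in three stages:
/// reduce per block, scan the block sums, then scan within each block.
/// `block` is a hypothetical partition size standing in for a workgroup.
fn hierarchical_exclusive_scan(input: &[u32], block: usize) -> Vec<u32> {
    assert!(block > 0);

    // Stage 1 (reduce): one sum per block.
    let block_sums: Vec<u32> = input
        .chunks(block)
        .map(|c| c.iter().sum::<u32>())
        .collect();

    // Stage 2 (top-level scan): exclusive scan of the block sums
    // gives the starting value for each block.
    let mut bases = Vec::with_capacity(block_sums.len());
    let mut running = 0u32;
    for s in &block_sums {
        bases.push(running);
        running += s;
    }

    // Stage 3 (per-block scan): exclusive scan inside each block,
    // offset by that block's base.
    let mut out = Vec::with_capacity(input.len());
    for (chunk, &base) in input.chunks(block).zip(&bases) {
        let mut acc = base;
        for &x in chunk {
            out.push(acc);
            acc += x;
        }
    }
    out
}
```

The result should match a plain sequential exclusive scan for any block size, which makes it easy to test each stage separately.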

This seems to be a working scatter, which means that the core of the algorithm is done. Also just starting to look at performance characteristics. That's why there's a simpler count stage: the huge shared array seemed to be a problem.

The sort pipeline is wired up, and results are close to being sorted, but there are zero elements in the output.

This sorts up to 2^16, but fails at 2^17.

Multisplit appears to work in isolation; we'll see whether that holds up.

It works (and doesn't seem to have the same problem as the scatter from FidelityFX), but seems to be a bit slower than that scatter. Perhaps that can be improved (subgroups would obviously help a lot), and it's also possible it would unlock going to 8 bits per pass.

Just iterate over all the keys; it's faster. This also suggests that a substantial fraction of all time is going into the ballot.
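One way to read "iterate over all the keys" is that each lane compares its digit against every other lane's digit in the warp, builds a match mask, and counts the matching lanes below it. The CPU sketch below illustrates that idea only; `warp_local_ranks` and `WARP_SIZE` are hypothetical names, and it is not taken from the shader in this branch.

```rust
/// Hypothetical warp width for illustration (16 and 32 are both discussed).
const WARP_SIZE: usize = 16;

/// For each lane, returns how many lower-numbered lanes hold the same digit.
/// That rank is the lane's slot among equal digits within the warp.
fn warp_local_ranks(digits: &[u32; WARP_SIZE]) -> [u32; WARP_SIZE] {
    let mut ranks = [0u32; WARP_SIZE];
    for lane in 0..WARP_SIZE {
        // On a GPU, each lane would run this inner loop itself,
        // reading the other lanes' digits from workgroup shared memory.
        let mut mask = 0u32;
        for other in 0..WARP_SIZE {
            if digits[other] == digits[lane] {
                mask |= 1 << other;
            }
        }
        // Keep only the lanes below this one, then count them.
        let lower = (1u32 << lane) - 1;
        ranks[lane] = (mask & lower).count_ones();
    }
    ranks
}
```

Adding each rank to the scanned offset for its digit gives a stable destination index, which is the role multi-split plays in the scatter.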

We can use either 16 or 32 for warp size. The former is faster (on M1 Max). 8 is also a possibility, but then the size of the histogram array would exceed the workgroup size, so threads would need to deal with multiple histogram values. Quick experiments with ELEMENTS_PER_THREAD show no gains for values other than 4.
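The histogram-size constraint can be checked with a little arithmetic. The sketch below assumes one histogram per warp with one entry per digit value, 4 bits per pass (16 bins), and a workgroup of 256 threads; all three are assumptions for illustration, and the function names are hypothetical.

```rust
/// Total per-warp histogram entries needed by one workgroup,
/// assuming one histogram per warp and one entry per digit value.
fn histogram_entries(workgroup_size: u32, warp_size: u32, bits_per_pass: u32) -> u32 {
    let bins = 1u32 << bits_per_pass;
    let warps = workgroup_size / warp_size;
    bins * warps
}

/// Histogram entries each thread must handle, rounded up.
fn entries_per_thread(workgroup_size: u32, warp_size: u32, bits_per_pass: u32) -> u32 {
    let entries = histogram_entries(workgroup_size, warp_size, bits_per_pass);
    (entries + workgroup_size - 1) / workgroup_size
}
```

Under these assumptions, a warp size of 16 gives exactly one histogram entry per thread, while a warp size of 8 gives two, so each thread would have to handle more than one value. With 16 bins the ratio depends only on the warp size, not on the workgroup size.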

raphlinus added 17 commits December 24, 2023 09:52

Update to wgpu 0.18

d928fee

Updates dependencies to latest published crates, including wgpu 0.18 and winit 0.28.

Start fidelityfx impl

646199e

WIP

Checkpoint

43e632b

Have written reduce and scan but not tested

Checkpoint shaders written

9ae99d6

The shaders are mostly written, with some TODOs, but haven't been tested.

Checkpoint

74eadb4

Starting to wire up sizes, buffer bindings, etc., in preparation for actually running the pipeline.

Checkpoint correct counting

968f52d

The count stage seems to be generating correct output. Next step is wiring up sums.

Checkpoint reduce

a833f11

It seems like reduce works.

Checkpoint working reduce and scan

48957c3

Fix some sizing issues; it seems to get to the top-level scan correctly now.

Checkpoint scans

3a1c094

The prefix sum stages seem to be generating correct output.

Checkpoint scatter

9d79a07

This seems to be a working scatter, which means that the core of the algorithm is done. Also just starting to look at performance characteristics. That's why there's a simpler count stage: the huge shared array seemed to be a problem.

Checkpoint almost working sort

063e936

The sort pipeline is wired up, and results are close to being sorted, but there are zero elements in the output.

Checkpoint sorts medium sized arrays

8a920e7

This sorts up to 2^16, but fails at 2^17.

Checkpoint non-working WLMS

3eee575

Seemingly working WLMS

f10ddca

Multisplit appears to work in isolation; we'll see whether that holds up.

Checkpoint working WLMS sort

38de0be

It works (and doesn't seem to have the same problem as the scatter from FidelityFX), but seems to be a bit slower than that scatter. Perhaps that can be improved (subgroups would obviously help a lot), and it's also possible it would unlock going to 8 bits per pass.

Checkpoint simpler ballot

00a3079

Just iterate over all the keys; it's faster. This also suggests that a substantial fraction of all time is going into the ballot.

raphlinus mentioned this pull request Mar 18, 2024

Flicking problem linebender/vello#334

Closed
