Attempt add single pass scan. #685

b0nes164 · 2024-09-10T05:29:43Z

Adds a single-pass scan with new GPU and CPU shaders. Removes the previous tree-based scans.

b0nes164 · 2024-09-19T04:17:15Z

Although tests are passing, out of an abundance of caution I want to add separate tests specifically for the correctness of the scan. These should act as a litmus test for general compatibility with single-pass monoid techniques, which would be useful as more shaders are converted to single pass.
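A scan-specific correctness test of the kind described could compare a partitioned scan against a sequential reference. The following is a minimal CPU-side sketch, not the PR's actual test: `reference_scan`, `partitioned_scan`, and `affine_combine` are hypothetical names, and the affine monoid is chosen only because it is associative but not commutative, so an implementation that combines prefixes in the wrong order fails.

```rust
/// Sequential reference: exclusive prefix scan over any monoid.
fn reference_scan<T: Copy>(input: &[T], identity: T, combine: impl Fn(T, T) -> T) -> Vec<T> {
    let mut acc = identity;
    let mut out = Vec::with_capacity(input.len());
    for &x in input {
        out.push(acc);
        acc = combine(acc, x);
    }
    out
}

/// Hypothetical stand-in for the implementation under test: each partition
/// scans locally from the identity, then the aggregate of all preceding
/// partitions is combined in on the left. This relies on associativity.
fn partitioned_scan<T: Copy>(
    input: &[T],
    identity: T,
    part_size: usize,
    combine: impl Fn(T, T) -> T,
) -> Vec<T> {
    assert!(part_size > 0, "partition size must be non-zero");
    let mut out = Vec::with_capacity(input.len());
    let mut prefix = identity; // aggregate of all earlier partitions
    for chunk in input.chunks(part_size) {
        let mut local = identity;
        for &x in chunk {
            out.push(combine(prefix, local));
            local = combine(local, x);
        }
        prefix = combine(prefix, local);
    }
    out
}

/// Illustrative non-commutative monoid: composition of affine maps
/// x -> a*x + b under wrapping arithmetic. The identity is (1, 0).
fn affine_combine(f: (u32, u32), g: (u32, u32)) -> (u32, u32) {
    (
        f.0.wrapping_mul(g.0),
        f.1.wrapping_mul(g.0).wrapping_add(g.1),
    )
}
```

Running the comparison across several partition sizes, including sizes that do not divide the input length, exercises the partition boundaries where single-pass techniques are most likely to go wrong.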

Change fallback reduce pattern to match scan pattern so last thread in workgroup has the correct aggregate in registers without need for additional barrier.
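To illustrate why a scan-shaped reduction leaves the aggregate in the last thread, here is a CPU simulation of a Hillis-Steele style inclusive scan. This is a sketch under an assumption: the thread does not show which scan pattern the shader uses, and `hillis_steele_inclusive` is a hypothetical name. Each slice element stands in for one thread's register.

```rust
/// Simulates a workgroup-wide inclusive scan where element `i` plays the
/// role of thread `i`. After the final step, the last element holds the
/// aggregate of the whole workgroup.
fn hillis_steele_inclusive(values: &mut [u32]) {
    let n = values.len();
    let mut offset = 1;
    while offset < n {
        // Snapshot of the previous step: stands in for the barrier that
        // separates reads from writes on a GPU.
        let prev = values.to_vec();
        for i in offset..n {
            values[i] = prev[i].wrapping_add(prev[i - offset]);
        }
        offset *= 2;
    }
}
```

In a tree reduction the total typically ends up in thread 0 instead, so the last thread would need a further broadcast to obtain it.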

raphlinus

Ok, I think I have my head around this now. It feels like this is going to be a workable pattern. I think shared memory usage can be reduced, and there are some stylistic things that can be improved.

My overriding concern is the array structure for the monoid - we somehow have the worst of both worlds where the fields aren't named and there's also repetition. I think it can be improved by leaning in more heavily to arrays.
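One way to lean into arrays while keeping the fields named is to store the monoid as a fixed-size array and give each lane an index constant. The sketch below is in Rust rather than WGSL and is not the PR's code: the lane names are hypothetical, `PATH_MEMBERS = 5` is inferred from the later "literal 5" remark, and element-wise wrapping addition is assumed as the combine operation.

```rust
/// Number of u32 lanes in the monoid (assumed to be 5).
const PATH_MEMBERS: usize = 5;

// Hypothetical lane names, for illustration only.
const TRANS_IX: usize = 0;
const PATHSEG_IX: usize = 1;
const PATHSEG_OFFSET: usize = 2;
const STYLE_IX: usize = 3;
const PATH_IX: usize = 4;

#[derive(Clone, Copy, Debug, Default, PartialEq, Eq)]
struct PathMonoid {
    s: [u32; PATH_MEMBERS],
}

impl PathMonoid {
    /// Combines two monoids lane by lane. A single loop replaces the
    /// per-field repetition a struct with five named members would need.
    fn combine(self, other: Self) -> Self {
        let mut out = self;
        for (o, x) in out.s.iter_mut().zip(other.s.iter()) {
            *o = o.wrapping_add(*x);
        }
        out
    }
}
```

Call sites can then write `m.s[PATH_IX]` where they need a specific field, while scan and reduce code iterates over all lanes uniformly.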

I didn't try this, but am assuming it works. Also, the main focus of my review is on the scan wgsl shader. I did skim the rest but will rely on other reviewers for any fine details I may have missed.

I'm not approving at this point, but think it can be landed soon.

DJMcNab

Some points from browsing

raphlinus

Thanks for addressing my comments. I also verified that it works (on Apple M1) with the test scenes. I think it's ready to go in, though I didn't go over it with a fine-toothed comb.

raphlinus · 2024-09-23T15:13:35Z

vello_shaders/shader/pathtag_scan_csdldf.wgsl

+                    //so try unlocking, else, keep looking back
+                    var all_complete = inc_complete.s[0];
+                    for (var i = 1u; i < PATH_MEMBERS; i += 1u) {
+                        all_complete = all_complete && inc_complete.s[i];


I think you can write this &=, but it's a very minor stylistic point.
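For comparison, the same accumulation can be written both ways in Rust, where `&=` is also defined for `bool`. This is only an analogy to the WGSL under review; the function names are hypothetical and `PATH_MEMBERS = 5` is assumed.

```rust
const PATH_MEMBERS: usize = 5;

/// Accumulates with `&&`, as in the snippet under review.
fn all_complete_and(flags: &[bool; PATH_MEMBERS]) -> bool {
    let mut all = flags[0];
    for i in 1..PATH_MEMBERS {
        all = all && flags[i];
    }
    all
}

/// Accumulates with the compound assignment `&=`.
fn all_complete_and_assign(flags: &[bool; PATH_MEMBERS]) -> bool {
    let mut all = flags[0];
    for i in 1..PATH_MEMBERS {
        all &= flags[i];
    }
    all
}
```

The results are identical. The only semantic difference is that `&&` short-circuits while `&=` always evaluates its right-hand side, which does not matter when that side is a plain array read.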

DJMcNab · 2024-09-23T15:24:52Z

To reduce the risk from this, I'd propose that we land this after we release version 0.3.0. I'm hopeful that we can do this after office hours this week, but we'll see.

b0nes164 · 2024-09-24T21:03:04Z

Following up on my conversation with Raph, FXC is still blocking on this:

In the looped version FXC is giving the following errors:

vello.pathtag_scan_csdldf(345,45-67): warning X3550: array reference cannot be used as an l-value; not natively addressable, forcing loop to unroll
vello.pathtag_scan_csdldf(310,21-31): error X3511: unable to unroll loop, loop does not appear to terminate in a timely manner (101 iterations) or unrolled loop is too large, use the [unroll(n)] attribute to force an exact higher number
vello.pathtag_scan_csdldf(299,9-19): error X3511: forced to unroll loop, but unrolling failed.

This is because the loop inside the lookback/fallback block cannot be unrolled.

Going back to the pre-looping implementation, FXC fails to compile due to uniformity analysis:

Internal error: FXC D3DCompile error (0x80004005): pathtag_scan_csdldf(484,29-61): error X4026: thread sync operation must be in non-varying flow control, due to a potential race condition this sync is illegal, consider adding a sync after reading any values controlling shader execution at this point
pathtag_scan_csdldf(548,26-43): error X4026: this variable dependent on potentially varying data: local_id
pathtag_scan_csdldf(664,30-76): error X4026: this variable dependent on potentially varying data: _e992
pathtag_scan_csdldf(451,30-36): error X4026: this memory access dependent on potentially varying data
pathtag_scan_csdldf(452,22-36): error X4026: this variable dependent on potentially varying data: _e565
pathtag_scan_csdldf(473,29-40): error X4026: this variable dependent on potentially varying data: loop_init_1
pathtag_scan_csdldf(479,30-47): error X4026: this variable dependent on potentially varying data: _e609
pathtag_scan_csdldf(472,21-31): error X4026: this loop dependent on potentially varying data
pathtag_scan_csdldf(473,29-40): error X4026: this variable dependent on potentially varying data: loop_init_1

I think it considers the barrier on line 107 illegal because the control flow is dependent on shared memory.

raphlinus · 2024-09-24T21:15:50Z

Ah, very interesting. Looking at the code, I'm surprised this runs at all; the WGSL uniformity analysis should reject it (if it doesn't, that's a bug in naga). I haven't tried it, but I'd expect Chrome/tint to also reject it.

Fortunately, it should be fairly easily fixable. Instead of using sh_lock directly in the loop predicate, do let lock = workgroupUniformLoad(&sh_lock) and predicate off that (this intrinsic also contains a barrier, so the separate workgroupBarrier can be eliminated).

I don't have my head around the loop unrolling yet. This is just PATH_MEMBERS, no? In which case you'd think the compiler would be able to unroll it a known number of times. One thing I noticed: we should now be able to use const instead of let (when the shaders were originally written, naga didn't support const), but I'd be surprised if that made a difference at the FXC level. It's possible that going back to manually unrolling the loops is an acceptable workaround.

One other thing I'd try is putting the literal 5 in the loop bounds.

I also have a theory why FXC might be rejecting this. I believe that naga compiles a for loop into an infinite loop with conditional break, rather than a straightforward HLSL for loop. That might be enough to put it out of reach of FXC's loop termination heuristic. Potentially that's worth raising as a naga issue - in addition to certain valid programs being rejected, I have a bit of evidence (from Metal) that it results in suboptimal code generation.
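The two loop shapes in that theory can be written side by side. This Rust sketch only illustrates the structural difference; it is not naga's actual HLSL output, and the function names are hypothetical.

```rust
/// Counted loop: the trip count is visible directly in the loop header.
fn sum_for(data: &[u32; 5]) -> u32 {
    let mut total = 0u32;
    for i in 0..5 {
        total = total.wrapping_add(data[i]);
    }
    total
}

/// Same computation as an infinite loop with a conditional break. A
/// compiler has to recover the trip count from the loop body instead.
fn sum_loop_break(data: &[u32; 5]) -> u32 {
    let mut total = 0u32;
    let mut i = 0usize;
    loop {
        if !(i < 5) {
            break;
        }
        total = total.wrapping_add(data[i]);
        i += 1;
    }
    total
}
```

Both functions are equivalent, but only the first presents the bound in a form that a simple loop-termination heuristic can read off without analysis.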

b0nes164 · 2024-09-24T22:38:54Z

After changing sh_lock to use workgroupUniformLoad, the shader does indeed pass uniformity analysis. I also had to change the read on line 170.

As for the loop unrolling, I think you're dead on with regard to the naga transpilation. Even after changing to const, subbing literals into the loops, and changing the spin_count loop to a fixed number, it's still complaining about loop unrolling, and in places that should be fine, like the initial local scan. Using the naga CLI, I confirmed that the loops do indeed compile to while loops.

With regard to Metal, I had exactly the same performance issues. I think I was getting something like 60% of the speed from wgsl->naga->msl as opposed to glsl->glslangvalidator->SPIR-V->spirv_cross->msl, and that's after I unrolled the device-level loads/stores by hand.

b0nes164 added 7 commits September 9, 2024 22:01

Add single pass scan. Remove tree reductions.

24dbaa7

Update buffer name for clarity.

d16c3b2

Change scan from struct to fixed size array to allow dynamic indexing.

7c3e604

Remove unused workgroup and buffer sizes.

d26945b

Merge branch 'main' into SinglePass

1356276

Cargo fmt fix

383d37e

Fix incorrect transcription of function.

28193eb

waywardmonkeys reviewed Sep 19, 2024

vello_shaders/shader/pathtag_scan_csdldf.wgsl

vello/src/render.rs

DJMcNab reviewed Sep 19, 2024

Style fixes

76235ca

b0nes164 force-pushed the SinglePass branch from 76443fd to 76235ca September 19, 2024 16:10

b0nes164 added 2 commits September 19, 2024 10:23

Fix bug in fallback, remove unused permutations

cf24cde

Change fallback reduce pattern to match scan pattern so last thread in workgroup has the correct aggregate in registers without need for additional barrier.

Further shader style fixes.

a9786e1

raphlinus reviewed Sep 19, 2024

b0nes164 added 2 commits September 19, 2024 19:54

Refactor shader to use loops. Change reduce_tag to marshaling function.

b7652e4

Improve lookback logic, update pathtag_scan_single.rs to pathtag_scan.rs

99ac32b

DJMcNab reviewed Sep 20, 2024

vello_shaders/shader/pathtag_scan_csdldf.wgsl

b0nes164 added 4 commits September 20, 2024 08:29

Style fixes, fix u32 bug, force fallback to check correctness.

dc1e4b4

Fix incorrect transcription of fallback function.

5ca145b

Rename state_wrapper during fallback for clarity

54f1472

Change fallback broadcast to register based.

296bbbf

DJMcNab requested a review from raphlinus September 23, 2024 10:21

raphlinus approved these changes Sep 23, 2024
