Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Attempt add single pass scan. #685

Open
wants to merge 16 commits into
base: main
Choose a base branch
from
Open

Conversation

b0nes164
Copy link

Adds single pass scan with new GPU and CPU shader. Removes previous tree based scans.

@b0nes164
Copy link
Author

Although tests are passing, out of an abundance of caution, I want to add seperate tests specifically for the correctness of the scan. These should act as a litmus test for general compatibility with single-pass monoid techniques, which would be useful as more shaders are converted to single pass.

vello_shaders/shader/pathtag_scan_csdldf.wgsl Outdated Show resolved Hide resolved
vello/src/render.rs Outdated Show resolved Hide resolved
vello/src/shaders.rs Show resolved Hide resolved
vello_shaders/shader/pathtag_scan_csdldf.wgsl Outdated Show resolved Hide resolved
vello_shaders/shader/pathtag_scan_csdldf.wgsl Outdated Show resolved Hide resolved
vello_shaders/shader/shared/pathtag.wgsl Outdated Show resolved Hide resolved
vello_shaders/src/cpu/pathtag_scan_single.rs Outdated Show resolved Hide resolved
Change fallback reduce pattern to match scan pattern so last thread in workgroup has the correct aggregate in registers without need for additional barrier.
Copy link
Contributor

@raphlinus raphlinus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I think I have my head around this now. It feels like this is going to be a workable pattern. I think shared memory usage can be reduced, and there are some stylistic things that can be improved.

My overriding concern is the array structure for the monoid - we somehow have the worst of both worlds where the fields aren't named and there's also repetition. I think it can be improved by leaning in more heavily to arrays.

I didn't try this, but am assuming it works. Also, the main focus of my review is on the scan wgsl shader. I did skim the rest but will rely on other reviewers for any fine details I may have missed.

I'm not approving at this point, but think it can be landed soon.

vello_shaders/shader/shared/pathtag.wgsl Outdated Show resolved Hide resolved
vello_shaders/shader/pathtag_scan_csdldf.wgsl Outdated Show resolved Hide resolved
vello_shaders/shader/pathtag_scan_csdldf.wgsl Outdated Show resolved Hide resolved
vello_shaders/shader/pathtag_scan_csdldf.wgsl Outdated Show resolved Hide resolved
vello_shaders/shader/pathtag_scan_csdldf.wgsl Show resolved Hide resolved
vello_shaders/shader/pathtag_scan_csdldf.wgsl Outdated Show resolved Hide resolved
vello_shaders/shader/pathtag_scan_csdldf.wgsl Outdated Show resolved Hide resolved
Copy link
Member

@DJMcNab DJMcNab left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some points from browsing

vello_shaders/shader/pathtag_scan_csdldf.wgsl Outdated Show resolved Hide resolved
vello_shaders/shader/pathtag_scan_csdldf.wgsl Outdated Show resolved Hide resolved
Copy link
Contributor

@raphlinus raphlinus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for addressing my comments. I also verified that it works (on Apple M1) with the test scenes. I think it's ready to go in, though I didn't go over it with a fine-toothed comb.

//so try unlocking, else, keep looking back
var all_complete = inc_complete.s[0];
for (var i = 1u; i < PATH_MEMBERS; i += 1u) {
all_complete = all_complete && inc_complete.s[i];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can write this &=, but it's a very minor stylistic point.

@DJMcNab
Copy link
Member

DJMcNab commented Sep 23, 2024

To reduce the risk from this, I'd propose that we land this after we release version 0.3.0. I'm hopeful that we can do this after office hours this week, but we'll see.

@b0nes164
Copy link
Author

b0nes164 commented Sep 24, 2024

Following up on my conversation with Raph, FXC is still blocking on this:

In the looped version FXC is giving the following errors:

vello.pathtag_scan_csdldf(345,45-67): warning X3550: array reference cannot be used as an l-value; not natively addressable, forcing loop to unroll
vello.pathtag_scan_csdldf(310,21-31): error X3511: unable to unroll loop, loop does not appear to terminate in a timely manner (101 iterations) or unrolled loop is too large, use the [unroll(n)] attribute to force an exact higher number
vello.pathtag_scan_csdldf(299,9-19): error X3511: forced to unroll loop, but unrolling failed.

This is because the loop inside the lookback/fallback block cannot be unrolled.

Going back to the pre-looping implementation, FXC fails to compile due to uniformity analysis:

Internal error: FXC D3DCompile error (0x80004005): pathtag_scan_csdldf(484,29-61): error X4026: thread sync operation must be in non-varying flow control, due to a potential race condition this sync is illegal, consider adding a sync after reading any values controlling shader execution at this point
pathtag_scan_csdldf(548,26-43): error X4026: this variable dependent on potentially varying data: local_id
pathtag_scan_csdldf(664,30-76): error X4026: this variable dependent on potentially varying data: _e992
pathtag_scan_csdldf(451,30-36): error X4026: this memory access dependent on potentially varying data
pathtag_scan_csdldf(452,22-36): error X4026: this variable dependent on potentially varying data: _e565
pathtag_scan_csdldf(473,29-40): error X4026: this variable dependent on potentially varying data: loop_init_1
pathtag_scan_csdldf(479,30-47): error X4026: this variable dependent on potentially varying data: _e609
pathtag_scan_csdldf(472,21-31): error X4026: this loop dependent on potentially varying data
pathtag_scan_csdldf(473,29-40): error X4026: this variable dependent on potentially varying data: loop_init_1

I think it considers the barrier on line 107 illegal because the control flow is dependent on shared memory.

@raphlinus
Copy link
Contributor

raphlinus commented Sep 24, 2024

Ah, very interesting. Looking at the code, I'm surprised this runs at all, the WGSL uniformity analysis should reject it (if it doesn't, that's a bug in naga). I haven't tried it, but I'd expect Chrome/tint to also reject it.

Fortunately, it should be fairly easily fixable. Instead of using sh_lock directly in the loop predicate, do let lock = workgroupUniformLoad(&sh_lock) and predicate off that (this intrinsic also contains a barrier, so the separate workgroupBarrier can be eliminated).

I don't have my head around the loop unrolling yet. This is just PATH_MEMBERS, no? In which case you'd think the compiler would be able to unroll it a known number of times. One thing I noticed: we should now be able to use const instead of let (when the shaders were originally written, naga didn't support const), but I'd be surprised if that made a difference at the FXC level. It's possible that going back to manually unrolling the loops is an acceptable workaround.

One other thing I'd try is putting the literal 5 in the loop bounds.

I also have a theory why FXC might be rejecting this. I believe that naga compiles a for loop into an infinite loop with conditional break, rather than a straightforward HLSL for loop. That might be enough to put it out of reach of FXC's loop termination heuristic. Potentially that's worth raising as a naga issue - in addition to certain valid programs being rejected, I have a bit of evidence (from Metal) that it results in suboptimal code generation.

@b0nes164
Copy link
Author

After changing sh_lock to use workgroupUniformLoad, the shader does indeed pass uniformity analysis. I also had to change the read on line 170.

As for the loop unrolling, I think you're dead on with regard to the naga transpilation. Even after changing to const, subbing literals into the loops, and changing the spin_count loop to a fixed number, it's still complaining about loop unrolling, and in places that should be fine, like the initial local scan. Using the naga cli, it does indeed compile to while loops.

With regard to metal, I had the same exact performance issues. I think I was getting like 60% of the speed from wgsl->naga->msl as opposed to glsl->glslangvalidator->SPIR-V->spirv_cross->msl, and that's after I unrolled the device level loads/stores by hand.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants