Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixes to winding number accumulation #391

Merged
merged 9 commits into from
Oct 30, 2023
Merged

Fixes to winding number accumulation #391

merged 9 commits into from
Oct 30, 2023

Conversation

raphlinus
Copy link
Contributor

This patch fixes rendering artifacts related to winding number accumulation. The first problem is that 4 bits per pixel is not adequate for all test scenes (the cardioid one in particular exercises this). The second problem is that we didn't have a case for even-odd.

This patch is expected to regress performance for the nonzero fill rule, but improve it for even-odd. Some of the performance loss is decreased occupancy because of a larger shared memory footprint for the accumulation, some is because of increased atomic traffic. We should do some performance work as a followup, first characterizing the performance impact, then doing some mitigation. One promising idea is to specialize further based on the number of segments in the slice, using 4 bit accumulation when we can guarantee that won't overflow.

This will have a (small) merge conflict with #382. If that's landed first, I'll happily update this one.

Fixes #380 and #390.

Changes the winding number accumulation in msaa8 mode to 8 bits per sample. This is prototype code and currently breaks the msaa16 mode; it is intended to diagnose whether the artifacts are strictly due to overflow, and to point the way to a real implementation.

Prefix sums in both x and y direction are a little cleaner, avoiding a race (not UB because it's atomics).
This patch makes the msaa16 mode work again, using 8 bit accumulation of winding numbers. It could be merged to fix the artifacts in the cardioid example. Also, it's worth doing some evaluation to see how much performance slowdown there is.

As future work, we probably want to be adaptive and use 8 bit accumulation when needed. If the performance hit from reduced occupancy due to the increased shared memory usage is significant, then we could consider other mitigations, including downgrading to msaa8 when overflow is possible.
Create a specialized version of the fill function for the even-odd fill rule. The logic is simpler (and faster) because winding number accumulation can happen in one bit.

There's a bunch of code duplication which can be cleaned up.

It's expected this will have a merge conflict with #382. If that's merged first, I'll happily fix this one.
Copy link
Collaborator

@armansito armansito left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the fixes. I think the new approach makes sense but as we discussed today, a lot of the logic is very dense/involved and could use some clarification.

LGTM otherwise.

let even_odd = (fill.size_and_rule & 1u) != 0u;
if even_odd {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please document your reasoning around divergence here: that PTCL commands are expected to get processed uniformly.

atomicStore(&sh_winding_y[th_ix], 0x88888888u);
if th_ix < 64u {
if th_ix < 4u {
atomicStore(&sh_winding_y[th_ix], 0x80808080u);
Copy link
Collaborator

@armansito armansito Oct 26, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes sense to me but maybe briefly document the SWAR packing here for posterity, i.e. this stores 4 8-bit winding numbers.

IWBN to document the counting scheme and representation here (based on our conversation today, this is using a biased offset binary representation but this wasn't obvious to me, as I initially assumed 2's complement and couldn't make sense of the initial values).

let expected_zero = (((packed_w >> (minor * 4u)) + wind_y) & 0xfu) - u32(fill.backdrop);
if expected_zero >= 16u {
let minor = i; // assumes PIXELS_PER_THREAD == 4
let expected_zero = (((packed_w >> (minor * 8u)) + wind_y) & 0xffu) - u32(fill.backdrop);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please document and explain the logic behind expected_zero.

shader/fine.wgsl Outdated
// This is 8 winding numbers packed to a u32, 4 bits per sample
var<workgroup> sh_winding: array<atomic<u32>, 32u>;
// This is 4 winding numbers packed to a u32, 8 bits per sample
var<workgroup> sh_winding: array<atomic<u32>, 64u>;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As we discussed today, please document not only the packing pattern but what each shared buffer semantically represents in the hierarchy of counts.

#endif
#ifdef msaa16
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please try to document your invariants around the SWAR math as best as you can. Perhaps even add a note that refers the reader to the even-odd version as it's easier to understand.

@@ -73,7 +73,7 @@ enum AaConfig {

/// Configuration of antialiasing. Currently this is static, but could be switched to
/// a launch option or even finer-grained.
const ANTIALIASING: AaConfig = AaConfig::Area;
const ANTIALIASING: AaConfig = AaConfig::Msaa16;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to make Msaa16 the default yet? Perhaps this should remain as set to area for now as we may not want to get hit by the performance cost of increased workgroup memory in Msaa16 mode.

That said, I think I would like to make this configurable at runtime. I think a reasonable approach is to introduce two levels of settings: a RendererOptions setting that tells the library which pipeline variants to compile/enable and a RenderParams setting that selects a particular AA mode (which can eventually have a "dynamic" option if we decide to support that). That way we can run the test suite in various modes without having to re-compile the code.

If you agree with that abstraction, I'd be happy to wire that up in a follow-up.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's good making Msaa16 the default at least temporarily, if for no other reason than to exercise it. I also agree that it should be configurable, but the mechanism to do that is nontrivial (thanks for agreeing to take it on!) and seems like it can be done separately from this PR.

I did my best to document some of the strange bit magic used in the algorithm. I also did just a bit of renaming to make things simpler, and for the mask expansion replaced `|` with `^` because it's easier to understand in terms of carry-less multiplication (and I expect performance to be identical).
Copy link
Collaborator

@armansito armansito left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you very much for the documentation!

shader/fine.wgsl Outdated
var<workgroup> sh_winding: array<atomic<u32>, 64u>;
// This array contains winding numbers of multiple sample points within
// a pixel, relative to the winding number of the top left corner of the
// pixels. The encoding and packing is the same as `sh_winding_y`.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pixels. The encoding...

of the pixel (singular) or is this relative to a group of pixels?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went back and forth between singular and plural. Fixed.

@raphlinus raphlinus merged commit 3fab38a into main Oct 30, 2023
4 checks passed
@raphlinus raphlinus deleted the swar8 branch October 30, 2023 16:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[MSAA] Artifacts in cardioid with 8xMSAA and 16xMSAA
2 participants