-
Notifications
You must be signed in to change notification settings - Fork 136
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fixes to winding number accumulation #391
Conversation
Changes the winding number accumulation in msaa8 mode to 8 bits per sample. This is prototype code and currently breaks the msaa16 mode; it is intended to diagnose whether the artifacts are strictly due to overflow, and to point the way to a real implementation. Prefix sums in both x and y direction are a little cleaner, avoiding a race (not UB because it's atomics).
This patch makes the msaa16 mode work again, using 8 bit accumulation of winding numbers. It could be merged to fix the artifacts in the cardioid example. Also, it's worth doing some evaluation to see how much performance slowdown there is. As future work, we probably want to be adaptive and use 8 bit accumulation when needed. If the performance hit from reduced occupancy due to the increased shared memory usage is significant, then we could consider other mitigations, including downgrading to msaa8 when overflow is possible.
Create a specialized version of the fill function for the even-odd fill rule. The logic is simpler (and faster) because winding number accumulation can happen in one bit. There's a bunch of code duplication which can be cleaned up. It's expected this will have a merge conflict with #382. If that's merged first, I'll happily fix this one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the fixes. I think the new approach makes sense but as we discussed today, a lot of the logic is very dense/involved and could use some clarification.
LGTM otherwise.
let even_odd = (fill.size_and_rule & 1u) != 0u; | ||
if even_odd { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please document your reasoning around divergence here: that PTCL commands are expected to get processed uniformly.
atomicStore(&sh_winding_y[th_ix], 0x88888888u); | ||
if th_ix < 64u { | ||
if th_ix < 4u { | ||
atomicStore(&sh_winding_y[th_ix], 0x80808080u); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This makes sense to me but maybe briefly document the SWAR packing here for posterity, i.e. this stores 4 8-bit winding numbers.
IWBN to document the counting scheme and representation here (based on our conversation today, this is using a biased offset binary representation but this wasn't obvious to me, as I initially assumed 2's complement and couldn't make sense of the initial values).
let expected_zero = (((packed_w >> (minor * 4u)) + wind_y) & 0xfu) - u32(fill.backdrop); | ||
if expected_zero >= 16u { | ||
let minor = i; // assumes PIXELS_PER_THREAD == 4 | ||
let expected_zero = (((packed_w >> (minor * 8u)) + wind_y) & 0xffu) - u32(fill.backdrop); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please document and explain the logic behind expected_zero
.
shader/fine.wgsl
Outdated
// This is 8 winding numbers packed to a u32, 4 bits per sample | ||
var<workgroup> sh_winding: array<atomic<u32>, 32u>; | ||
// This is 4 winding numbers packed to a u32, 8 bits per sample | ||
var<workgroup> sh_winding: array<atomic<u32>, 64u>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As we discussed today, please document not only the packing pattern but what each shared buffer semantically represents in the hierarchy of counts.
#endif | ||
#ifdef msaa16 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please try to document your invariants around the SWAR math as best as you can. Perhaps even add a note that refers the reader to the even-odd version as it's easier to understand.
@@ -73,7 +73,7 @@ enum AaConfig { | |||
|
|||
/// Configuration of antialiasing. Currently this is static, but could be switched to | |||
/// a launch option or even finer-grained. | |||
const ANTIALIASING: AaConfig = AaConfig::Area; | |||
const ANTIALIASING: AaConfig = AaConfig::Msaa16; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want to make Msaa16
the default yet? Perhaps this should remain as set to area for now as we may not want to get hit by the performance cost of increased workgroup memory in Msaa16 mode.
That said, I think I would like to make this configurable at runtime. I think a reasonable approach is to introduce two levels of settings: a RendererOptions
setting that tells the library which pipeline variants to compile/enable and a RenderParams
setting that selects a particular AA mode (which can eventually have a "dynamic" option if we decide to support that). That way we can run the test suite in various modes without having to re-compile the code.
If you agree with that abstraction, I'd be happy to wire that up in a follow-up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's good making Msaa16
the default at least temporarily, if for no other reason than to exercise it. I also agree that it should be configurable, but the mechanism to do that is nontrivial (thanks for agreeing to take it on!) and seems like it can be done separately from this PR.
I did my best to document some of the strange bit magic used in the algorithm. I also did just a bit of renaming to make things simpler, and for the mask expansion replaced `|` with `^` because it's easier to understand in terms of carry-less multiplication (and I expect performance to be identical).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you very much for the documentation!
shader/fine.wgsl
Outdated
var<workgroup> sh_winding: array<atomic<u32>, 64u>; | ||
// This array contains winding numbers of multiple sample points within | ||
// a pixel, relative to the winding number of the top left corner of the | ||
// pixels. The encoding and packing is the same as `sh_winding_y`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pixels. The encoding...
of the pixel (singular) or is this relative to a group of pixels?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I went back and forth between singular and plural. Fixed.
This patch fixes rendering artifacts related to winding number accumulation. The first problem is that 4 bits per pixel is not adequate for all test scenes (the cardioid one in particular exercises this). The second problem is that we didn't have a case for even-odd.
This patch is expected to regress performance for the nonzero fill rule, but improve it for even-odd. Some of the performance loss is decreased occupancy because of a larger shared memory footprint for the accumulation, some is because of increased atomic traffic. We should do some performance work as a followup, first characterizing the performance impact, then doing some mitigation. One promising idea is to specialize further based on the number of segments in the slice, using 4 bit accumulation when we can guarantee that won't overflow.
This will have a (small) merge conflict with #382. If that's landed first, I'll happily update this one.
Fixes #380 and #390.