
Add CPU shaders #374

Merged: 3 commits merged into main from cpu_shader on Oct 11, 2023
Conversation

raphlinus (Contributor):

This commit adds a full pipeline of CPU shaders for the compute stage of the pipeline (not counting fine rasterization). It's based on what's in the multi branch, but also ports bounding box fixes and other cleanups.

Currently the selection of shaders is all or nothing (based on the use-cpu flag), but will probably become more fine-grained.

Going forward, the intent is to keep the CPU and GPU shaders in sync; doing so has been quite valuable in diagnosing problems in the pipeline.

eliasnaur (Collaborator):

Can the CPU shaders be put under the Unlicense too, like their GPU counterparts? I'm working toward another port of Vello, into the Gio graphics pipeline, and one of the features we absolutely need is a CPU fallback.

> not counting fine rasterization

May I ask what the plans are for the fine rasterization stage? Will Vello never run on platforms without GPU support for compute?

DJMcNab (Member) left a comment:

A few comments; I've not read through the shader code.

crates/encoding/src/path.rs (outdated, resolved):
```rust
}

pub enum TypedBufGuardMut<'a, T: ?Sized> {
    Slice(&'a mut T),
```
Member:

Is this variant ever constructed?

Contributor (Author):

I don't think so; I'd have to check.

```rust
/// Second, the actual mapping to CPU shaders is not really specific to
/// the engine, and should be split out into a back-end agnostic struct.
pub fn install_cpu_shaders(&mut self, engine: &mut WgpuEngine) {
    engine.set_cpu_shader(self.pathtag_reduce, cpu_shader::pathtag_reduce);
```
Member:

Ooh neat - this is much less noisy

```rust
impl Transform {
    pub fn apply(&self, p: Vec2) -> Vec2 {
        let z = self.0;
        let x = z[0] * p.x + z[2] * p.y + z[4];
```
Member:

I have not checked these operations, but am reminded of #283.
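For context, the snippet above can be read as an ordinary 2D affine transform. The following is a hedged sketch, not the actual vello code: the `Vec2` definition, the `y` computation, and the coefficient layout (the usual `[a, b, c, d, e, f]` column-major affine convention) are all assumptions inferred from the four quoted lines.

```rust
// Hedged sketch completing the quoted `apply`: the six coefficients are
// assumed to be the standard affine matrix [a, b, c, d, e, f], so the y
// component mirrors the quoted x computation with the odd indices.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Vec2 {
    pub x: f32,
    pub y: f32,
}

pub struct Transform(pub [f32; 6]);

impl Transform {
    pub fn apply(&self, p: Vec2) -> Vec2 {
        let z = self.0;
        let x = z[0] * p.x + z[2] * p.y + z[4];
        let y = z[1] * p.x + z[3] * p.y + z[5];
        Vec2 { x, y }
    }
}

fn main() {
    // Identity scale plus a (10, 20) translation maps (1, 2) to (11, 22).
    let t = Transform([1.0, 0.0, 0.0, 1.0, 10.0, 20.0]);
    let q = t.apply(Vec2 { x: 1.0, y: 2.0 });
    assert_eq!(q, Vec2 { x: 11.0, y: 22.0 });
}
```

Under this reading, the concern in #283 would be about the floating-point behavior of the multiply-adds, not the matrix layout itself.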

Adds the Unlicense to the SPDX line, and also relaxes the 64k (or any) limit on the number of path tags in the CPU case.

The CPU shaders don't exactly match the GPU in the use_large_path_scan case, but they do work.
raphlinus (Contributor, Author):

OK, I've fixed the crashes with large path tag counts (though it's a bit ugly: the CPU doesn't quite match the GPU, and making it match would take a lot of repetitive code that wouldn't be required if Metal didn't suck so hard). I've also improved one of the comments.

There's a draft of fine rasterization in the PR; I just haven't finished it. I should warn that while I think these can be adapted into reasonably performant fallback implementations, they're not currently optimized for performance; that might be downstream work, which I think makes sense after the pipeline has stabilized more. I think we may well do CPU fine rasterization, but there are a lot of details that are unclear, including whether we want to do it multithreaded and/or use SIMD.

armansito (Collaborator) left a comment:

Thank you for making progress with the big CPU port. I reviewed all the shaders as carefully as I could and left a number of comments/suggestions/questions.

Aside from some questions I have around the subtle differences from the WGSL, I'm OK with merging this.

```rust
let draw_monoid = draw_monoids[element_ix];
let mut clip_bbox = [-1e9, -1e9, 1e9, 1e9];
if draw_monoid.clip_ix > 0 {
    clip_bbox = clip_bbox_buf[draw_monoid.clip_ix as usize - 1];
```
Collaborator:

Might as well address the TODO in binning.wgsl and add the assertion here: `assert!(draw_monoid.clip_ix - 1 < config.layout.n_clip)`.
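A minimal sketch of what the suggested assertion could look like in context (the function shape, the `n_clip` parameter standing in for `config.layout.n_clip`, and the helper name are assumptions, not the actual shader code; only the default bbox and indexing come from the snippet above):

```rust
// Hedged sketch of the suggested bounds check before the clip bbox read.
fn load_clip_bbox(clip_ix: u32, n_clip: u32, clip_bbox_buf: &[[f32; 4]]) -> [f32; 4] {
    // Default: effectively unbounded bbox, as in the quoted snippet.
    let mut clip_bbox = [-1e9, -1e9, 1e9, 1e9];
    if clip_ix > 0 {
        // The suggested assertion: catch a stale or corrupt clip_ix
        // before it turns into a silent out-of-bounds read.
        assert!(clip_ix - 1 < n_clip, "clip_ix {clip_ix} out of range");
        clip_bbox = clip_bbox_buf[clip_ix as usize - 1];
    }
    clip_bbox
}

fn main() {
    let bufs = [[0.0, 0.0, 100.0, 100.0]];
    // clip_ix == 0 means "no clip": the unbounded default is returned.
    assert_eq!(load_clip_bbox(0, 1, &bufs), [-1e9, -1e9, 1e9, 1e9]);
    // clip_ix == 1 indexes entry 0 of the clip bbox buffer.
    assert_eq!(load_clip_bbox(1, 1, &bufs), [0.0, 0.0, 100.0, 100.0]);
}
```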

Member:

Could we also lower the length of the slice to that value?

Contributor (Author):

That latter question is a deep one. It is clearly more idiomatic Rust, but I think the goal here is to match the GPU as much as possible, which to me reads as putting in the assert.

src/cpu_shader/clip_reduce.rs (outdated, resolved)
```rust
let mut tile_state = TileState::new(this_tile_ix);
let blend_offset = tile_state.cmd_offset;
tile_state.cmd_offset += 1;
// Discussion question: do these belong in tile state?
```
Collaborator:

Is this comment referring to moving clip_depth and clip_zero_depth into the TileState struct? If so, they're mainly used for control flow here, and I'm not sure whether moving them there and manipulating them through methods would obscure the logic too much. I can imagine moving most of the DrawTag handling into TileState methods, but I'm not sure about the utility of it.

Contributor (Author):

I've removed the comment. This was part of an internal dialog about whether to process one tile at a time (as the code currently does), or loop through chunks of 256 draw objects (which would be closer to the GPU model). If you were to go that direction, you'd need a struct to save tile state so you could suspend and then resume with the next chunk.

I'm reasonably satisfied with the current state of the code, as it's closer to the WGSL source, even if it doesn't match the execution model. A variable in the WGSL source looks like an ordinary scalar, but is of course a set of VGPRs.


```rust
if bin_tile_x + tile_x < width_in_tiles && bin_tile_y + tile_y < height_in_tiles {
    ptcl[tile_state.cmd_offset as usize] = CMD_END;
    let scratch_size = 0; // TODO: actually compute
```
Collaborator:

I'm guessing this is referring to the blend depth computation? Maybe mention "blend depth" in the comment so that it resonates with the TODO in the BEGIN_CLIP case above.

```rust
    let path_ix = !ix as i32;
    clip_inp[m.clip_ix as usize] = Clip { ix, path_ix };
}
m = m.combine(&DrawMonoid::new(tag_word));
```
Collaborator:

I wonder if this line should move to line 38 above, where the exclusive prefix sum gets stored into the draw_monoid buffer. In the GPU version the prefix sums get computed up front; while the control flow is very different there, it's nice that the logic is grouped together.

You could store m.path_ix and m.clip_ix into temporaries before combining the monoids. If you prefer to keep things the way they are, then perhaps add a comment near line 34 mentioning that m gets prepared for the "next invocation" at the bottom of the loop.

Contributor (Author):

I've changed it in a slightly different way which I hope is clear; lemme know if you like it.

Comment on lines 28 to 32
```rust
let tag_raw = if ix < config.layout.n_draw_objects {
    scene[(drawtag_base + ix) as usize]
} else {
    0
};
```
Collaborator:

Is it worth defining a CPU version of read_draw_tag_from_scene in util.rs that you can reuse from here and draw_reduce?
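A rough sketch of what such a shared helper could look like, based only on the quoted snippet. The `Layout`/`ConfigUniform` structs and the `draw_tag_base` field are stand-ins for the real vello types (which the sketch trims to the fields used), and the zero fallback mirrors the snippet's out-of-range branch:

```rust
// Hedged sketch of the suggested shared helper; types are assumptions.
struct Layout {
    n_draw_objects: u32,
    draw_tag_base: u32,
}

struct ConfigUniform {
    layout: Layout,
}

/// Reads the draw tag at `ix` from the scene buffer, returning 0 for
/// indices past the last draw object, mirroring the quoted guard.
fn read_draw_tag_from_scene(config: &ConfigUniform, scene: &[u32], ix: u32) -> u32 {
    if ix < config.layout.n_draw_objects {
        scene[(config.layout.draw_tag_base + ix) as usize]
    } else {
        0
    }
}

fn main() {
    let config = ConfigUniform {
        layout: Layout { n_draw_objects: 2, draw_tag_base: 2 },
    };
    let scene = [0u32, 0, 0x44, 0x05]; // tags start at offset 2
    assert_eq!(read_draw_tag_from_scene(&config, &scene, 0), 0x44);
    assert_eq!(read_draw_tag_from_scene(&config, &scene, 1), 0x05);
    // Past the last draw object: 0 (a no-op tag) instead of an OOB panic.
    assert_eq!(read_draw_tag_from_scene(&config, &scene, 9), 0);
}
```

Both this call site and draw_reduce could then share the bounds handling in one place instead of duplicating the `if/else`.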

```rust
        let a = (b + 0.5 * (d * d - c * c) - xmin) / (xmax - xmin);
        area[yi * TILE_WIDTH + i] += y_edge + a * dy;
    }
} else if y_edge != 0.0 {
```
Collaborator:

The logic (considering the nuance in line 85) all seems to be equivalent to the GPU version, but I had to squint at the subtle branch shortcut here a bit before it made sense :) It's all good; it makes sense to avoid a no-op loop when running on the CPU.

```rust
let c = s0.x * sign - xt0;
let y0 = s0.y.floor();
let ytop = if s0.y == s1.y { s0.y.ceil() } else { y0 + 1.0 };
let b = dy_dxdy * c + a * (ytop - s0.y);
```
Collaborator:

Is there a reason for the different formulations between the WGSL and Rust versions (dy_dxdy/idxdy, a, b, etc.)? Should they be the same?

I'm also slightly confused about a. It's defined as `1.0 - dy / (dx + dy)` here but as `dx / (dx + dy)` in WGSL. It's then used in the same calculations below. Are these equivalent?

Contributor (Author):

OK, this question is incredibly deep water. The short answer is that you're looking at an experiment to improve numerical robustness. The ultimate solution to the artifact was to discard horizontal line segments aligned to grid boundaries before they get to this point. What happened there is a = 1 - epsilon, b = 0, while you need a = 1 for conservative line rasterization to generate the right answer. In CPU land, you can use a division instead of multiplying by a reciprocal to make that case work, and use the fact that dx/(dx + dy) + dy/(dx + dy) = 1 to avoid having to compute two divisions. In GPU land, WGSL doesn't give you that guarantee.

What I think is the best path going forward is to undo that experiment and make the CPU match the GPU. It's possible (especially with carefully constructed adversarial examples) that you might trigger incorrect behavior - in fact, this is one of the things I'll look at for #373. But you're absolutely right that CPU and GPU should match unless there's a compelling reason otherwise.
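The grid-aligned case being described can be illustrated with a tiny sketch (not the shader code; the example values are made up, and only the trivially checkable dy == 0 case is asserted):

```rust
fn main() {
    // Grid-aligned horizontal case: dy == 0. The division formulation
    // yields a == 1.0 exactly, which conservative line rasterization
    // needs; per the discussion above, a reciprocal-multiply on the GPU
    // can instead land at 1 - epsilon for this case.
    let (dx, dy) = (0.7_f32, 0.0_f32);
    let dy_dxdy = dy / (dx + dy); // 0.0 exactly when dy == 0
    let a = 1.0 - dy_dxdy; // 1.0 exactly, and no second division needed
    assert_eq!(a, 1.0);
    // By construction a and dy_dxdy partition 1, exploiting
    // dx/(dx + dy) + dy/(dx + dy) == 1 with a single division.
    assert_eq!(a + dy_dxdy, 1.0);
}
```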

Comment on lines 40 to 41
```rust
let dy_dxdy = dy / (dx + dy);
let a = 1.0 - dy_dxdy;
```
Collaborator:

I have the same question as before (why the different formulations between WGSL and Rust?).

armansito (Collaborator) left a comment:

Thank you for the round trip and for addressing the comments. I think we can move forward with this PR unless others have any concerns.

raphlinus merged commit 9bdbb10 into main on Oct 11, 2023; 4 checks passed.
raphlinus deleted the cpu_shader branch on October 11, 2023 at 22:57.