
Add CPU shaders #374

Merged: 3 commits merged into main from cpu_shader on Oct 11, 2023
Conversation

raphlinus (Contributor):

This commit adds a full pipeline of CPU shaders for the compute stage of the pipeline (not counting fine rasterization). It's based on what's in the multi branch, but also ports bounding box fixes and other cleanups.

Currently the selection of shaders is all or nothing (based on the use-cpu flag), but will probably become more fine-grained.

Going forward, the intent is to keep the CPU and GPU shaders in sync; doing so has been quite valuable in diagnosing problems in the pipeline.

eliasnaur (Collaborator):

Can the CPU shaders be put under the Unlicense too, like their GPU counterparts? I'm working toward another port of Vello, into the Gio graphics pipeline, and one of the features we absolutely need is a CPU fallback.

> not counting fine rasterization

May I ask what the plans are for the fine rasterization stage? Will Vello never run on platforms without GPU support for compute?

DJMcNab (Member) left a comment:

A few comments; I've not read through the shader code.

crates/encoding/src/path.rs (outdated, resolved):
```rust
}

pub enum TypedBufGuardMut<'a, T: ?Sized> {
    Slice(&'a mut T),
```
Member:

Is this variant ever constructed?

Contributor (Author):

I don't think so; I'd have to check.

```rust
/// Second, the actual mapping to CPU shaders is not really specific to
/// the engine, and should be split out into a back-end agnostic struct.
pub fn install_cpu_shaders(&mut self, engine: &mut WgpuEngine) {
    engine.set_cpu_shader(self.pathtag_reduce, cpu_shader::pathtag_reduce);
```
Member:

Ooh neat - this is much less noisy

```rust
impl Transform {
    pub fn apply(&self, p: Vec2) -> Vec2 {
        let z = self.0;
        let x = z[0] * p.x + z[2] * p.y + z[4];
```
Member:

I have not checked these operations, but am reminded of #283.
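For context, the snippet above can be read as an ordinary 2D affine transform. The following is a hedged sketch, not the actual vello code: the `Vec2` definition, the `y` computation, and the coefficient layout (the usual `[a, b, c, d, e, f]` column-major affine convention) are all assumptions inferred from the four quoted lines.

```rust
// Hedged sketch completing the quoted `apply`: the six coefficients are
// assumed to be the standard affine matrix [a, b, c, d, e, f], so the y
// component mirrors the quoted x computation with the odd indices.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Vec2 {
    pub x: f32,
    pub y: f32,
}

pub struct Transform(pub [f32; 6]);

impl Transform {
    pub fn apply(&self, p: Vec2) -> Vec2 {
        let z = self.0;
        let x = z[0] * p.x + z[2] * p.y + z[4];
        let y = z[1] * p.x + z[3] * p.y + z[5];
        Vec2 { x, y }
    }
}

fn main() {
    // Identity scale plus a (10, 20) translation maps (1, 2) to (11, 22).
    let t = Transform([1.0, 0.0, 0.0, 1.0, 10.0, 20.0]);
    let q = t.apply(Vec2 { x: 1.0, y: 2.0 });
    assert_eq!(q, Vec2 { x: 11.0, y: 22.0 });
}
```

Under this reading, the concern in #283 would be about the floating-point behavior of the multiply-adds, not the matrix layout itself.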

Adds the Unlicense to the SPDX line, and also relaxes the 64k (or any) limit on the number of path tags in the CPU case.

The CPU shaders don't exactly match the GPU in the use_large_path_scan case, but they do work.
raphlinus (Contributor, Author):

OK, I've fixed the crashes with large path tag counts (though it's a bit ugly: the CPU doesn't quite match the GPU, and making it match would take a lot of repetitive code that wouldn't be required if Metal didn't suck so hard). I've also improved one of the comments.

There's a draft of fine rasterization in the PR; I just haven't finished it. I should warn that while I think these can be adapted into reasonably performant fallback implementations, they're not currently optimized for performance; that might be downstream work, which I think makes sense after the pipeline has stabilized more. I think we may well do CPU fine rasterization, but there are a lot of details that are unclear, including whether we want to do it multithreaded and/or use SIMD.

armansito (Collaborator) left a comment:

Thank you for making progress with the big CPU port. I reviewed all the shaders as carefully as I could and left a number of comments/suggestions/questions.

Aside from some questions I have around the subtle differences from the WGSL, I'm OK with merging this.

```rust
let draw_monoid = draw_monoids[element_ix];
let mut clip_bbox = [-1e9, -1e9, 1e9, 1e9];
if draw_monoid.clip_ix > 0 {
    clip_bbox = clip_bbox_buf[draw_monoid.clip_ix as usize - 1];
```
Collaborator:

Might as well address the TODO in binning.wgsl and add the assertion here: `assert!(draw_monoid.clip_ix - 1 < config.layout.n_clip)`.
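A minimal sketch of what the suggested assertion could look like in context (the function shape, the `n_clip` parameter standing in for `config.layout.n_clip`, and the helper name are assumptions, not the actual shader code; only the default bbox and indexing come from the snippet above):

```rust
// Hedged sketch of the suggested bounds check before the clip bbox read.
fn load_clip_bbox(clip_ix: u32, n_clip: u32, clip_bbox_buf: &[[f32; 4]]) -> [f32; 4] {
    // Default: effectively unbounded bbox, as in the quoted snippet.
    let mut clip_bbox = [-1e9, -1e9, 1e9, 1e9];
    if clip_ix > 0 {
        // The suggested assertion: catch a stale or corrupt clip_ix
        // before it turns into a silent out-of-bounds read.
        assert!(clip_ix - 1 < n_clip, "clip_ix {clip_ix} out of range");
        clip_bbox = clip_bbox_buf[clip_ix as usize - 1];
    }
    clip_bbox
}

fn main() {
    let bufs = [[0.0, 0.0, 100.0, 100.0]];
    // clip_ix == 0 means "no clip": the unbounded default is returned.
    assert_eq!(load_clip_bbox(0, 1, &bufs), [-1e9, -1e9, 1e9, 1e9]);
    // clip_ix == 1 indexes entry 0 of the clip bbox buffer.
    assert_eq!(load_clip_bbox(1, 1, &bufs), [0.0, 0.0, 100.0, 100.0]);
}
```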

Member:

Could we also lower the length of the slice to that value?

Contributor (Author):

That latter question is a deep one. It is clearly more idiomatic Rust, but I think the goal here is to match the GPU as much as possible, which to me reads as putting in the assert.

src/cpu_shader/clip_reduce.rs (outdated, resolved)
```rust
let mut tile_state = TileState::new(this_tile_ix);
let blend_offset = tile_state.cmd_offset;
tile_state.cmd_offset += 1;
// Discussion question: do these belong in tile state?
```
Collaborator:

Is this comment referring to moving clip_depth and clip_zero_depth into the TileState struct? If so, they're mainly used for control flow here, and I'm not sure whether moving them there and manipulating them through methods would obscure the logic too much. I can imagine moving most of the DrawTag handling into TileState methods, but I'm not sure about the utility of it.

Contributor (Author):

I've removed the comment. This was part of an internal dialog about whether to process one tile at a time (as the code currently does), or loop through chunks of 256 draw objects (which would be closer to the GPU model). If you were to go that direction, you'd need a struct to save tile state so you could suspend and then resume with the next chunk.

I'm reasonably satisfied with the current state of the code, as it's closer to the WGSL source, even if it doesn't match the execution model. A variable in the WGSL source looks like an ordinary scalar, but is of course a set of VGPRs.


```rust
if bin_tile_x + tile_x < width_in_tiles && bin_tile_y + tile_y < height_in_tiles {
    ptcl[tile_state.cmd_offset as usize] = CMD_END;
    let scratch_size = 0; // TODO: actually compute
```
Collaborator:

I'm guessing this is referring to the blend depth computation? Maybe mention "blend depth" in the comment so that it resonates with the TODO in the BEGIN_CLIP case above.

```rust
    let path_ix = !ix as i32;
    clip_inp[m.clip_ix as usize] = Clip { ix, path_ix };
}
m = m.combine(&DrawMonoid::new(tag_word));
```
Collaborator:

I wonder if this line should move to line 38 above, where the exclusive prefix sum gets stored into the draw_monoid buffer. In the GPU version the prefix sums get computed up front; while the control flow is very different there, it's nice that the logic is grouped together.

You could store m.path_ix and m.clip_ix into temporaries before combining the monoids. If you prefer to keep things the way they are, then perhaps add a comment near line 34 mentioning that m gets prepared for the "next invocation" at the bottom of the loop.

Contributor (Author):

I've changed it in a slightly different way which I hope is clear; lemme know if you like it.

Comment on lines 28 to 32
```rust
let tag_raw = if ix < config.layout.n_draw_objects {
    scene[(drawtag_base + ix) as usize]
} else {
    0
};
```
Collaborator:

Is it worth defining a CPU version of read_draw_tag_from_scene in util.rs that you can reuse from here and draw_reduce?
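A rough sketch of what such a shared helper could look like, based only on the quoted snippet. The `Layout`/`ConfigUniform` structs and the `draw_tag_base` field are stand-ins for the real vello types (which the sketch trims to the fields used), and the zero fallback mirrors the snippet's out-of-range branch:

```rust
// Hedged sketch of the suggested shared helper; types are assumptions.
struct Layout {
    n_draw_objects: u32,
    draw_tag_base: u32,
}

struct ConfigUniform {
    layout: Layout,
}

/// Reads the draw tag at `ix` from the scene buffer, returning 0 for
/// indices past the last draw object, mirroring the quoted guard.
fn read_draw_tag_from_scene(config: &ConfigUniform, scene: &[u32], ix: u32) -> u32 {
    if ix < config.layout.n_draw_objects {
        scene[(config.layout.draw_tag_base + ix) as usize]
    } else {
        0
    }
}

fn main() {
    let config = ConfigUniform {
        layout: Layout { n_draw_objects: 2, draw_tag_base: 2 },
    };
    let scene = [0u32, 0, 0x44, 0x05]; // tags start at offset 2
    assert_eq!(read_draw_tag_from_scene(&config, &scene, 0), 0x44);
    assert_eq!(read_draw_tag_from_scene(&config, &scene, 1), 0x05);
    // Past the last draw object: 0 (a no-op tag) instead of an OOB panic.
    assert_eq!(read_draw_tag_from_scene(&config, &scene, 9), 0);
}
```

Both this call site and draw_reduce could then share the bounds handling in one place instead of duplicating the `if/else`.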

```rust
        let a = (b + 0.5 * (d * d - c * c) - xmin) / (xmax - xmin);
        area[yi * TILE_WIDTH + i] += y_edge + a * dy;
    }
} else if y_edge != 0.0 {
```
Collaborator:

The logic (considering the nuance in line 85) all seems to be equivalent to the GPU version, but I had to squint at the subtle branch shortcut here a bit before it made sense :) It's all good; it makes sense to avoid a no-op loop when running on the CPU.

```rust
let c = s0.x * sign - xt0;
let y0 = s0.y.floor();
let ytop = if s0.y == s1.y { s0.y.ceil() } else { y0 + 1.0 };
let b = dy_dxdy * c + a * (ytop - s0.y);
```
Collaborator:

Is there a reason for the different formulations between the WGSL and Rust versions (dy_dxdy/idxdy, a, b, etc.)? Should they be the same?

I'm also slightly confused about a. It's defined as `1.0 - dy / (dx + dy)` here but as `dx / (dx + dy)` in WGSL. It's then used in the same calculations below. Are these equivalent?

Contributor (Author):

OK, this question is incredibly deep water. The short answer is that you're looking at an experiment to improve numerical robustness. The ultimate solution to the artifact was to discard horizontal line segments aligned to grid boundaries before they get to this point. What happened there is a = 1 - epsilon, b = 0, while you need a = 1 for conservative line rasterization to generate the right answer. In CPU land, you can use a division instead of multiplying by a reciprocal to make that case work, and use the fact that dx/(dx + dy) + dy/(dx + dy) = 1 to avoid having to compute two divisions. In GPU land, WGSL doesn't give you that guarantee.

What I think is the best path going forward is to undo that experiment and make the CPU match the GPU. It's possible (especially with carefully constructed adversarial examples) that you might trigger incorrect behavior - in fact, this is one of the things I'll look at for #373. But you're absolutely right that CPU and GPU should match unless there's a compelling reason otherwise.
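The grid-aligned case being described can be illustrated with a tiny sketch (not the shader code; the example values are made up, and only the trivially checkable dy == 0 case is asserted):

```rust
fn main() {
    // Grid-aligned horizontal case: dy == 0. The division formulation
    // yields a == 1.0 exactly, which conservative line rasterization
    // needs; per the discussion above, a reciprocal-multiply on the GPU
    // can instead land at 1 - epsilon for this case.
    let (dx, dy) = (0.7_f32, 0.0_f32);
    let dy_dxdy = dy / (dx + dy); // 0.0 exactly when dy == 0
    let a = 1.0 - dy_dxdy; // 1.0 exactly, and no second division needed
    assert_eq!(a, 1.0);
    // By construction a and dy_dxdy partition 1, exploiting
    // dx/(dx + dy) + dy/(dx + dy) == 1 with a single division.
    assert_eq!(a + dy_dxdy, 1.0);
}
```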

Comment on lines 40 to 41
```rust
let dy_dxdy = dy / (dx + dy);
let a = 1.0 - dy_dxdy;
```
Collaborator:

I have the same question as before (why the different formulations between WGSL and Rust?).

armansito (Collaborator) left a comment:

Thank you for the round trip and for addressing the comments. I think we can move forward with this PR unless others have any concerns.

raphlinus merged commit 9bdbb10 into main on Oct 11, 2023; 4 checks passed.
raphlinus deleted the cpu_shader branch on October 11, 2023 at 22:57.