`struct Dav1dLoopRestorationDSPContext`: Deduplicate via type-erased `fn` pointers #332

randomPoison · 2023-07-18T23:09:47Z

Alternative to #323 that approaches the problem of deduplicating bitdepth-dependent fn pointers by type-erasing the bitdepth-dependent parts. To accomplish this, the relevant extern "C" declarations were moved into looprestoration.rs, where pixel has been defined as c_void. For the 8bpc versions, an extra bitdepth_max: c_int argument has been added so that the 8bpc and 16bpc fns have the same signature, with the understanding that this extra argument will simply be ignored by the called asm routine. For fn pointers that are defined in Rust code, type-erased wrappers have been added..

thedataking · 2023-07-19T00:18:35Z

This PR seems to perform on par with main 👍 Detailed results below.

main on amy:

 Performance counter stats for 'target/release/dav1d_main -i /home/perl/.phoronix-test-suite/installed-tests/pts/dav1d-1.13.0/summer_nature_1080p.ivf -o /dev/null' (10 runs):

        439,199.50 msec task-clock                       #  133.922 CPUs utilized               ( +-  9.58% )
         2,562,019      context-switches                 #   10.615 K/sec                       ( +-  9.77% )
           297,627      cpu-migrations                   #    1.233 K/sec                       ( +-  9.17% )
           224,443      page-faults                      #  929.928 /sec                        ( +-  9.43% )
 2,126,995,320,699      cycles                           #    8.813 GHz                         ( +-  9.57% )  (33.25%)
     4,246,291,058      stalled-cycles-frontend          #    0.36% frontend cycles idle        ( +-  9.66% )  (33.37%)
   102,508,224,695      stalled-cycles-backend           #    8.76% backend cycles idle         ( +-  9.57% )  (33.41%)
 3,282,166,920,179      instructions                     #    2.81  insn per cycle
                                                  #    0.02  stalled cycles per insn     ( +-  9.57% )  (33.51%)
   365,381,120,153      branches                         #    1.514 G/sec                       ( +-  9.57% )  (33.61%)
    10,768,664,754      branch-misses                    #    5.36% of all branches             ( +-  9.58% )  (33.63%)
 1,173,796,106,150      L1-dcache-loads                  #    4.863 G/sec                       ( +-  9.57% )  (33.58%)
    55,074,125,113      L1-dcache-load-misses            #    8.53% of all L1-dcache accesses   ( +-  9.57% )  (33.55%)
   <not supported>      LLC-loads
   <not supported>      LLC-load-misses
   183,051,431,852      L1-icache-loads                  #  758.432 M/sec                       ( +-  9.57% )  (33.49%)
     2,143,450,143      L1-icache-load-misses            #    2.13% of all L1-icache accesses   ( +-  9.57% )  (33.45%)
    10,703,905,212      dTLB-loads                       #   44.349 M/sec                       ( +-  9.55% )  (33.44%)
       674,383,927      dTLB-load-misses                 #   11.41% of all dTLB cache accesses  ( +-  9.55% )  (33.37%)
       711,692,244      iTLB-loads                       #    2.949 M/sec                       ( +- 10.21% )  (33.36%)
        33,730,104      iTLB-load-misses                 #    8.54% of all iTLB cache accesses  ( +-  9.49% )  (33.32%)
    13,201,969,774      L1-dcache-prefetches             #   54.699 M/sec                       ( +-  9.58% )  (33.24%)
   <not supported>      L1-dcache-prefetch-misses

            3.2795 +- 0.0217 seconds time elapsed  ( +-  0.66% )

randompoison/looprestorationfilterfn-type-erased on amy:

 Performance counter stats for 'target/release/dav1d_type_erased -i /home/perl/.phoronix-test-suite/installed-tests/pts/dav1d-1.13.0/summer_nature_1080p.ivf -o /dev/null' (10 runs):

        437,835.21 msec task-clock                       #  134.771 CPUs utilized               ( +-  9.58% )
         2,645,153      context-switches                 #   10.971 K/sec                       ( +-  9.55% )
           284,775      cpu-migrations                   #    1.181 K/sec                       ( +-  9.58% )
           227,364      page-faults                      #  943.050 /sec                        ( +-  9.56% )
 2,113,482,139,391      cycles                           #    8.766 GHz                         ( +-  9.58% )  (33.16%)
     5,059,443,394      stalled-cycles-frontend          #    0.43% frontend cycles idle        ( +- 10.09% )  (33.16%)
    98,043,549,034      stalled-cycles-backend           #    8.42% backend cycles idle         ( +-  9.61% )  (33.23%)
 3,281,231,639,074      instructions                     #    2.82  insn per cycle
                                                  #    0.02  stalled cycles per insn     ( +-  9.57% )  (33.38%)
   365,642,223,275      branches                         #    1.517 G/sec                       ( +-  9.58% )  (33.54%)
    10,821,064,869      branch-misses                    #    5.38% of all branches             ( +-  9.59% )  (33.69%)
 1,174,299,641,060      L1-dcache-loads                  #    4.871 G/sec                       ( +-  9.58% )  (33.73%)
    55,123,903,195      L1-dcache-load-misses            #    8.54% of all L1-dcache accesses   ( +-  9.58% )  (33.70%)
   <not supported>      LLC-loads
   <not supported>      LLC-load-misses
   179,409,663,195      L1-icache-loads                  #  744.147 M/sec                       ( +-  9.58% )  (33.66%)
     3,985,377,741      L1-icache-load-misses            #    4.03% of all L1-icache accesses   ( +-  9.55% )  (33.55%)
    11,185,978,265      dTLB-loads                       #   46.397 M/sec                       ( +-  9.55% )  (33.45%)
       714,141,376      dTLB-load-misses                 #   11.75% of all dTLB cache accesses  ( +-  9.53% )  (33.42%)
       823,210,289      iTLB-loads                       #    3.414 M/sec                       ( +-  9.54% )  (33.37%)
        36,109,922      iTLB-load-misses                 #    7.99% of all iTLB cache accesses  ( +-  9.51% )  (33.27%)
    13,244,178,331      L1-dcache-prefetches             #   54.934 M/sec                       ( +-  9.56% )  (33.20%)
   <not supported>      L1-dcache-prefetch-misses

           3.24874 +- 0.00886 seconds time elapsed  ( +-  0.27% )

main on donna:

 Performance counter stats for 'target/release/dav1d_main -i /tmp/summer_nature_1080p.ivf -o /dev/null' (10 runs):

         72,951.04 msec task-clock                #   15.972 CPUs utilized            ( +-  0.29% )
           721,201      context-switches          #    9.816 K/sec                    ( +-  0.31% )
            18,389      cpu-migrations            #  250.286 /sec                     ( +-  0.91% )
            46,456      page-faults               #  632.295 /sec                     ( +-  0.40% )
   279,618,070,283      cycles                    #    3.806 GHz                      ( +-  0.48% )  (33.32%)
     1,872,558,545      stalled-cycles-frontend   #    0.66% frontend cycles idle     ( +-  0.27% )  (33.32%)
    25,072,428,701      stalled-cycles-backend    #    8.82% backend cycles idle      ( +-  1.27% )  (33.34%)
   360,214,494,844      instructions              #    1.27  insn per cycle
                                                  #    0.07  stalled cycles per insn  ( +-  0.09% )  (33.64%)
    38,087,857,819      branches                  #  518.400 M/sec                    ( +-  0.10% )  (33.50%)
     1,211,672,549      branch-misses             #    3.18% of all branches          ( +-  0.14% )  (33.76%)
   124,829,564,724      L1-dcache-loads           #    1.699 G/sec                    ( +-  0.09% )  (33.65%)
     5,525,318,196      L1-dcache-load-misses     #    4.43% of all L1-dcache accesses  ( +-  0.29% )  (33.50%)
   <not supported>      LLC-loads
   <not supported>      LLC-load-misses
    28,918,434,218      L1-icache-loads           #  393.598 M/sec                    ( +-  0.16% )  (33.29%)
       455,719,910      L1-icache-load-misses     #    1.58% of all L1-icache accesses  ( +-  0.22% )  (33.48%)
     1,770,037,275      dTLB-loads                #   24.091 M/sec                    ( +-  0.15% )  (33.34%)
       130,726,618      dTLB-load-misses          #    7.39% of all dTLB cache accesses  ( +-  0.15% )  (33.44%)
        66,392,191      iTLB-loads                #  903.639 K/sec                    ( +-  0.44% )  (33.56%)
         5,484,080      iTLB-load-misses          #    8.20% of all iTLB cache accesses  ( +-  1.21% )  (33.47%)
     1,844,991,747      L1-dcache-prefetches      #   25.111 M/sec                    ( +-  0.23% )  (33.38%)
   <not supported>      L1-dcache-prefetch-misses

           4.56737 +- 0.00785 seconds time elapsed  ( +-  0.17% )

randompoison/looprestorationfilterfn-type-erased on donna:

 Performance counter stats for 'target/release/dav1d_type_erased -i /tmp/summer_nature_1080p.ivf -o /dev/null' (10 runs):

         73,848.82 msec task-clock                #   16.112 CPUs utilized            ( +-  0.25% )
           724,702      context-switches          #    9.872 K/sec                    ( +-  0.22% )
            20,254      cpu-migrations            #  275.890 /sec                     ( +-  1.07% )
            46,199      page-faults               #  629.299 /sec                     ( +-  0.38% )
   276,235,019,232      cycles                    #    3.763 GHz                      ( +-  0.42% )  (33.45%)
     1,853,322,056      stalled-cycles-frontend   #    0.66% frontend cycles idle     ( +-  0.17% )  (33.35%)
    25,660,583,374      stalled-cycles-backend    #    9.18% backend cycles idle      ( +-  1.10% )  (33.41%)
   360,752,781,196      instructions              #    1.29  insn per cycle
                                                  #    0.07  stalled cycles per insn  ( +-  0.07% )  (33.36%)
    38,185,882,857      branches                  #  520.149 M/sec                    ( +-  0.06% )  (33.62%)
     1,207,991,991      branch-misses             #    3.16% of all branches          ( +-  0.18% )  (33.54%)
   124,920,165,377      L1-dcache-loads           #    1.702 G/sec                    ( +-  0.10% )  (33.66%)
     5,577,468,633      L1-dcache-load-misses     #    4.46% of all L1-dcache accesses  ( +-  0.12% )  (33.59%)
   <not supported>      LLC-loads
   <not supported>      LLC-load-misses
    28,789,384,267      L1-icache-loads           #  392.154 M/sec                    ( +-  0.08% )  (33.45%)
       500,723,677      L1-icache-load-misses     #    1.74% of all L1-icache accesses  ( +-  0.20% )  (33.34%)
     1,752,341,129      dTLB-loads                #   23.869 M/sec                    ( +-  0.16% )  (33.32%)
       129,635,661      dTLB-load-misses          #    7.34% of all dTLB cache accesses  ( +-  0.13% )  (33.30%)
        69,143,673      iTLB-loads                #  941.840 K/sec                    ( +-  7.50% )  (33.34%)
         5,684,039      iTLB-load-misses          #    9.76% of all iTLB cache accesses  ( +-  0.95% )  (33.49%)
     1,834,558,619      L1-dcache-prefetches      #   24.989 M/sec                    ( +-  0.30% )  (33.33%)
   <not supported>      L1-dcache-prefetch-misses

            4.5835 +- 0.0139 seconds time elapsed  ( +-  0.30% )

thedataking

LGTM 👍 I really like how this PR doesn't need a lot of macros; I find this style more readable.

kkysen · 2023-07-19T01:23:30Z

src/lr_apply_tmpl_16.rs

@@ -715,10 +715,10 @@ unsafe extern "C" fn lr_stripe(
                    & LR_HAVE_BOTTOM as libc::c_int as libc::c_uint,
        );
        lr_fn.expect("non-null function pointer")(
-            p,
+            p.cast(),


We'll want to put this behind a typed interface eventually. I think when we genericize over bitdepth is a good time to do that.

src/looprestoration_tmpl_16.rs

src/looprestoration.rs

kkysen · 2023-07-19T03:37:49Z

src/looprestoration.rs

+// TODO(randomPoison): Temporarily pub until init fns are deduplicated.
+#[cfg(all(
+    feature = "bitdepth_8",
+    feature = "asm",
+    any(target_arch = "x86", target_arch = "x86_64"),
+))]
+extern "C" {
+    pub(crate) fn dav1d_wiener_filter7_8bpc_sse2(
+        dst: *mut pixel,


Do we want to use a macro to generate these? Is that worth it? It'd look much cleaner and the C version uses macros for this, too. We can do this in a follow-up PR, though.

Do we want to use a macro to generate these?

IMO it is best to avoid compactifying code using macros. If often interferes with stuff like go-to-definition, grepping for a symbol, and debugging (which is why we need aids like -Zunpretty=expanded).

Oh, before I saw these comments I went and rewrote this to use a macro 😅

@thedataking I understand your concerns with using macros generally, but in this case I think it's worth it. In addition to making our code more compact, using a macro also has the advantage that it makes it much easier to see which functions all share the same signature, which I think is a big win for readability of the code.

kkysen

This looks pretty great to me. I left just a few small comments to fix before merging (can we squash to keep all commits to main CI-passing?), as well as a few comments about further improvements for later PRs and once bitdepth genericization is done.

kkysen · 2023-07-19T19:11:09Z

src/looprestoration.rs

+macro_rules! extern_fn {
+    ( $( $name:ident, )* ) => {
+        extern "C" {
+            $(
+                // TODO(randomPoison): Temporarily pub until init fns are deduplicated.
+                pub(crate) fn $name(
+                    dst: *mut pixel,
+                    dst_stride: ptrdiff_t,
+                    left: const_left_pixel_row,
+                    lpf: *const pixel,
+                    w: libc::c_int,
+                    h: libc::c_int,
+                    params: *const LooprestorationParams,
+                    edges: LrEdgeFlags,
+                    bitdepth_max: libc::c_int,
+                );
+            )*
+        }
+    };
+}
+
+#[cfg(all(
+    feature = "bitdepth_8",
+    feature = "asm",
+    any(target_arch = "x86", target_arch = "x86_64"),
+))]
+extern_fn! {
+    dav1d_wiener_filter7_8bpc_sse2,
+    dav1d_wiener_filter5_8bpc_sse2,


To help with @thedataking's concerns for using a macro for this (#332 (comment)), how about we make the macro something like decl_looprestoration_fn!(fn dav1d_wiener_filter7_8bpc_sse2);? This way it's

closer to the C version

declares one fn per macro call, so go-to-definition is easier to use

keeps the fn $name together so that it's more easily greppable

I don't think there's a big advantage to only declaring one fn per macro invocation, and I think it mostly increases visual noise. However, I do think doing fn $name is a good idea in order to make it easier to find the fn declarations by grepping 👍

One-per-invocation does have some important advantages, though. When the macro invocation has an error, it's very difficult to tell from which part of it the error comes from (e.x. https://github.com/immunant/c2rust/blob/0021ec979663b96b265ebf4995b601220bfbac32/c2rust-analyze/src/known_fn.rs#L570-L629). If we have separate macro invocations, it's super clear which invocation causes the error.

Similarly, with go-to-definition, it generally will work much better if there are separate invocations, though I have noticed in some cases that rust-analyzer is smart enough to figure it out still (e.x. https://github.com/immunant/c2rust/blob/0021ec979663b96b265ebf4995b601220bfbac32/c2rust-analyze/tests/filecheck.rs#L33-L65).

Appreciate the effort to address my concerns. That said, what does the macro buy you? Code compactness and not much else. Code compactness takes a backseat to readability, debuggability, greppability, etc. IMHO.

I often grep for fn dav1d_wiener_filter7_8bpc_sse2 or some variation thereof, maybe fn dav1d_wiener_filter7_8bpc if I only want to see function definitions. Please (please), consider using macros sparingly. C uses macros and typedefs extensively as a way to paper over portability issues. Rust often provides portable types and APIs so the reasonable expectation is for Rust code to use macros less.

A good use of macros is when you have a "can't see the forrest for all the trees" situation where you can't see the overall pattern of the code for the sheer repetitiveness. I think we had situations like that in decode.rs where using macros helped. I just don't think we have the same situation here.

Thanks for hearing out my concerns. Now that I've voiced my concerns, I'll extricate myself from this discussion.

Yeah, it seems like rust-analyzer is able to figure it out in this case, likely because the macro is so simple.

Another reason to keep the current version is that it makes it easier to group fns by their #[cfg] attribute. Doing one macro invocation per function means that the #[cfg] attributes would need to be duplicated a lot, which hurts readability.

I think what I have currently is cleaner than the proposed alternative, but if you feel strongly that we should go with a different approach then I can make the change.

Another reason to keep the current version is that it makes it easier to group fns by their #[cfg] attribute. Doing one macro invocation per function means that the #[cfg] attributes would need to be duplicated a lot, which hurts readability.

I hadn't thought of that. That definitely helps a lot then. But the macro invocation per function could also be done by putting the extern "C" block outside of the macro invocations, where we could add the #[cfg]s.

if you feel strongly that we should go with a different approach then I can make the change

I'm fine with what you have, keeping my LGTM. I wasn't factoring in that the macro helped you group fns by cfg. Also good to know that grepping and navigation is expected to work 👍

Appreciate the effort to address my concerns. That said, what does the macro buy you? Code compactness and not much else. Code compactness takes a backseat to readability, debuggability, greppability, etc. IMHO.

I often grep for fn dav1d_wiener_filter7_8bpc_sse2 or some variation thereof, maybe fn dav1d_wiener_filter7_8bpc if I only want to see function definitions. Please (please), consider using macros sparingly. C uses macros and typedefs extensively as a way to paper over portability issues. Rust often provides portable types and APIs so the reasonable expectation is for Rust code to use macros less.

A good use of macros is when you have a "can't see the forrest for all the trees" situation where you can't see the overall pattern of the code for the sheer repetitiveness. I think we had situations like that in decode.rs where using macros helped. I just don't think we have the same situation here.

To me, the macro here is more readable because it lets you clearly see that every function has the same signature. That's obscured when you see a wall of code. There's not that much use in reading each function individually as they're all just variations of the same function. The macros we had in the devirtualization PRs were a lot more complex, but I think this one is very simple and makes it easier for me to read, and I think for @randomPoison, too.

src/looprestoration.rs

randomPoison added 5 commits July 18, 2023 14:08

Deduplicate Dav1dLoopRestorationDSPContext

20f483e

Use type-erased fn pointers

a006153

Fixup arm asm fns

98f0c3a

Fix formatting

869ea73

Add TODOs for temporary pub items

d37d6b8

randomPoison requested review from thedataking and kkysen July 18, 2023 23:09

thedataking approved these changes Jul 19, 2023

View reviewed changes