Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

struct Dav1dLoopRestorationDSPContext: Deduplicate via type-erased fn pointers #332

Merged
merged 8 commits into from
Jul 20, 2023

Conversation

randomPoison
Copy link
Collaborator

Alternative to #323 that approaches the problem of deduplicating bitdepth-dependent fn pointers by type-erasing the bitdepth-dependent parts. To accomplish this, the relevant extern "C" declarations were moved into looprestoration.rs, where pixel has been defined as c_void. For the 8bpc versions, an extra bitdepth_max: c_int argument has been added so that the 8bpc and 16bpc fns have the same signature, with the understanding that this extra argument will simply be ignored by the called asm routine. For fn pointers that are defined in Rust code, type-erased wrappers have been added..

@thedataking
Copy link
Collaborator

thedataking commented Jul 19, 2023

This PR seems to perform on par with main 👍 Detailed results below.

main on amy:

 Performance counter stats for 'target/release/dav1d_main -i /home/perl/.phoronix-test-suite/installed-tests/pts/dav1d-1.13.0/summer_nature_1080p.ivf -o /dev/null' (10 runs):

        439,199.50 msec task-clock                       #  133.922 CPUs utilized               ( +-  9.58% )
         2,562,019      context-switches                 #   10.615 K/sec                       ( +-  9.77% )
           297,627      cpu-migrations                   #    1.233 K/sec                       ( +-  9.17% )
           224,443      page-faults                      #  929.928 /sec                        ( +-  9.43% )
 2,126,995,320,699      cycles                           #    8.813 GHz                         ( +-  9.57% )  (33.25%)
     4,246,291,058      stalled-cycles-frontend          #    0.36% frontend cycles idle        ( +-  9.66% )  (33.37%)
   102,508,224,695      stalled-cycles-backend           #    8.76% backend cycles idle         ( +-  9.57% )  (33.41%)
 3,282,166,920,179      instructions                     #    2.81  insn per cycle
                                                  #    0.02  stalled cycles per insn     ( +-  9.57% )  (33.51%)
   365,381,120,153      branches                         #    1.514 G/sec                       ( +-  9.57% )  (33.61%)
    10,768,664,754      branch-misses                    #    5.36% of all branches             ( +-  9.58% )  (33.63%)
 1,173,796,106,150      L1-dcache-loads                  #    4.863 G/sec                       ( +-  9.57% )  (33.58%)
    55,074,125,113      L1-dcache-load-misses            #    8.53% of all L1-dcache accesses   ( +-  9.57% )  (33.55%)
   <not supported>      LLC-loads
   <not supported>      LLC-load-misses
   183,051,431,852      L1-icache-loads                  #  758.432 M/sec                       ( +-  9.57% )  (33.49%)
     2,143,450,143      L1-icache-load-misses            #    2.13% of all L1-icache accesses   ( +-  9.57% )  (33.45%)
    10,703,905,212      dTLB-loads                       #   44.349 M/sec                       ( +-  9.55% )  (33.44%)
       674,383,927      dTLB-load-misses                 #   11.41% of all dTLB cache accesses  ( +-  9.55% )  (33.37%)
       711,692,244      iTLB-loads                       #    2.949 M/sec                       ( +- 10.21% )  (33.36%)
        33,730,104      iTLB-load-misses                 #    8.54% of all iTLB cache accesses  ( +-  9.49% )  (33.32%)
    13,201,969,774      L1-dcache-prefetches             #   54.699 M/sec                       ( +-  9.58% )  (33.24%)
   <not supported>      L1-dcache-prefetch-misses

            3.2795 +- 0.0217 seconds time elapsed  ( +-  0.66% )

randompoison/looprestorationfilterfn-type-erased on amy:

 Performance counter stats for 'target/release/dav1d_type_erased -i /home/perl/.phoronix-test-suite/installed-tests/pts/dav1d-1.13.0/summer_nature_1080p.ivf -o /dev/null' (10 runs):

        437,835.21 msec task-clock                       #  134.771 CPUs utilized               ( +-  9.58% )
         2,645,153      context-switches                 #   10.971 K/sec                       ( +-  9.55% )
           284,775      cpu-migrations                   #    1.181 K/sec                       ( +-  9.58% )
           227,364      page-faults                      #  943.050 /sec                        ( +-  9.56% )
 2,113,482,139,391      cycles                           #    8.766 GHz                         ( +-  9.58% )  (33.16%)
     5,059,443,394      stalled-cycles-frontend          #    0.43% frontend cycles idle        ( +- 10.09% )  (33.16%)
    98,043,549,034      stalled-cycles-backend           #    8.42% backend cycles idle         ( +-  9.61% )  (33.23%)
 3,281,231,639,074      instructions                     #    2.82  insn per cycle
                                                  #    0.02  stalled cycles per insn     ( +-  9.57% )  (33.38%)
   365,642,223,275      branches                         #    1.517 G/sec                       ( +-  9.58% )  (33.54%)
    10,821,064,869      branch-misses                    #    5.38% of all branches             ( +-  9.59% )  (33.69%)
 1,174,299,641,060      L1-dcache-loads                  #    4.871 G/sec                       ( +-  9.58% )  (33.73%)
    55,123,903,195      L1-dcache-load-misses            #    8.54% of all L1-dcache accesses   ( +-  9.58% )  (33.70%)
   <not supported>      LLC-loads
   <not supported>      LLC-load-misses
   179,409,663,195      L1-icache-loads                  #  744.147 M/sec                       ( +-  9.58% )  (33.66%)
     3,985,377,741      L1-icache-load-misses            #    4.03% of all L1-icache accesses   ( +-  9.55% )  (33.55%)
    11,185,978,265      dTLB-loads                       #   46.397 M/sec                       ( +-  9.55% )  (33.45%)
       714,141,376      dTLB-load-misses                 #   11.75% of all dTLB cache accesses  ( +-  9.53% )  (33.42%)
       823,210,289      iTLB-loads                       #    3.414 M/sec                       ( +-  9.54% )  (33.37%)
        36,109,922      iTLB-load-misses                 #    7.99% of all iTLB cache accesses  ( +-  9.51% )  (33.27%)
    13,244,178,331      L1-dcache-prefetches             #   54.934 M/sec                       ( +-  9.56% )  (33.20%)
   <not supported>      L1-dcache-prefetch-misses

           3.24874 +- 0.00886 seconds time elapsed  ( +-  0.27% )

main on donna:

 Performance counter stats for 'target/release/dav1d_main -i /tmp/summer_nature_1080p.ivf -o /dev/null' (10 runs):

         72,951.04 msec task-clock                #   15.972 CPUs utilized            ( +-  0.29% )
           721,201      context-switches          #    9.816 K/sec                    ( +-  0.31% )
            18,389      cpu-migrations            #  250.286 /sec                     ( +-  0.91% )
            46,456      page-faults               #  632.295 /sec                     ( +-  0.40% )
   279,618,070,283      cycles                    #    3.806 GHz                      ( +-  0.48% )  (33.32%)
     1,872,558,545      stalled-cycles-frontend   #    0.66% frontend cycles idle     ( +-  0.27% )  (33.32%)
    25,072,428,701      stalled-cycles-backend    #    8.82% backend cycles idle      ( +-  1.27% )  (33.34%)
   360,214,494,844      instructions              #    1.27  insn per cycle
                                                  #    0.07  stalled cycles per insn  ( +-  0.09% )  (33.64%)
    38,087,857,819      branches                  #  518.400 M/sec                    ( +-  0.10% )  (33.50%)
     1,211,672,549      branch-misses             #    3.18% of all branches          ( +-  0.14% )  (33.76%)
   124,829,564,724      L1-dcache-loads           #    1.699 G/sec                    ( +-  0.09% )  (33.65%)
     5,525,318,196      L1-dcache-load-misses     #    4.43% of all L1-dcache accesses  ( +-  0.29% )  (33.50%)
   <not supported>      LLC-loads
   <not supported>      LLC-load-misses
    28,918,434,218      L1-icache-loads           #  393.598 M/sec                    ( +-  0.16% )  (33.29%)
       455,719,910      L1-icache-load-misses     #    1.58% of all L1-icache accesses  ( +-  0.22% )  (33.48%)
     1,770,037,275      dTLB-loads                #   24.091 M/sec                    ( +-  0.15% )  (33.34%)
       130,726,618      dTLB-load-misses          #    7.39% of all dTLB cache accesses  ( +-  0.15% )  (33.44%)
        66,392,191      iTLB-loads                #  903.639 K/sec                    ( +-  0.44% )  (33.56%)
         5,484,080      iTLB-load-misses          #    8.20% of all iTLB cache accesses  ( +-  1.21% )  (33.47%)
     1,844,991,747      L1-dcache-prefetches      #   25.111 M/sec                    ( +-  0.23% )  (33.38%)
   <not supported>      L1-dcache-prefetch-misses

           4.56737 +- 0.00785 seconds time elapsed  ( +-  0.17% )

randompoison/looprestorationfilterfn-type-erased on donna:

 Performance counter stats for 'target/release/dav1d_type_erased -i /tmp/summer_nature_1080p.ivf -o /dev/null' (10 runs):

         73,848.82 msec task-clock                #   16.112 CPUs utilized            ( +-  0.25% )
           724,702      context-switches          #    9.872 K/sec                    ( +-  0.22% )
            20,254      cpu-migrations            #  275.890 /sec                     ( +-  1.07% )
            46,199      page-faults               #  629.299 /sec                     ( +-  0.38% )
   276,235,019,232      cycles                    #    3.763 GHz                      ( +-  0.42% )  (33.45%)
     1,853,322,056      stalled-cycles-frontend   #    0.66% frontend cycles idle     ( +-  0.17% )  (33.35%)
    25,660,583,374      stalled-cycles-backend    #    9.18% backend cycles idle      ( +-  1.10% )  (33.41%)
   360,752,781,196      instructions              #    1.29  insn per cycle
                                                  #    0.07  stalled cycles per insn  ( +-  0.07% )  (33.36%)
    38,185,882,857      branches                  #  520.149 M/sec                    ( +-  0.06% )  (33.62%)
     1,207,991,991      branch-misses             #    3.16% of all branches          ( +-  0.18% )  (33.54%)
   124,920,165,377      L1-dcache-loads           #    1.702 G/sec                    ( +-  0.10% )  (33.66%)
     5,577,468,633      L1-dcache-load-misses     #    4.46% of all L1-dcache accesses  ( +-  0.12% )  (33.59%)
   <not supported>      LLC-loads
   <not supported>      LLC-load-misses
    28,789,384,267      L1-icache-loads           #  392.154 M/sec                    ( +-  0.08% )  (33.45%)
       500,723,677      L1-icache-load-misses     #    1.74% of all L1-icache accesses  ( +-  0.20% )  (33.34%)
     1,752,341,129      dTLB-loads                #   23.869 M/sec                    ( +-  0.16% )  (33.32%)
       129,635,661      dTLB-load-misses          #    7.34% of all dTLB cache accesses  ( +-  0.13% )  (33.30%)
        69,143,673      iTLB-loads                #  941.840 K/sec                    ( +-  7.50% )  (33.34%)
         5,684,039      iTLB-load-misses          #    9.76% of all iTLB cache accesses  ( +-  0.95% )  (33.49%)
     1,834,558,619      L1-dcache-prefetches      #   24.989 M/sec                    ( +-  0.30% )  (33.33%)
   <not supported>      L1-dcache-prefetch-misses

            4.5835 +- 0.0139 seconds time elapsed  ( +-  0.30% )

Copy link
Collaborator

@thedataking thedataking left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 👍 I really like how this PR doesn't need a lot of macros; I find this style more readable.

@@ -715,10 +715,10 @@ unsafe extern "C" fn lr_stripe(
& LR_HAVE_BOTTOM as libc::c_int as libc::c_uint,
);
lr_fn.expect("non-null function pointer")(
p,
p.cast(),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We'll want to put this behind a typed interface eventually. I think when we genericize over bitdepth is a good time to do that.

Comment on lines 52 to 60
// TODO(randomPoison): Temporarily pub until init fns are deduplicated.
#[cfg(all(
feature = "bitdepth_8",
feature = "asm",
any(target_arch = "x86", target_arch = "x86_64"),
))]
extern "C" {
pub(crate) fn dav1d_wiener_filter7_8bpc_sse2(
dst: *mut pixel,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to use a macro to generate these? Is that worth it? It'd look much cleaner and the C version uses macros for this, too. We can do this in a follow-up PR, though.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to use a macro to generate these?

IMO it is best to avoid compactifying code using macros. If often interferes with stuff like go-to-definition, grepping for a symbol, and debugging (which is why we need aids like -Zunpretty=expanded).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, before I saw these comments I went and rewrote this to use a macro 😅

@thedataking I understand your concerns with using macros generally, but in this case I think it's worth it. In addition to making our code more compact, using a macro also has the advantage that it makes it much easier to see which functions all share the same signature, which I think is a big win for readability of the code.

Copy link
Collaborator

@kkysen kkysen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks pretty great to me. I left just a few small comments to fix before merging (can we squash to keep all commits to main CI-passing?), as well as a few comments about further improvements for later PRs and once bitdepth genericization is done.

@kkysen kkysen changed the title Deduplicate Dav1dLoopRestorationDSPContext via type erased fn pointers struct Dav1dLoopRestorationDSPContext: Deduplicate via type-erased fn pointers Jul 19, 2023
Comment on lines 52 to 80
macro_rules! extern_fn {
( $( $name:ident, )* ) => {
extern "C" {
$(
// TODO(randomPoison): Temporarily pub until init fns are deduplicated.
pub(crate) fn $name(
dst: *mut pixel,
dst_stride: ptrdiff_t,
left: const_left_pixel_row,
lpf: *const pixel,
w: libc::c_int,
h: libc::c_int,
params: *const LooprestorationParams,
edges: LrEdgeFlags,
bitdepth_max: libc::c_int,
);
)*
}
};
}

#[cfg(all(
feature = "bitdepth_8",
feature = "asm",
any(target_arch = "x86", target_arch = "x86_64"),
))]
extern_fn! {
dav1d_wiener_filter7_8bpc_sse2,
dav1d_wiener_filter5_8bpc_sse2,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To help with @thedataking's concerns for using a macro for this (#332 (comment)), how about we make the macro something like decl_looprestoration_fn!(fn dav1d_wiener_filter7_8bpc_sse2);? This way it's

  • closer to the C version
  • declares one fn per macro call, so go-to-definition is easier to use
  • keeps the fn $name together so that it's more easily greppable

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think there's a big advantage to only declaring one fn per macro invocation, and I think it mostly increases visual noise. However, I do think doing fn $name is a good idea in order to make it easier to find the fn declarations by grepping 👍

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One-per-invocation does have some important advantages, though. When the macro invocation has an error, it's very difficult to tell from which part of it the error comes from (e.x. https://github.com/immunant/c2rust/blob/0021ec979663b96b265ebf4995b601220bfbac32/c2rust-analyze/src/known_fn.rs#L570-L629). If we have separate macro invocations, it's super clear which invocation causes the error.

Similarly, with go-to-definition, it generally will work much better if there are separate invocations, though I have noticed in some cases that rust-analyzer is smart enough to figure it out still (e.x. https://github.com/immunant/c2rust/blob/0021ec979663b96b265ebf4995b601220bfbac32/c2rust-analyze/tests/filecheck.rs#L33-L65).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Appreciate the effort to address my concerns. That said, what does the macro buy you? Code compactness and not much else. Code compactness takes a backseat to readability, debuggability, greppability, etc. IMHO.

I often grep for fn dav1d_wiener_filter7_8bpc_sse2 or some variation thereof, maybe fn dav1d_wiener_filter7_8bpc if I only want to see function definitions. Please (please), consider using macros sparingly. C uses macros and typedefs extensively as a way to paper over portability issues. Rust often provides portable types and APIs so the reasonable expectation is for Rust code to use macros less.

A good use of macros is when you have a "can't see the forrest for all the trees" situation where you can't see the overall pattern of the code for the sheer repetitiveness. I think we had situations like that in decode.rs where using macros helped. I just don't think we have the same situation here.

Thanks for hearing out my concerns. Now that I've voiced my concerns, I'll extricate myself from this discussion.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it seems like rust-analyzer is able to figure it out in this case, likely because the macro is so simple.

Another reason to keep the current version is that it makes it easier to group fns by their #[cfg] attribute. Doing one macro invocation per function means that the #[cfg] attributes would need to be duplicated a lot, which hurts readability.

I think what I have currently is cleaner than the proposed alternative, but if you feel strongly that we should go with a different approach then I can make the change.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another reason to keep the current version is that it makes it easier to group fns by their #[cfg] attribute. Doing one macro invocation per function means that the #[cfg] attributes would need to be duplicated a lot, which hurts readability.

I hadn't thought of that. That definitely helps a lot then. But the macro invocation per function could also be done by putting the extern "C" block outside of the macro invocations, where we could add the #[cfg]s.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you feel strongly that we should go with a different approach then I can make the change

I'm fine with what you have, keeping my LGTM. I wasn't factoring in that the macro helped you group fns by cfg. Also good to know that grepping and navigation is expected to work 👍

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Appreciate the effort to address my concerns. That said, what does the macro buy you? Code compactness and not much else. Code compactness takes a backseat to readability, debuggability, greppability, etc. IMHO.

I often grep for fn dav1d_wiener_filter7_8bpc_sse2 or some variation thereof, maybe fn dav1d_wiener_filter7_8bpc if I only want to see function definitions. Please (please), consider using macros sparingly. C uses macros and typedefs extensively as a way to paper over portability issues. Rust often provides portable types and APIs so the reasonable expectation is for Rust code to use macros less.

A good use of macros is when you have a "can't see the forrest for all the trees" situation where you can't see the overall pattern of the code for the sheer repetitiveness. I think we had situations like that in decode.rs where using macros helped. I just don't think we have the same situation here.

To me, the macro here is more readable because it lets you clearly see that every function has the same signature. That's obscured when you see a wall of code. There's not that much use in reading each function individually as they're all just variations of the same function. The macros we had in the devirtualization PRs were a lot more complex, but I think this one is very simple and makes it easier for me to read, and I think for @randomPoison, too.

src/looprestoration.rs Outdated Show resolved Hide resolved
@randomPoison randomPoison merged commit d6ec80e into main Jul 20, 2023
8 checks passed
@randomPoison randomPoison deleted the randompoison/looprestorationfilterfn-type-erased branch July 20, 2023 18:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants