struct Dav1dMCDSPContext: Dispatch through indirect fn ptrs inside of bitdepth-dependent fns #331

Closed

kkysen
Collaborator

@kkysen kkysen commented Jul 18, 2023

This may reduce the instruction count enough that performance is no worse than the purely fn ptr version, but we'll have to measure how much it helps (if at all).
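For readers unfamiliar with the pattern named in the title, here is a minimal sketch of calling through a stored fn ptr from inside a bitdepth-generic fn. It is not the code from this PR: `BitDepth`, `BitDepth8`, `BitDepth16`, `McDspSketch`, `avg_rust`, and `blend` are all hypothetical names chosen for illustration, and the averaging kernel is a stand-in for a real motion-compensation routine.

```rust
// Hypothetical sketch only; none of these names are taken from the PR.

/// Compile-time description of a pixel bit depth.
trait BitDepth {
    type Pixel: Copy;
    /// Rounded average of two pixels.
    fn avg(a: Self::Pixel, b: Self::Pixel) -> Self::Pixel;
}

struct BitDepth8;
struct BitDepth16;

impl BitDepth for BitDepth8 {
    type Pixel = u8;
    fn avg(a: u8, b: u8) -> u8 {
        ((a as u16 + b as u16 + 1) >> 1) as u8
    }
}

impl BitDepth for BitDepth16 {
    type Pixel = u16;
    fn avg(a: u16, b: u16) -> u16 {
        ((a as u32 + b as u32 + 1) >> 1) as u16
    }
}

/// Stand-in for a DSP context: a table of fn ptrs, here with one entry.
struct McDspSketch<BD: BitDepth> {
    avg: fn(&mut [BD::Pixel], &[BD::Pixel], &[BD::Pixel]),
}

/// Portable fallback kernel, monomorphized per bit depth.
fn avg_rust<BD: BitDepth>(dst: &mut [BD::Pixel], a: &[BD::Pixel], b: &[BD::Pixel]) {
    for ((d, &x), &y) in dst.iter_mut().zip(a).zip(b) {
        *d = BD::avg(x, y);
    }
}

impl<BD: BitDepth> McDspSketch<BD> {
    fn new() -> Self {
        // A real decoder might pick a SIMD kernel here based on CPU flags.
        Self { avg: avg_rust::<BD> }
    }
}

/// Bitdepth-dependent caller: the generic fn is monomorphized, but the
/// kernel itself is still reached through an indirect call.
fn blend<BD: BitDepth>(
    dsp: &McDspSketch<BD>,
    dst: &mut [BD::Pixel],
    a: &[BD::Pixel],
    b: &[BD::Pixel],
) {
    (dsp.avg)(dst, a, b);
}
```

Calling `blend(&McDspSketch::<BitDepth8>::new(), &mut dst, &a, &b)` resolves the bit depth at compile time while the kernel choice stays a runtime fn ptr.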

@kkysen kkysen requested a review from thedataking July 18, 2023 19:54
@kkysen
Collaborator Author

kkysen commented Jul 18, 2023

@thedataking, could you re-benchmark this one, too, on the AVX512 machine? Thanks!

@kkysen
Collaborator Author

kkysen commented Jul 18, 2023

It doesn't compile on arm yet, but this should be enough to benchmark it on x86_64 to see if it's worth it at all.

@thedataking
Collaborator

It looks like this change would regress performance.

main on amy:

 Performance counter stats for 'target/release/dav1d_main -i /home/perl/.phoronix-test-suite/installed-tests/pts/dav1d-1.13.0/summer_nature_1080p.ivf -o /dev/null' (10 runs):

        431,493.49 msec task-clock                       #  137.049 CPUs utilized               ( +-  9.58% )
         2,686,890      context-switches                 #   11.358 K/sec                       ( +-  9.56% )
           241,000      cpu-migrations                   #    1.019 K/sec                       ( +-  9.51% )
           229,047      page-faults                      #  968.242 /sec                        ( +-  9.53% )
 2,087,648,036,957      cycles                           #    8.825 GHz                         ( +-  9.57% )  (33.24%)
     4,227,888,146      stalled-cycles-frontend          #    0.37% frontend cycles idle        ( +-  9.51% )  (33.29%)
    95,775,550,882      stalled-cycles-backend           #    8.36% backend cycles idle         ( +-  9.58% )  (33.35%)
 3,280,933,862,233      instructions                     #    2.86  insn per cycle
                                                  #    0.02  stalled cycles per insn     ( +-  9.58% )  (33.46%)
   365,372,275,615      branches                         #    1.545 G/sec                       ( +-  9.58% )  (33.58%)
    10,750,889,079      branch-misses                    #    5.35% of all branches             ( +-  9.58% )  (33.66%)
 1,173,408,463,591      L1-dcache-loads                  #    4.960 G/sec                       ( +-  9.58% )  (33.67%)
    54,907,344,945      L1-dcache-load-misses            #    8.51% of all L1-dcache accesses   ( +-  9.57% )  (33.61%)
   <not supported>      LLC-loads
   <not supported>      LLC-load-misses
   182,350,420,327      L1-icache-loads                  #  770.844 M/sec                       ( +-  9.58% )  (33.51%)
     2,130,409,531      L1-icache-load-misses            #    2.12% of all L1-icache accesses   ( +-  9.56% )  (33.47%)
    11,047,434,742      dTLB-loads                       #   46.700 M/sec                       ( +-  9.54% )  (33.43%)
       699,776,941      dTLB-load-misses                 #   11.58% of all dTLB cache accesses  ( +-  9.55% )  (33.37%)
       586,365,696      iTLB-loads                       #    2.479 M/sec                       ( +-  9.02% )  (33.36%)
        34,390,790      iTLB-load-misses                 #    9.99% of all iTLB cache accesses  ( +-  9.43% )  (33.32%)
    13,044,351,246      L1-dcache-prefetches             #   55.142 M/sec                       ( +-  9.58% )  (33.27%)
   <not supported>      L1-dcache-prefetch-misses

            3.1485 +- 0.0103 seconds time elapsed  ( +-  0.33% )

kkysen/devirtualize-mc-indirect on amy:

 Performance counter stats for 'target/release/dav1d_mc_indirect -i /home/perl/.phoronix-test-suite/installed-tests/pts/dav1d-1.13.0/summer_nature_1080p.ivf -o /dev/null' (10 runs):

        462,105.07 msec task-clock                       #  135.885 CPUs utilized               ( +-  9.59% )
         2,530,994      context-switches                 #    9.988 K/sec                       ( +-  9.50% )
           268,008      cpu-migrations                   #    1.058 K/sec                       ( +-  9.84% )
           228,258      page-faults                      #  900.738 /sec                        ( +-  9.60% )
 2,235,047,462,245      cycles                           #    8.820 GHz                         ( +-  9.59% )  (33.23%)
     4,595,687,183      stalled-cycles-frontend          #    0.37% frontend cycles idle        ( +-  9.57% )  (33.31%)
   108,933,740,731      stalled-cycles-backend           #    8.89% backend cycles idle         ( +-  9.61% )  (33.31%)
 3,408,763,954,940      instructions                     #    2.78  insn per cycle
                                                  #    0.02  stalled cycles per insn     ( +-  9.57% )  (33.41%)
   379,986,315,022      branches                         #    1.499 G/sec                       ( +-  9.57% )  (33.51%)
    10,818,380,238      branch-misses                    #    5.18% of all branches             ( +-  9.57% )  (33.59%)
 1,223,546,578,914      L1-dcache-loads                  #    4.828 G/sec                       ( +-  9.57% )  (33.53%)
    64,870,772,256      L1-dcache-load-misses            #    9.64% of all L1-dcache accesses   ( +-  9.56% )  (33.52%)
   <not supported>      LLC-loads
   <not supported>      LLC-load-misses
   186,083,422,221      L1-icache-loads                  #  734.311 M/sec                       ( +-  9.57% )  (33.49%)
     2,394,506,643      L1-icache-load-misses            #    2.34% of all L1-icache accesses   ( +-  9.56% )  (33.51%)
    13,531,234,110      dTLB-loads                       #   53.396 M/sec                       ( +-  9.62% )  (33.47%)
       721,226,217      dTLB-load-misses                 #    9.73% of all dTLB cache accesses  ( +-  9.61% )  (33.45%)
       564,822,964      iTLB-loads                       #    2.229 M/sec                       ( +-  9.62% )  (33.45%)
        38,534,660      iTLB-load-misses                 #   12.54% of all iTLB cache accesses  ( +-  9.51% )  (33.38%)
    13,718,626,090      L1-dcache-prefetches             #   54.136 M/sec                       ( +-  9.58% )  (33.27%)
   <not supported>      L1-dcache-prefetch-misses

            3.4007 +- 0.0247 seconds time elapsed  ( +-  0.73% )

main on donna:

 Performance counter stats for 'target/release/dav1d_main -i /tmp/summer_nature_1080p.ivf -o /dev/null' (10 runs):

         72,656.29 msec task-clock                #   15.901 CPUs utilized            ( +-  0.38% )
           722,725      context-switches          #    9.837 K/sec                    ( +-  0.41% )
            19,175      cpu-migrations            #  260.997 /sec                     ( +-  1.08% )
            45,929      page-faults               #  625.155 /sec                     ( +-  0.35% )
   279,644,470,112      cycles                    #    3.806 GHz                      ( +-  0.70% )  (33.14%)
     1,885,631,074      stalled-cycles-frontend   #    0.66% frontend cycles idle     ( +-  0.18% )  (33.04%)
    25,263,784,676      stalled-cycles-backend    #    8.90% backend cycles idle      ( +-  1.54% )  (32.88%)
   361,993,846,839      instructions              #    1.28  insn per cycle
                                                  #    0.07  stalled cycles per insn  ( +-  0.07% )  (33.15%)
    38,210,726,676      branches                  #  520.099 M/sec                    ( +-  0.05% )  (33.22%)
     1,213,950,217      branch-misses             #    3.18% of all branches          ( +-  0.16% )  (33.44%)
   124,785,807,301      L1-dcache-loads           #    1.699 G/sec                    ( +-  0.04% )  (33.68%)
     5,584,477,464      L1-dcache-load-misses     #    4.47% of all L1-dcache accesses  ( +-  0.28% )  (33.83%)
   <not supported>      LLC-loads
   <not supported>      LLC-load-misses
    28,724,499,259      L1-icache-loads           #  390.979 M/sec                    ( +-  0.05% )  (33.76%)
       448,893,743      L1-icache-load-misses     #    1.56% of all L1-icache accesses  ( +-  0.24% )  (33.87%)
     1,771,108,036      dTLB-loads                #   24.107 M/sec                    ( +-  0.13% )  (33.71%)
       131,660,378      dTLB-load-misses          #    7.44% of all dTLB cache accesses  ( +-  0.16% )  (33.66%)
        67,381,885      iTLB-loads                #  917.157 K/sec                    ( +-  0.35% )  (33.61%)
         5,307,938      iTLB-load-misses          #    7.92% of all iTLB cache accesses  ( +-  1.06% )  (33.44%)
     1,830,368,661      L1-dcache-prefetches      #   24.914 M/sec                    ( +-  0.34% )  (33.24%)
   <not supported>      L1-dcache-prefetch-misses

            4.5693 +- 0.0201 seconds time elapsed  ( +-  0.44% )

kkysen/devirtualize-mc-indirect on donna:

 Performance counter stats for 'target/release/dav1d_mc_indirect -i /tmp/summer_nature_1080p.ivf -o /dev/null' (10 runs):

         77,115.91 msec task-clock                #   16.603 CPUs utilized            ( +-  0.34% )
           700,553      context-switches          #    9.306 K/sec                    ( +-  0.25% )
            17,129      cpu-migrations            #  227.538 /sec                     ( +-  1.36% )
            44,295      page-faults               #  588.406 /sec                     ( +-  0.42% )
   298,743,542,507      cycles                    #    3.968 GHz                      ( +-  0.47% )  (33.36%)
     1,899,534,123      stalled-cycles-frontend   #    0.66% frontend cycles idle     ( +-  0.20% )  (33.25%)
    28,351,565,762      stalled-cycles-backend    #    9.79% backend cycles idle      ( +-  0.93% )  (33.54%)
   373,279,062,170      instructions              #    1.29  insn per cycle
                                                  #    0.07  stalled cycles per insn  ( +-  0.06% )  (33.58%)
    39,490,583,291      branches                  #  524.585 M/sec                    ( +-  0.07% )  (33.56%)
     1,208,988,030      branch-misses             #    3.05% of all branches          ( +-  0.11% )  (33.65%)
   129,535,941,286      L1-dcache-loads           #    1.721 G/sec                    ( +-  0.08% )  (33.91%)
     6,374,806,377      L1-dcache-load-misses     #    4.91% of all L1-dcache accesses  ( +-  0.14% )  (33.49%)
   <not supported>      LLC-loads
   <not supported>      LLC-load-misses
    29,380,057,051      L1-icache-loads           #  390.279 M/sec                    ( +-  0.14% )  (33.67%)
       362,368,616      L1-icache-load-misses     #    1.24% of all L1-icache accesses  ( +-  0.33% )  (33.57%)
     2,061,205,631      dTLB-loads                #   27.381 M/sec                    ( +-  0.12% )  (33.34%)
       131,258,869      dTLB-load-misses          #    6.36% of all dTLB cache accesses  ( +-  0.22% )  (33.12%)
        49,283,601      iTLB-loads                #  654.674 K/sec                    ( +-  0.71% )  (33.33%)
         5,110,826      iTLB-load-misses          #   10.42% of all iTLB cache accesses  ( +-  1.12% )  (33.17%)
     1,912,415,593      L1-dcache-prefetches      #   25.404 M/sec                    ( +-  0.35% )  (33.28%)
   <not supported>      L1-dcache-prefetch-misses

            4.6448 +- 0.0160 seconds time elapsed  ( +-  0.34% )
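To put a number on the regression, the wall-clock times above can be compared directly. The sketch below only restates the four "seconds time elapsed" figures from the runs above; the helper names `percent_change` and `summarize` are made up for this example.

```rust
/// Relative change of `new` versus `base`, in percent.
/// Positive means `new` is larger (for elapsed time: slower).
fn percent_change(base: f64, new: f64) -> f64 {
    (new - base) / base * 100.0
}

/// (machine, main, kkysen/devirtualize-mc-indirect) wall-clock seconds,
/// copied from the "seconds time elapsed" lines of the perf output.
const ELAPSED_SECS: [(&str, f64, f64); 2] = [
    ("amy", 3.1485, 3.4007),
    ("donna", 4.5693, 4.6448),
];

/// One line per machine, e.g. "amy: +8.01%".
fn summarize() -> Vec<String> {
    ELAPSED_SECS
        .iter()
        .map(|(machine, base, new)| {
            format!("{machine}: {:+.2}%", percent_change(*base, *new))
        })
        .collect()
}
```

By this measure the branch is about 8.0% slower than main on amy and about 1.7% slower on donna. Both gaps are larger than the run-to-run variation reported for elapsed time (at most ±0.73%).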

@kkysen
Collaborator Author

kkysen commented Jul 19, 2023

Okay, it looks like this way isn't worth it either. I'll start working on a type-erased fn ptr version of #322.

@kkysen kkysen closed this Jul 19, 2023