abuse of 'inline fn' #5

Open
andrewrk opened this issue Feb 8, 2024 · 6 comments

andrewrk commented Feb 8, 2024

Hi, this is some exciting work.

I want to call attention to what the language reference has to say about inline functions:

It is generally better to let the compiler decide when to inline a function, except for these scenarios:

  • To change how many stack frames are in the call stack, for debugging purposes.
  • To force comptime-ness of the arguments to propagate to the return value of the function, as in the above example.
  • Real world performance measurements demand it.

Note that inline actually restricts what the compiler is allowed to do. This can harm binary size, compilation speed, and even runtime performance.
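
A minimal sketch of that second case, adapted from the example in the language reference (not code from this repository): because foo is inline and both arguments are comptime-known at the call site, the result is comptime-known as well, so the branch below is resolved at compile time and the @compileError is never analyzed. Remove inline and this becomes a compile error.

inline fn foo(a: i32, b: i32) i32 {
    return a + b;
}

test "inline propagates comptime-ness of arguments" {
    // The comptime-known result lets the compiler prove this branch is dead.
    if (foo(1200, 34) != 1234) {
        @compileError("bad");
    }
}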

I noticed that inline fn is used quite a bit, and I was wondering whether it is actually required for performance.

To test this, I took a random sample and removed the inline keyword:

diff --git a/src/Lookup.zig b/src/Lookup.zig
index c5c21d1..88b3e34 100644
--- a/src/Lookup.zig
+++ b/src/Lookup.zig
@@ -28,11 +28,11 @@ pub fn add(self: *Self, data: []const u8, pos: u16) u16 {
 
 // Retruns previous location with the same hash value given the current
 // position.
-pub inline fn prev(self: *Self, pos: u16) u16 {
+pub fn prev(self: *Self, pos: u16) u16 {
     return self.chain[pos];
 }
 
-inline fn set(self: *Self, h: u32, pos: u16) u16 {
+fn set(self: *Self, h: u32, pos: u16) u16 {
     const p = self.head[h];
     self.head[h] = pos;
     self.chain[pos] = p;
@@ -72,14 +72,14 @@ pub fn bulkAdd(self: *Self, data: []const u8, len: u16, pos: u16) void {
 }
 
 // Calculates hash of the first 4 bytes of `b`.
-inline fn hash(b: *const [4]u8) u32 {
+fn hash(b: *const [4]u8) u32 {
     return hashu(@as(u32, b[3]) |
         @as(u32, b[2]) << 8 |
         @as(u32, b[1]) << 16 |
         @as(u32, b[0]) << 24);
 }
 
-inline fn hashu(v: u32) u32 {
+fn hashu(v: u32) u32 {
     return @intCast((v *% prime4) >> consts.lookup.shift);
 }
 
diff --git a/src/bit_reader.zig b/src/bit_reader.zig
index 59b3387..bf24b48 100644
--- a/src/bit_reader.zig
+++ b/src/bit_reader.zig
@@ -41,7 +41,7 @@ pub fn BitReader(comptime ReaderType: type) type {
         // bits to decode. So `nice` is not hard limit, it will just try to have
         // that number of bits available. If end of forward stream is reached
         // it may be some extra zero bits in buffer.
-        pub inline fn fill(self: *Self, nice: u6) !void {
+        pub fn fill(self: *Self, nice: u6) !void {
             if (self.nbits >= nice) {
                 return; // We have enought bits
             }
@@ -87,17 +87,17 @@ pub fn BitReader(comptime ReaderType: type) type {
         };
 
         // Alias for readF(U, 0).
-        pub inline fn read(self: *Self, comptime U: type) !U {
+        pub fn read(self: *Self, comptime U: type) !U {
             return self.readF(U, 0);
         }
 
         // Alias for readF with flag.peak set.
-        pub inline fn peekF(self: *Self, comptime U: type, comptime how: u3) !U {
+        pub fn peekF(self: *Self, comptime U: type, comptime how: u3) !U {
             return self.readF(U, how | flag.peek);
         }
 
         // Read with flags provided.
-        pub inline fn readF(self: *Self, comptime U: type, comptime how: u3) !U {
+        pub fn readF(self: *Self, comptime U: type, comptime how: u3) !U {
             const n: u6 = @bitSizeOf(U);
             switch (how) {
                 0 => { // `normal` read
@@ -141,7 +141,7 @@ pub fn BitReader(comptime ReaderType: type) type {
 
         // Read n number of bits.
         // Only buffered flag can be used in how.
-        pub inline fn readN(self: *Self, n: u4, comptime how: u3) !u16 {
+        pub fn readN(self: *Self, n: u4, comptime how: u3) !u16 {
             switch (how) {
                 0 => {
                     try self.fill(n);
@@ -156,14 +156,14 @@ pub fn BitReader(comptime ReaderType: type) type {
         }
 
         // Advance buffer for n bits.
-        pub inline fn shift(self: *Self, n: u6) !void {
+        pub fn shift(self: *Self, n: u6) !void {
             if (n > self.nbits) return error.EndOfStream;
             self.bits >>= n;
             self.nbits -= n;
         }
 
         // Skip n bytes.
-        pub inline fn skipBytes(self: *Self, n: u16) !void {
+        pub fn skipBytes(self: *Self, n: u16) !void {
             for (0..n) |_| {
                 try self.fill(8);
                 try self.shift(8);
@@ -171,12 +171,12 @@ pub fn BitReader(comptime ReaderType: type) type {
         }
 
         // Number of bits to align stream to the byte boundary.
-        inline fn alignBits(self: *Self) u3 {
+        fn alignBits(self: *Self) u3 {
             return @intCast(self.nbits & 0x7);
         }
 
         // Align stream to the byte boundary.
-        pub inline fn alignToByte(self: *Self) void {
+        pub fn alignToByte(self: *Self) void {
             const ab = self.alignBits();
             if (ab > 0) self.shift(ab) catch unreachable;
         }
diff --git a/src/bit_writer.zig b/src/bit_writer.zig
index 1fc29d1..b5d84c7 100644
--- a/src/bit_writer.zig
+++ b/src/bit_writer.zig
@@ -60,7 +60,7 @@ pub fn BitWriter(comptime WriterType: type) type {
             self.nbytes = 0;
         }
 
-        pub inline fn writeBits(self: *Self, b: u32, nb: u32) Error!void {
+        pub fn writeBits(self: *Self, b: u32, nb: u32) Error!void {
             self.bits |= @as(u64, @intCast(b)) << @as(u6, @intCast(self.nbits));
             self.nbits += nb;
             if (self.nbits < 48)
diff --git a/src/block_writer.zig b/src/block_writer.zig
index 56b3f9c..2efc917 100644
--- a/src/block_writer.zig
+++ b/src/block_writer.zig
@@ -51,7 +51,7 @@ pub fn BlockWriter(comptime WriterType: type) type {
             self.bit_writer.setWriter(new_writer);
         }
 
-        inline fn writeCode(self: *Self, c: hc.HuffCode) Error!void {
+        fn writeCode(self: *Self, c: hc.HuffCode) Error!void {
             try self.bit_writer.writeBits(c.code, c.len);
         }
 
diff --git a/src/deflate.zig b/src/deflate.zig
index fffed2e..81348df 100644
--- a/src/deflate.zig
+++ b/src/deflate.zig
@@ -199,27 +199,27 @@ fn Deflate(comptime container: Container, comptime WriterType: type, comptime Bl
             }
         }
 
-        inline fn windowAdvance(self: *Self, step: u16, lh: []const u8, pos: u16) void {
+        fn windowAdvance(self: *Self, step: u16, lh: []const u8, pos: u16) void {
             // current position is already added in findMatch
             self.lookup.bulkAdd(lh[1..], step - 1, pos + 1);
             self.win.advance(step);
         }
 
         // Add previous literal (if any) to the tokens list.
-        inline fn addPrevLiteral(self: *Self) !void {
+        fn addPrevLiteral(self: *Self) !void {
             if (self.prev_literal) |l| try self.addToken(Token.initLiteral(l));
         }
 
         // Add match to the tokens list, reset prev pointers.
         // Returns length of the added match.
-        inline fn addMatch(self: *Self, m: Token) !u16 {
+        fn addMatch(self: *Self, m: Token) !u16 {
             try self.addToken(m);
             self.prev_literal = null;
             self.prev_match = null;
             return m.length();
         }
 
-        inline fn addToken(self: *Self, token: Token) !void {
+        fn addToken(self: *Self, token: Token) !void {
             self.tokens.add(token);
             if (self.tokens.full()) try self.flushTokens(false);
         }
@@ -274,7 +274,7 @@ fn Deflate(comptime container: Container, comptime WriterType: type, comptime Bl
         }
 
         // Slide win and if needed lookup tables.
-        inline fn slide(self: *Self) void {
+        fn slide(self: *Self) void {
             const n = self.win.slide();
             self.lookup.slide(n);
         }
diff --git a/src/huffman_decoder.zig b/src/huffman_decoder.zig
index 455da3b..90a0916 100644
--- a/src/huffman_decoder.zig
+++ b/src/huffman_decoder.zig
@@ -100,7 +100,7 @@ fn HuffmanDecoder(
         }
 
         /// Finds symbol for lookup table code.
-        pub inline fn find(self: *Self, code: u16) Symbol {
+        pub fn find(self: *Self, code: u16) Symbol {
             if (small_lookup_bits > 0) {
                 const code_s = code >> small_lookup_shift;
                 const sym = self.lookup_s[code_s];
diff --git a/src/huffman_encoder.zig b/src/huffman_encoder.zig
index 67d01e3..f4feb26 100644
--- a/src/huffman_encoder.zig
+++ b/src/huffman_encoder.zig
@@ -453,7 +453,7 @@ test "generate a Huffman code for the 30 possible relative distances (LZ77 dista
 }
 
 // Reverse bit-by-bit a N-bit code.
-inline fn bitReverse(comptime T: type, value: T, n: usize) T {
+fn bitReverse(comptime T: type, value: T, n: usize) T {
     const r = @bitReverse(value);
     return r >> @as(math.Log2Int(T), @intCast(@typeInfo(T).Int.bits - n));
 }

Then I observed a significant performance improvement in inflate_bench (ReleaseFast):

Benchmark 1 (146 runs): ./zig-out-old/bin/inflate_bench
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           411ms ± 7.91ms     388ms …  425ms          4 ( 3%)        0%
  peak_rss           25.3MB ± 2.53KB    25.2MB … 25.3MB         56 (38%)        0%
  cpu_cycles         1.75G  ± 7.31M     1.75G  … 1.81G           6 ( 4%)        0%
  instructions       3.05G  ± 27.5      3.05G  … 3.05G           0 ( 0%)        0%
  cache_references   1.62M  ±  748K     1.13M  … 5.66M          15 (10%)        0%
  cache_misses        629K  ± 3.95K      621K  …  643K           1 ( 1%)        0%
  branch_misses      25.1M  ± 19.8K     25.0M  … 25.2M          11 ( 8%)        0%
Benchmark 2 (151 runs): ./zig-out/bin/inflate_bench
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           396ms ± 4.20ms     389ms …  405ms          0 ( 0%)        ⚡-  3.6% ±  0.3%
  peak_rss           25.3MB ± 2.59KB    25.2MB … 25.3MB          0 ( 0%)          -  0.0% ±  0.0%
  cpu_cycles         1.70G  ± 1.58M     1.70G  … 1.70G          10 ( 7%)        ⚡-  3.2% ±  0.1%
  instructions       3.17G  ± 30.0      3.17G  … 3.17G           0 ( 0%)        💩+  3.8% ±  0.0%
  cache_references   1.62M  ±  847K     1.15M  … 6.66M          18 (12%)          -  0.5% ± 11.2%
  cache_misses        630K  ± 4.98K      622K  …  652K           7 ( 5%)          +  0.1% ±  0.2%
  branch_misses      24.9M  ± 10.0K     24.8M  … 24.9M          13 ( 9%)          -  1.0% ±  0.0%

It also caused the binary size to shrink by 28K.

For deflate, I observed no significant difference in wall time, and the binary shrank by only 5K:

Benchmark 1 (31 runs): ./zig-out-old/bin/deflate_bench
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          1.95s  ± 47.7ms    1.78s  … 2.03s           1 ( 3%)        0%
  peak_rss           1.11MB ± 3.13KB    1.11MB … 1.11MB          0 ( 0%)        0%
  cpu_cycles         8.05G  ± 91.5M     7.92G  … 8.36G           1 ( 3%)        0%
  instructions       13.6G  ± 60.8      13.6G  … 13.6G           1 ( 3%)        0%
  cache_references    120M  ± 45.9M     65.4M  …  323M           1 ( 3%)        0%
  cache_misses        191K  ±  266K     6.73K  … 1.09M           4 (13%)        0%
  branch_misses       107M  ± 22.9K      107M  …  107M           3 (10%)        0%
Benchmark 2 (32 runs): ./zig-out/bin/deflate_bench
  measurement          mean ± σ            min … max           outliers         delta
  wall_time          1.94s  ± 44.6ms    1.88s  … 2.10s           1 ( 3%)          -  0.5% ±  1.2%
  peak_rss           1.11MB ± 3.57KB    1.10MB … 1.11MB          0 ( 0%)          -  0.4% ±  0.2%
  cpu_cycles         8.26G  ± 87.1M     8.07G  … 8.40G           0 ( 0%)        💩+  2.6% ±  0.6%
  instructions       13.7G  ± 77.8      13.7G  … 13.7G           0 ( 0%)        💩+  1.2% ±  0.0%
  cache_references    134M  ± 33.1M     72.2M  …  199M           0 ( 0%)          + 11.8% ± 16.8%
  cache_misses        103K  ±  194K     9.24K  …  948K           5 (16%)          - 46.3% ± 61.2%
  branch_misses       105M  ± 19.0K      105M  …  105M           0 ( 0%)        ⚡-  1.5% ±  0.0%

In conclusion, my suggestion is to delete inline from all the functions, and then do some more performance testing and determine where it needs to be added back. Probably only 1 or 2 places.
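
As a side note, for any call site that measurements do single out as hot, Zig also allows forcing inlining per call with @call and the .always_inline modifier, rather than on the declaration. A rough standalone sketch with hypothetical names, not part of the patch above:

const std = @import("std");

// Hypothetical example: hashu stays a plain function, and only the call site
// that profiling shows to be hot forces inlining via the .always_inline
// call modifier; other callers keep a normal call.
fn hashu(v: u32) u32 {
    return (v *% 2654435761) >> 17;
}

pub fn main() void {
    var sum: u32 = 0;
    var i: u32 = 0;
    while (i < 1000) : (i += 1) {
        sum +%= @call(.always_inline, hashu, .{i});
    }
    std.debug.print("sum: {d}\n", .{sum});
}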

ianic commented Feb 8, 2024

Thanks Andrew. Great comment.
I totally agree that I spread those inlines through the code without sound criteria.

I created a no_inline branch where I removed all of them, but I couldn't reproduce your result (is there a chance that you switched which build is new and which is old in the inflate_bench results?).

I got this for inflate_bench. The first is the code from the main branch and the second is with all inlines removed.

 poop -d 60000 'zig-out-main/bin/inflate_bench' 'zig-out/bin/inflate_bench'
Benchmark 1 (117 runs): zig-out-main/bin/inflate_bench
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           516ms ± 8.15ms     512ms …  593ms         18 (15%)        0%
  peak_rss           24.9MB ±    0      24.9MB … 24.9MB          0 ( 0%)        0%
  cpu_cycles         1.83G  ± 23.3M     1.83G  … 2.08G          12 (10%)        0%
  instructions       3.22G  ± 8.38      3.22G  … 3.22G           1 ( 1%)        0%
  cache_references    116K  ± 50.5K     93.3K  …  622K           8 ( 7%)        0%
  cache_misses       58.4K  ± 2.22K     52.2K  … 77.6K           6 ( 5%)        0%
  branch_misses      25.6M  ± 26.1K     25.5M  … 25.8M           4 ( 3%)        0%
Benchmark 2 (112 runs): zig-out/bin/inflate_bench
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           539ms ± 3.46ms     536ms …  549ms         16 (14%)        💩+  4.5% ±  0.3%
  peak_rss           24.9MB ±    0      24.9MB … 24.9MB          0 ( 0%)          -  0.0% ±  0.0%
  cpu_cycles         1.91G  ± 1.77M     1.91G  … 1.93G          10 ( 9%)        💩+  4.6% ±  0.2%
  instructions       3.34G  ± 10.4      3.34G  … 3.34G           1 ( 1%)        💩+  3.8% ±  0.0%
  cache_references    113K  ± 10.8K     96.0K  …  151K           3 ( 3%)          -  3.0% ±  8.2%
  cache_misses       58.6K  ± 1.63K     51.6K  … 61.0K           5 ( 4%)          +  0.4% ±  0.9%
  branch_misses      25.4M  ± 8.93K     25.3M  … 25.4M           1 ( 1%)          -  0.8% ±  0.0%

For deflate, the differences were very small.

I then started with the original inflate code and removed the inlines, except for the few that I found to have some impact; those I kept. Here are the results with those 'essential' inlines:

poop -d 60000 'zig-out-main/bin/inflate_bench' 'zig-out/bin/inflate_bench'
Benchmark 1 (116 runs): zig-out-main/bin/inflate_bench
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           518ms ± 4.82ms     513ms …  526ms          0 ( 0%)        0%
  peak_rss           24.9MB ±    0      24.9MB … 24.9MB          0 ( 0%)        0%
  cpu_cycles         1.83G  ± 1.14M     1.83G  … 1.83G           8 ( 7%)        0%
  instructions       3.22G  ± 5.51      3.22G  … 3.22G           3 ( 3%)        0%
  cache_references    103K  ± 9.96K     85.4K  …  136K           4 ( 3%)        0%
  cache_misses       58.9K  ±  989      55.4K  … 60.6K           5 ( 4%)        0%
  branch_misses      25.6M  ± 12.2K     25.6M  … 25.6M           7 ( 6%)        0%
Benchmark 2 (118 runs): zig-out/bin/inflate_bench
  measurement          mean ± σ            min … max           outliers         delta
  wall_time           512ms ± 4.29ms     509ms …  538ms         15 (13%)          -  1.1% ±  0.2%
  peak_rss           24.9MB ±    0      24.9MB … 24.9MB          0 ( 0%)          -  0.0% ±  0.0%
  cpu_cycles         1.82G  ± 1.63M     1.81G  … 1.82G           2 ( 2%)          -  0.7% ±  0.0%
  instructions       3.28G  ± 8.98      3.28G  … 3.28G           1 ( 1%)        💩+  1.7% ±  0.0%
  cache_references    112K  ± 15.8K     91.0K  …  183K           5 ( 4%)        💩+  8.9% ±  3.3%
  cache_misses       58.6K  ± 1.44K     53.2K  … 61.0K           4 ( 3%)          -  0.5% ±  0.5%
  branch_misses      25.5M  ± 41.4K     25.5M  … 25.6M           0 ( 0%)          -  0.1% ±  0.0%

andrewrk commented Feb 8, 2024

The first thing I did was remove all of them, and I observed similar results to yours. Then I reverted a handful of files and tried again, and got the results I shared above. The diff I provided shows which functions in particular I removed inline from. I confirmed just now that I did not accidentally swap old and new.

To reiterate, the diff I provided above does not remove all inline functions. If you want to try the diff, copy it, open a terminal, cd to flate, run patch -p1, press enter, paste the diff, and press Ctrl+D.

ianic commented Feb 9, 2024

Sorry, it seems we were working on different commits.

When I apply the patch to the main HEAD (71eb86a) I don't get better results; they are ~4% worse.

But when I apply it to HEAD-1 (1e7a9b7) I get a 3-4% better result.

andrewrk commented Feb 9, 2024

This may be due to random changes in code layout causing big differences in performance due to locality. This is a known problem with benchmarking like this, and there is no simple way to avoid it.

Perhaps it would make sense to collect measurements only after running the binary through BOLT.

ianic commented Feb 9, 2024

When testing whether an inline has an effect or not, should I compare ReleaseSafe or ReleaseFast builds?

I see some cases where a single inline has a huge impact in ReleaseSafe but no impact in ReleaseFast. By adding that inline I'm somehow turning ReleaseSafe into ReleaseFast.

andrewrk commented Feb 9, 2024

ReleaseFast for sure. ReleaseSafe is appropriate for code with high churn, not a fuzz-tested library of unchanging code.
