
Moving line "hash_vec.push(h)" slows code 1.6x #113504

Closed
safinaskar opened this issue Jul 9, 2023 · 4 comments
Labels
C-bug Category: This is a bug. I-slow Issue: Problems and improvements with respect to performance of generated code. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue.

Comments

@safinaskar
Contributor

Merely moving the line hash_vec.push(h) from one place to another slows the code down 1.6x. This is very unfortunate, because it prevents me from converting my sequential code to rayon (I will explain why). All of these problems also appeared in a real production code base, which I will explain as well.

So, here are two code examples, which I will call UPLMUT (for lack of a better name):

// UPLMUT slow
pub fn main() {
    let mut hash_vec = vec![];
    for n in 0.. {
        let buf = vec![0u8; 4194304];
        let h;
        if let Some(item) = (if n < 100 { Some(buf) } else { None }) {
            for _ in 0..5 {
                std::hint::black_box(item.clone());
            }
            h = vec![0u8, 0u8];
        } else {
            break;
        }
        hash_vec.push(h);
    }
    std::hint::black_box(hash_vec);
}
// UPLMUT fast
pub fn main() {
    let mut hash_vec = vec![];
    for n in 0.. {
        let buf = vec![0u8; 4194304];
        let h;
        if let Some(item) = (if n < 100 { Some(buf) } else { None }) {
            for _ in 0..5 {
                std::hint::black_box(item.clone());
            }
            h = vec![0u8, 0u8];
            hash_vec.push(h);
        } else {
            break;
        }
    }
    std::hint::black_box(hash_vec);
}

The only difference is the placement of hash_vec.push(h), yet "UPLMUT slow" is 1.6 times slower than "UPLMUT fast". Here are the results of running "UPLMUT slow" 30 times and then "UPLMUT fast" 30 times (using time -p):

(all times in seconds, summarized over 30 runs each)

UPLMUT slow: real 0.24-0.29 (typically 0.28), user 0.14-0.21, sys 0.08-0.12
UPLMUT fast: real 0.16-0.22 (typically 0.17), user 0.15-0.21, sys 0.00

The test system is Debian Sid Linux x86_64 with rustc 1.72.0-nightly (f7ca9df69 2023-06-24), running in a Docker container on Linux 5.10.0.

Here is a godbolt link with the assembly diff: https://godbolt.org/z/orT9oPcsr .

You may say something like this: "Well, you just call malloc and free in a different order, so it's no surprise the performance differs." No! Look carefully at this code: the two sources perform the same algorithm, i.e. a mere optimization pass should be able to transform one into the other. So I'm nearly sure the bug is in rustc, not in malloc or anywhere else.

You may say: "This is an artificial example. Besides, if hash_vec.push(h) changes performance, just put it in the right place and everything will be fast." No! This appeared in my real code, as I will explain.

Here is a version which is slightly closer to my real code. I will call it YLAURR.

// YLAURR slow
fn main() {
    let mut hash_vec = vec![];
    (0..100).map(|_|vec![0u8; 4194304]).map(|item|{
            for _ in 0..5 {
                std::hint::black_box(item.clone());
            }
            vec![0u8, 0u8]
    }).for_each(|h|hash_vec.push(h));
    std::hint::black_box(hash_vec);
}
// YLAURR fast
fn main() {
    let mut hash_vec = vec![];
    (0..100).map(|_|vec![0u8; 4194304]).for_each(|item|{
            for _ in 0..5 {
                std::hint::black_box(item.clone());
            }
            hash_vec.push(vec![0u8, 0u8]);
    });
    std::hint::black_box(hash_vec);
}

In this case, splitting for_each(...) into map(...).for_each(...) makes the code significantly slower. You may again say that this is an artificial example, or say "just don't split for_each into map(...).for_each(...)".

So, again, here is an example even closer to my real code. I will call it WWUUIJ.

# Cargo.toml
[package]
name = "my-chunker"
version = "0.1.0"
edition = "2021"

[dependencies]
blake3 = "1.4.0"
hex = "0.4.3"
libc = "0.2.147"
rayon = "1.7.0"
zstd = "0.12.3"
// fast
use std::io::Read;

fn main() {
        let chunk_size: usize = 4194304;

        let mut hash_vec = vec![];
        let mut input = std::io::stdin().lock();
        std::iter::from_fn(|| {
            let mut buf = vec![0u8; chunk_size];
            if let Ok(()) = input.read_exact(&mut buf) {
                Some(buf)
            } else {
                None
            }
        }).for_each(|chunk| {
                        let raw_hash = blake3::hash(&chunk);
                        std::hint::black_box(zstd::bulk::compress(&chunk, -22).unwrap());
                        let x: Vec<u8> = raw_hash.as_bytes().to_vec();
                        hash_vec.push(x);
        });
        std::hint::black_box(hash_vec);
}
// slow
use std::io::Read;

fn main() {
        let chunk_size: usize = 4194304;

        let mut hash_vec = vec![];
        let mut input = std::io::stdin().lock();
        std::iter::from_fn(|| {
            let mut buf = vec![0u8; chunk_size];
            if let Ok(()) = input.read_exact(&mut buf) {
                Some(buf)
            } else {
                None
            }
        }).map(|chunk| {
                        let raw_hash = blake3::hash(&chunk);
                        std::hint::black_box(zstd::bulk::compress(&chunk, -22).unwrap());
                        let x: Vec<u8> = raw_hash.as_bytes().to_vec();
                        x
        }).for_each(|h| {
            hash_vec.push(h);
        });
        std::hint::black_box(hash_vec);
}

Pipe 2 GiB of /dev/urandom to the program's stdin while testing.

If the example still looks artificial, okay, here is the real story behind all this. I'm trying to write my own replacement for borg and casync. I wrote about it here: borgbackup/borg#7674 (comment) . My program supports two operations: "add" and "extract". "add" splits the input file into chunks and then, for every chunk, compresses it, writes it to its own "chunk file", and adds its hash to a (single) "index file". So the program's algorithm can be described with this pseudocode:

// Pseudo-code
let iter = /* Iterator, which produces chunks */
iter.for_each(|chunk| {
  // Compress chunk
  // Write compressed data to (its own) "chunk file"
  // Add hash to (single) "index file"
});

Now I decided to parallelize the program using rayon. As a first step, I split the above closure into two parts: one parallelizable and one sequential:

// Pseudo-code
let iter = /* Iterator, which produces chunks */
iter.map(|chunk| {
  // Compress chunk
  // Write compressed data to (its own) "chunk file"
  let hash = /* ... */
  return hash;
}).for_each(|h|{
  // Add hash to (single) "index file"
});

Unfortunately, this split surprisingly made the program slower! So the first step of converting the program to rayon (to make it faster!) made it slower instead. This is very bad. I started to investigate why, and reduced the problem to WWUUIJ, then to YLAURR, and finally to UPLMUT. So I think the core problem is in rustc, and I hope I have convinced you that this bug is entirely realistic.

@safinaskar safinaskar added the C-bug Category: This is a bug. label Jul 9, 2023
@rustbot rustbot added the needs-triage This issue may need triage. Remove it if it has been sufficiently triaged. label Jul 9, 2023
@safinaskar
Contributor Author

@rustbot label +I-slow +T-compiler +WG-codegen +A-codegen

@rustbot rustbot added A-codegen Area: Code generation I-slow Issue: Problems and improvements with respect to performance of generated code. T-compiler Relevant to the compiler team, which will review and decide on the PR/issue. WG-llvm Working group: LLVM backend code generation labels Jul 9, 2023
@the8472
Member

the8472 commented Jul 9, 2023

I doubt that the rather minor assembly changes could result in such a huge difference. Godbolt might not be representative in this case, maybe due to a difference in optimization flags.

The first example spends time in the kernel while the second one doesn't. That could indicate that calls to the allocator were optimized out in one case (making this a benchmarking artifact) and not in the other.

Try running it under a profiler or objdumping the assembly and diffing those.

@saethlin
Member

saethlin commented Jul 9, 2023

I reduced the first example to this:

pub fn main() {
    let mut hash_vec = vec![];
    for _ in 0..100 {
        let buf = vec![0u8; 4194304];
        for _ in 0..5 {
            std::hint::black_box(buf.clone());
        }
        let h = vec![0u8, 0u8];
        drop(buf);
        hash_vec.push(h);
    }
    std::hint::black_box(hash_vec);
}

Moving the drop(buf) to after the push produces perf stat output that matches the fast version. The difference in drop order is hard to see in the original code because buf is moved into the if let, so its drop location is not easy to see in relation to the push.

Based on profiling, the difference seems attributable to a 44x (!!) difference in the number of page faults between the two implementations. If I swap in jemalloc or mimalloc, the difference in runtime and page faults goes away. So I strongly suspect that this code is generating some worst-case behavior in glibc's allocator.

If you have an allocation-intensive workload it is always good advice to use a non-default allocator. This effect is surprisingly dramatic, but I don't think there's anything for us to do here.

@saethlin saethlin removed needs-triage This issue may need triage. Remove it if it has been sufficiently triaged. A-codegen Area: Code generation WG-llvm Working group: LLVM backend code generation labels Jul 9, 2023
@safinaskar
Contributor Author

I added Vec::with_capacity to my actual code and now everything is fast. So the problem is solved for me!

Also, I redirected the bug to glibc (I rewrote @saethlin's code in C): https://sourceware.org/bugzilla/show_bug.cgi?id=30625 , so I'm closing it here.
