
On a machine with many cores, snapshotting large repos is very CPU-intensive #4508

Open
mindajar opened this issue Sep 20, 2024 · 5 comments
Labels
🐛bug Something isn't working 🍎Mac

Comments

@mindajar

Description

On a machine with many cores, snapshotting large repos is very CPU-intensive

Steps to Reproduce the Problem

  1. Check out a very large repo (here, n = ~150,000 files) on a machine with many cores (here, n = 24)
  2. Run jj st (or just jj) and measure the time it takes for the command to complete.
  3. export RAYON_NUM_THREADS=4
  4. Repeat step 2

Expected Behavior

Similar performance in both cases.

Actual Behavior

jj's default behavior:
2.28 real 0.34 user 32.26 sys
jj limited to four threads:
1.13 real 0.16 user 0.49 sys

This one took some digging and profiling to figure out, as it didn't immediately make sense that the same working copy is so much faster to work with on a much smaller machine.

Specifications

  • Platform: macOS
  • Version: 14, 15
@arxanas
Collaborator

arxanas commented Sep 20, 2024

Interesting. Can't look now, but it's possible we serialize all tree updates into a single channel and it's contending when there are more threads. It would be in the working copy snapshot code somewhere.

  • One fix could be to batch the channel sends.
  • Another idea could be to use Rayon's parallel iterator bridge to consume the channel contents, which would do batching/work stealing (see the sketch after this list).
  • It might work to do an explicit map/reduce across worker threads, although I don't recall if Rayon easily supports this pattern.
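
For the parallel-bridge idea, here is a minimal sketch of what consuming the channel through Rayon could look like. This is illustrative only, not jj's actual snapshot code; the producer thread and the u64 payload are placeholders:

use rayon::iter::{ParallelBridge, ParallelIterator};
use std::sync::mpsc;

fn main() {
    let (tx, rx) = mpsc::channel();
    // Placeholder producer, standing in for the directory walker that
    // feeds tree updates into the channel.
    std::thread::spawn(move || {
        for i in 0..100_000u64 {
            tx.send(i).unwrap();
        }
    });
    // par_bridge() adapts the sequential Receiver into a parallel
    // iterator, so Rayon workers steal items in batches rather than
    // all contending on the channel for every single item.
    let total: u64 = rx.into_iter().par_bridge().sum();
    println!("processed {total}");
}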

To fix your immediate issue, you can also try enabling the Watchman fsmonitor.
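(If the config knob hasn't moved since this was written, that means setting core.fsmonitor = "watchman" in your jj config file.)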

@yuja
Collaborator

yuja commented Sep 20, 2024

I heard (IIRC when I was working on Mercurial) that it's sometimes faster to scan directory entries sequentially than to split the jobs across worker processes, since the latter tends to lead to random access. I don't know whether that's the case here, though.
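
To make the contrast concrete, a minimal sketch of the two strategies. These helpers are hypothetical (jj has its own walker); the point is only that the parallel version gives up the ordered access pattern:

use rayon::prelude::*;
use std::fs;
use std::path::PathBuf;

// Stat every path in directory order, as a sequential scan would.
fn stat_in_order(paths: &[PathBuf]) -> usize {
    paths.iter().filter(|p| fs::symlink_metadata(p).is_ok()).count()
}

// Fan the same stats out across Rayon workers; each worker steals a
// chunk, so the on-disk access pattern is no longer sequential.
fn stat_work_stealing(paths: &[PathBuf]) -> usize {
    paths.par_iter().filter(|p| fs::symlink_metadata(p).is_ok()).count()
}

fn main() {
    let paths: Vec<PathBuf> = fs::read_dir(".")
        .unwrap()
        .filter_map(Result::ok)
        .map(|entry| entry.path())
        .collect();
    println!("{} vs {}", stat_in_order(&paths), stat_work_stealing(&paths));
}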

@mindajar
Author

Not even a little bit urgent -- I was mostly bewildered at what I could possibly have broken on the big machine to make jj status peg all CPUs, and I couldn't stop poking at it until I figured it out :)

(watchman is currently broken in MacPorts, which is how I ended up here)

@PhilipMetzger PhilipMetzger added the 🐛bug Something isn't working label Sep 20, 2024
@thoughtpolice
Collaborator

thoughtpolice commented Sep 20, 2024

I can't reproduce this on my 32-core Zen 1 machine (Linux 6.10) with gecko-dev, which has ~1 million commits and 380k files in the working set. In fact it never gets slower, but it's nowhere close to a linear speedup; 32 cores is only ~2x faster than 4 cores (0.7s vs 1.20s). Would you be willing to try this with a repository like gecko-dev and report back? It would at least make baseline comparisons easier.

I suspect two things:

  • macOS. As we all know, it's an operating system that is different from Linux.
  • Hardware. You're probably using one of those Mac Studios? I'm guessing it's the 16P+8E configuration. Spreading the work across every core, P and E alike, probably has consequences. I don't remember the exact mechanism, but there are ways to force apps onto specific subsets of cores. Would you be willing to test a few combinations of RAYON_NUM_THREADS on different core sets, e.g. 16 P cores versus 8 P cores versus 8 E cores?

I don't have a Studio, but I do have an M2 Air, which coincidentally dual-boots Fedora. So if I get a chance I can see how it all shakes out on both systems, Linux vs macOS, but it's only 4P+4E, so I suspect the effect won't be as pronounced.

If it turns out that some other core configuration gives big improvements, we can probably adjust jj's scheduling policy before initializing Rayon so it sticks to the right settings; even a blunt-hammer patch would do to start.
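
As a sketch of what that blunt hammer could look like, assuming we cap the global pool before anything touches Rayon (the cap of 8 is a placeholder; picking the right number, and whether to prefer P cores, is exactly what these benchmarks should tell us):

use rayon::ThreadPoolBuilder;
use std::thread;

fn main() {
    // Hypothetical cap: min(available cores, 8). This must run before
    // any Rayon use, since the global pool is initialized only once.
    let cap = thread::available_parallelism()
        .map(|n| n.get())
        .unwrap_or(4)
        .min(8);
    ThreadPoolBuilder::new()
        .num_threads(cap)
        .build_global()
        .expect("global Rayon pool was already initialized");
    // ... rest of jj startup would follow.
}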


Note that I couldn't reliably clone gecko-dev from GitHub in one go due to network errors, so I had to make a depth-1 shallow clone and then unshallow it:

git clone https://github.com/mozilla/gecko-dev --depth 1
cd gecko-dev
git fetch --unshallow
jj git init --colocate

@mindajar
Author

Yes, this is a 16P+8E Mac Studio.

I noticed while testing this that OS caches seem to get evicted pretty quickly; after only a few seconds, a re-run is noticeably slower. I don't understand why, but I thought it was interesting.

I've not figured out how to control QoS to the degree you describe, but taskpolicy(8) offers some coarse-grained control. A (lightly edited for readability) transcript:

gecko-dev % jj version
jj 0.21.0-ac605d2e7bc71e462515f8c423fbc0437f18b363
gecko-dev % jj st
The working copy is clean
Working copy : snzmnpzp 24928d98 (empty) (no description set)
Parent commit: srvlssxw 50498861 master | Backed out changeset 58983adca2f1 (bug 1916328) for causing dt failures @ browser_parsable_css.js
gecko-dev % jj file list | wc -l
  373973
gecko-dev % echo $RAYON_NUM_THREADS

gecko-dev % hyperfine --warmup 3 'jj st'
Benchmark 1: jj st
  Time (mean ± σ):      4.099 s ±  0.517 s    [User: 2.834 s, System: 55.348 s]
  Range (min … max):    3.697 s …  5.237 s    10 runs

gecko-dev % hyperfine --warmup 3 'taskpolicy -c background jj st'
Benchmark 1: taskpolicy -c background jj st
  Time (mean ± σ):      6.803 s ±  0.418 s    [User: 5.987 s, System: 38.212 s]
  Range (min … max):    6.267 s …  7.599 s    10 runs

gecko-dev % hyperfine --warmup 3 'taskpolicy -c maintenance jj st'
Benchmark 1: taskpolicy -c maintenance jj st
  Time (mean ± σ):      6.938 s ±  0.431 s    [User: 6.578 s, System: 49.789 s]
  Range (min … max):    6.014 s …  7.399 s    10 runs

gecko-dev % hyperfine --warmup 3 'taskpolicy -c utility jj st'
Benchmark 1: taskpolicy -c utility jj st
  Time (mean ± σ):      4.249 s ±  0.371 s    [User: 2.839 s, System: 58.087 s]
  Range (min … max):    3.853 s …  5.065 s    10 runs

gecko-dev % export RAYON_NUM_THREADS=8
gecko-dev % hyperfine --warmup 3 'jj st'
Benchmark 1: jj st
  Time (mean ± σ):      2.341 s ±  0.018 s    [User: 1.710 s, System: 9.140 s]
  Range (min … max):    2.319 s …  2.376 s    10 runs

gecko-dev % hyperfine --warmup 3 'taskpolicy -c background jj st'
Benchmark 1: taskpolicy -c background jj st
  Time (mean ± σ):      6.951 s ±  0.447 s    [User: 5.700 s, System: 27.704 s]
  Range (min … max):    6.319 s …  7.838 s    10 runs

gecko-dev % hyperfine --warmup 3 'taskpolicy -c maintenance jj st'
Benchmark 1: taskpolicy -c maintenance jj st
  Time (mean ± σ):      7.003 s ±  0.786 s    [User: 5.561 s, System: 27.330 s]
  Range (min … max):    5.456 s …  8.334 s    10 runs

gecko-dev % hyperfine --warmup 3 'taskpolicy -c utility jj st'
Benchmark 1: taskpolicy -c utility jj st
  Time (mean ± σ):      2.567 s ±  0.110 s    [User: 1.731 s, System: 9.194 s]
  Range (min … max):    2.366 s …  2.692 s    10 runs

gecko-dev % export RAYON_NUM_THREADS=4
gecko-dev % hyperfine --warmup 3 'jj st'
Benchmark 1: jj st
  Time (mean ± σ):      3.232 s ±  0.279 s    [User: 1.427 s, System: 5.208 s]
  Range (min … max):    2.951 s …  3.898 s    10 runs

gecko-dev % hyperfine --warmup 3 'taskpolicy -c background jj st'
Benchmark 1: taskpolicy -c background jj st
  Time (mean ± σ):      9.691 s ±  0.729 s    [User: 5.024 s, System: 21.260 s]
  Range (min … max):    7.840 s … 10.256 s    10 runs

gecko-dev % hyperfine --warmup 3 'taskpolicy -c maintenance jj st'
Benchmark 1: taskpolicy -c maintenance jj st
  Time (mean ± σ):      9.670 s ±  0.735 s    [User: 4.990 s, System: 21.110 s]
  Range (min … max):    8.341 s … 10.393 s    10 runs

gecko-dev % hyperfine --warmup 3 'taskpolicy -c utility jj st'
Benchmark 1: taskpolicy -c utility jj st
  Time (mean ± σ):      3.784 s ±  0.211 s    [User: 1.476 s, System: 5.713 s]
  Range (min … max):    3.454 s …  4.170 s    10 runs
