Git speedups for large repos #207

gjost · 2022-01-24T19:41:39Z

Spend no more than 2 days on this.

The ddr-densho-1000 repo is really huge, and this causes usability problems even when the repo is checked out locally. In particular, git status takes a very long time to run.
Repo has tons of files and also a long history (~4000 commits).

IDEA cp ddr-densho-1000 ddr-densho-1000new, remove .git/, git init
where does the slowness come from?
TODO research git performance (num objects, size, repo age)
TODO can we set git caching interval?
TODO profile git operations
does not correlate with number of objects or physical size of repo
seems to be length of commit history
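
A rough sketch for the "profile git operations" TODO, assuming a POSIX shell and a reasonably recent git. The function name repo_stats is made up for illustration. It prints the history length, the tracked-file count and the object counts, then runs git status with GIT_TRACE_PERFORMANCE=1 so git writes its own timings to stderr.

```shell
# Hypothetical helper (not part of any existing tooling).
# Usage: repo_stats /path/to/repo
repo_stats() {
    repo="${1:-.}"
    # History length: number of commits reachable from HEAD.
    echo "commits: $(git -C "$repo" rev-list --count HEAD)"
    # Number of files tracked in the index.
    echo "tracked files: $(git -C "$repo" ls-files | wc -l | tr -d ' ')"
    # Loose and packed object counts plus on-disk size (in KiB).
    git -C "$repo" count-objects -v
    # git prints performance trace lines for this command to stderr.
    GIT_TRACE_PERFORMANCE=1 git -C "$repo" status >/dev/null
}
```

Running this against both the original repo and the re-initialized copy from the IDEA above would give comparable numbers for the two.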

Ways to improve git status performance (2012)
https://stackoverflow.com/questions/4994772/ways-to-improve-git-status-performance
10 GB repo on NFS on Linux. First time git status ~36min, subsequent 8min

Slow Git Performance (2021)
https://support.purestorage.com/Knowledge_Base/FlashBlade_KB/Slow_Git_Performance

OPTIONS

Shallow clone
git clone --depth=50 --no-single-branch COLLECTION

Sparse checkout
https://github.blog/2020-01-17-bring-your-monorepo-down-to-size-with-sparse-checkout/
git clone COLLECTION
git sparse-checkout init --cone
git sparse-checkout set ...
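
The three commands above could be wrapped like this. This is only a sketch: sparse_clone is a made-up name, and the directory names passed to it are placeholders, not real paths in the collection repo. It needs git 2.25 or later for the sparse-checkout command.

```shell
# Hypothetical wrapper around the commands above.
# Usage: sparse_clone URL DEST DIR [DIR...]
sparse_clone() {
    url="$1"; dest="$2"; shift 2
    git clone -q "$url" "$dest"
    # Cone mode restricts patterns to whole directories, which keeps matching cheap.
    git -C "$dest" sparse-checkout init --cone
    # Keep only the listed directories (plus files at the repo root) in the working tree.
    git -C "$dest" sparse-checkout set "$@"
}
```

Example with placeholder names: sparse_clone COLLECTION ./collection-sparse dir_a dir_b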

Partial clones
https://github.blog/2020-12-21-get-up-to-speed-with-partial-clone-and-shallow-clone/
Blobless clones: git clone --filter=blob:none
Treeless clones: git clone --filter=tree:0

TODO test shallow, sparse clones
TODO test on Dana's machine
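
One possible way to run the comparison in the TODOs, as a sketch. compare_clones is a hypothetical name, and the timing uses whole seconds from date, which is coarse but enough if git status takes minutes. It clones the same URL as a full, shallow and blobless clone and reports the commit count and git status time for each.

```shell
# Hypothetical helper: clone URL three ways under WORKDIR and time git status in each.
# Usage: compare_clones URL WORKDIR [DEPTH]
compare_clones() {
    url="$1"; workdir="$2"; depth="${3:-50}"
    git clone -q "$url" "$workdir/full"
    git clone -q --depth="$depth" --no-single-branch "$url" "$workdir/shallow"
    git clone -q --filter=blob:none "$url" "$workdir/blobless"
    for variant in full shallow blobless; do
        # Commits visible from HEAD in this clone.
        commits=$(git -C "$workdir/$variant" rev-list --count HEAD)
        start=$(date +%s)
        git -C "$workdir/$variant" status >/dev/null
        end=$(date +%s)
        echo "$variant: $commits commits, git status took $((end - start))s"
    done
}
```

Note that git ignores --depth when cloning from a plain local path, so a local test needs a file:// URL. The blobless filter only takes effect if the server allows it; otherwise git warns and falls back to a full clone.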


gjost · 2022-06-17T15:28:11Z

We Put Half a Million files in One git Repository, Here’s What We Learned
https://canvatechblog.com/we-put-half-a-million-files-in-one-git-repository-heres-what-we-learned-ec734a764181
To reduce the amount of work git needs to do to find changes, we used the fsmonitor hook with Watchman so we capture changes as they happen instead of having to scan all files in the repository every time a command is run.
We also enabled feature.manyFiles, which under the hood enables the untracked cache to skip directories and files that haven’t been modified.
Git also has a built-in command (maintenance) to optimize a repository’s data, speeding up commands and reducing disk space. This isn’t enabled by default, so we register it with a schedule for daily and hourly routines.
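
A sketch of what those settings might look like for one of our repos. The function names are invented here, and the fsmonitor part is left as a comment because it depends on Watchman being installed and on the git version.

```shell
# Hypothetical helper: apply the many-files settings to one repo.
# Usage: tune_repo /path/to/repo
tune_repo() {
    repo="${1:-.}"
    # feature.manyFiles switches on defaults for big working trees,
    # including index version 4 and the untracked cache.
    git -C "$repo" config feature.manyFiles true
    # fsmonitor is not set here. Assumption: with Watchman installed, core.fsmonitor
    # can point at a hook script such as the fsmonitor-watchman sample that git
    # ships in .git/hooks/. Check the git-config docs for your git version.
}

# Hypothetical helper: register the repo for scheduled background maintenance.
# This edits the user's global git config and the system scheduler.
schedule_maintenance() {
    repo="${1:-.}"
    git -C "$repo" maintenance start
}
```

tune_repo only touches the repo's own config, so it is easy to undo. schedule_maintenance is kept separate because it changes state outside the repo.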
Sparse checkout
If an engineer can tell us what they usually work on, we can craft a checkout pattern that includes all the required dependencies to run and test their code locally while keeping the checkout as small as possible.
Sparse checkout drawbacks:

Tracked files not physically populated on disk can’t be searched through or interacted with. Accidental changes or an erroneous merge conflict might leave these files in a bad state.
Overhead to every git checkout to check if updated file should be populated or ignored. This overhead is small with simple patterns but becomes significant with more complex ones.

https://news.ycombinator.com/item?id=31762245
Interesting

The Case Against Monorepos (Infoworld)

Trunk-Based Development: Monorepos (https://trunkbaseddevelopment.com/monorepos)
monorepo.tools - Everything you need to know about monorepos, and the tools to build them (https://monorepo.tools)

gjost added the question label Jan 24, 2022

gjost self-assigned this Jan 24, 2022

gjost added the WORKING label Jun 17, 2022

gjost removed the WORKING label Mar 17, 2023
