Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enumerate GitHub wikis concurrently #2371

Closed
rgmz opened this issue Feb 3, 2024 · 4 comments
Closed

Enumerate GitHub wikis concurrently #2371

rgmz opened this issue Feb 3, 2024 · 4 comments

Comments

@rgmz
Copy link
Contributor

rgmz commented Feb 3, 2024

Please review the Community Note before submitting

Description

PR #2233 added the ability to scan GitHub wikis by default, as wikis are just repos and scanning them has no rate limit. Regrettably, I overlooked how dispatching a request per repo can increase startup time. This does not have a noticeable effect when scanning repos or smaller orgst, however, it does have a noticeable impact for larger orgs.

if s.conn.GetIncludeWikis() && s.hasWiki(ctx, r, repoURL) {
s.reposWithWikis[repoURL] = struct{}{}
}

Without this check, enumeration takes ~3s per page (I have terrible Internet):

...
2024-02-03T10:52:55-05:00       info-1  trufflehog      Enumerating with token  {"source_manager_worker_id": "cKjLf", "endpoint": "https://api.github.com"}
2024-02-03T10:52:58-05:00       info-2  trufflehog      Listed repos    {"page": 0, "last_page": 61}
2024-02-03T10:53:01-05:00       info-2  trufflehog      Listed repos    {"page": 2, "last_page": 61}
2024-02-03T10:53:04-05:00       info-2  trufflehog      Listed repos    {"page": 3, "last_page": 61}

With this check, enumeration takes closer to ~15s per page:

...
2024-02-03T10:57:03-05:00       info-1  trufflehog      Enumerating with token  {"source_manager_worker_id": "gZtkz", "endpoint": "https://api.github.com"}
2024-02-03T10:57:06-05:00       info-2  trufflehog      Listed repos    {"page": 0, "last_page": 61}
2024-02-03T10:57:20-05:00       info-2  trufflehog      Listed repos    {"page": 2, "last_page": 61}
2024-02-03T10:57:33-05:00       info-2  trufflehog      Listed repos    {"page": 3, "last_page": 61}

Preferred Solution

Two potential solutions come to mind:

  1. Run this check in a Go routine so that it does not block enumeration
  2. Run this check when each repo is being scanned, not during initial enumeration.

if s.conn.IncludeWikis {
if _, ok := s.reposWithWikis[repoURL]; ok {
wikiURL := strings.TrimSuffix(repoURL, ".git") + ".wiki.git"
wikiCtx := context.WithValue(ctx, "repo", wikiURL)
_, err := s.cloneAndScanRepo(wikiCtx, installationClient, wikiURL, chunksChan)
if err != nil {
scanErrs.Add(err)
// Don't return, it still might be possible to scan comments.
}
}
}

A third possibility would be to simply try to clone the wiki for any repo that has "has_wiki": true and ignore clone errors. The theory is that a HEAD request would be more efficient than attempting to clone a repo.

Additional Context

A network request is necessary because GitHub's has_wiki property lies.

#2233 (comment)

References

#2233

@rosecodym
Copy link
Collaborator

I'm leaning towards option 2 ("just use another goroutine" always worries me a little), but I'll let others weigh in. While we wait for that, though, what's the latency penalty of the code as it is? Is it bad enough that we should back it out while we decide?

@rgmz
Copy link
Contributor Author

rgmz commented Feb 5, 2024

Give it a try yourself and let me know. I live in a century-old building with slow Internet speed, plus WSL's networking stack constantly falls over itself. Perhaps my results are a confluence of bad luck and not reflective of a typical experience.

@rosecodym
Copy link
Collaborator

rosecodym commented Feb 5, 2024

I just checked myself, and you're right, that performance cost is drastic. I'm going to open a revert PR to back it out until we figure out how we want to solve this. EDIT we could also just default the flag to off, right?

@rgmz
Copy link
Contributor Author

rgmz commented Mar 29, 2024

@rgmz rgmz closed this as completed Mar 29, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

No branches or pull requests

2 participants