Enumerate GitHub wikis concurrently #2371

rgmz · 2024-02-03T16:23:34Z

Please review the Community Note before submitting

Description

PR #2233 added the ability to scan GitHub wikis by default, as wikis are just repos and scanning them has no rate limit. Regrettably, I overlooked how dispatching a request per repo can increase startup time. This does not have a noticeable effect when scanning repos or smaller orgst, however, it does have a noticeable impact for larger orgs.

trufflehog/pkg/sources/github/repo.go

Lines 238 to 240 in 27b30e6

    
           if s.conn.GetIncludeWikis() && s.hasWiki(ctx, r, repoURL) { 
        
           	s.reposWithWikis[repoURL] = struct{}{} 
        
           }

Without this check, enumeration takes ~3s per page (I have terrible Internet):

...
2024-02-03T10:52:55-05:00       info-1  trufflehog      Enumerating with token  {"source_manager_worker_id": "cKjLf", "endpoint": "https://api.github.com"}
2024-02-03T10:52:58-05:00       info-2  trufflehog      Listed repos    {"page": 0, "last_page": 61}
2024-02-03T10:53:01-05:00       info-2  trufflehog      Listed repos    {"page": 2, "last_page": 61}
2024-02-03T10:53:04-05:00       info-2  trufflehog      Listed repos    {"page": 3, "last_page": 61}

With this check, enumeration takes closer to ~15s per page:

...
2024-02-03T10:57:03-05:00       info-1  trufflehog      Enumerating with token  {"source_manager_worker_id": "gZtkz", "endpoint": "https://api.github.com"}
2024-02-03T10:57:06-05:00       info-2  trufflehog      Listed repos    {"page": 0, "last_page": 61}
2024-02-03T10:57:20-05:00       info-2  trufflehog      Listed repos    {"page": 2, "last_page": 61}
2024-02-03T10:57:33-05:00       info-2  trufflehog      Listed repos    {"page": 3, "last_page": 61}

@rosecodym

Preferred Solution

Two potential solutions come to mind:

Run this check in a Go routine so that it does not block enumeration
Run this check when each repo is being scanned, not during initial enumeration.

trufflehog/pkg/sources/github/github.go

Lines 804 to 815 in 27b30e6

    
           if s.conn.IncludeWikis { 
        
           	if _, ok := s.reposWithWikis[repoURL]; ok { 
        
           		wikiURL := strings.TrimSuffix(repoURL, ".git") + ".wiki.git" 
        
           		wikiCtx := context.WithValue(ctx, "repo", wikiURL) 
        
           		_, err := s.cloneAndScanRepo(wikiCtx, installationClient, wikiURL, chunksChan) 
        
           		if err != nil { 
        
           			scanErrs.Add(err) 
        
           			// Don't return, it still might be possible to scan comments. 
        
           		} 
        
           	} 
        
           }

A third possibility would be to simply try to clone the wiki for any repo that has "has_wiki": true and ignore clone errors. The theory is that a HEAD request would be more efficient than attempting to clone a repo.

Additional Context

A network request is necessary because GitHub's has_wiki property lies.

#2233 (comment)

References

#2233

The text was updated successfully, but these errors were encountered:

rosecodym · 2024-02-05T15:39:44Z

I'm leaning towards option 2 ("just use another goroutine" always worries me a little), but I'll let others weigh in. While we wait for that, though, what's the latency penalty of the code as it is? Is it bad enough that we should back it out while we decide?

rgmz · 2024-02-05T15:47:43Z

Give it a try yourself and let me know. I live in a century-old building with slow Internet speed, plus WSL's networking stack constantly falls over itself. Perhaps my results are a confluence of bad luck and not reflective of a typical experience.

rosecodym · 2024-02-05T19:25:09Z

I just checked myself, and you're right, that performance cost is drastic. I'm going to open a revert PR to back it out until we figure out how we want to solve this. EDIT we could also just default the flag to off, right?

rgmz · 2024-03-29T02:10:48Z

This was fixed in #2379.
https://github.com/trufflesecurity/trufflehog/blob/283d83f113ff8762c8c301d1f3de7767c67f02e2/pkg/sources/github/github.go#L736C52-L736C67

Albeit, the flag is still disabled by default.

rgmz added the enhancement label Feb 3, 2024

This was referenced Feb 5, 2024

Revert "Scan GitHub wikis #2233" #2385

Closed

Refactor GitHub source #2379

Merged

Disable GitHub wiki scanning by default #2386

Merged

rgmz closed this as completed Mar 29, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Enumerate GitHub wikis concurrently #2371

Enumerate GitHub wikis concurrently #2371

rgmz commented Feb 3, 2024 •

edited

Loading

rosecodym commented Feb 5, 2024

rgmz commented Feb 5, 2024

rosecodym commented Feb 5, 2024 •

edited

Loading

rgmz commented Mar 29, 2024

Enumerate GitHub wikis concurrently #2371

Enumerate GitHub wikis concurrently #2371

Comments

rgmz commented Feb 3, 2024 • edited Loading

Description

Preferred Solution

Additional Context

References

rosecodym commented Feb 5, 2024

rgmz commented Feb 5, 2024

rosecodym commented Feb 5, 2024 • edited Loading

rgmz commented Mar 29, 2024

rgmz commented Feb 3, 2024 •

edited

Loading

rosecodym commented Feb 5, 2024 •

edited

Loading