
Add Golang package #348

Merged
merged 23 commits into monperrus:master on Apr 5, 2024

Conversation

@starius (Contributor) commented Apr 3, 2024

The Go package embeds the JSON file with the patterns using Go's go:embed feature, so it stays in sync with the JSON file automatically and needs no manual updates.

The JSON file is parsed when the package is loaded and exposed in the API as a Go slice of type Crawler. The functions IsCrawler and MatchingCrawlers check whether a User-Agent string belongs to a crawler. They run the regexps with the go-re2 library, which is much faster than the standard library regexp engine. I implemented MatchingCrawlers with performance in mind: the regexps are combined into a binary tree that is traversed during the search. Since RE2 is faster on one large regexp than on each regexp individually, this brings a further speed-up.
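As a rough, self-contained sketch of the API shape described above (a toy using Go's standard regexp engine and a tiny inline two-entry sample in place of the embedded JSON file; the type and function names follow the PR, everything else is illustrative):

```go
package main

import (
	"encoding/json"
	"fmt"
	"regexp"
)

// Crawler mirrors one entry of the JSON file (field set is a sketch;
// the real file has more fields).
type Crawler struct {
	Pattern string `json:"pattern"`
	URL     string `json:"url"`
}

// In the real package the JSON file is compiled in with go:embed;
// a tiny inline sample stands in for it here.
var crawlersJSON = []byte(`[
	{"pattern": "Googlebot", "url": "https://example.com/googlebot"},
	{"pattern": "bingbot", "url": "https://example.com/bingbot"}
]`)

// Parsed once at package load time, mimicking the package's behavior.
var (
	Crawlers []Crawler
	regexps  []*regexp.Regexp
)

func init() {
	if err := json.Unmarshal(crawlersJSON, &Crawlers); err != nil {
		panic(err)
	}
	for _, c := range Crawlers {
		regexps = append(regexps, regexp.MustCompile(c.Pattern))
	}
}

// IsCrawler reports whether the user agent matches any known pattern.
func IsCrawler(userAgent string) bool {
	for _, re := range regexps {
		if re.MatchString(userAgent) {
			return true
		}
	}
	return false
}

// MatchingCrawlers returns the indices of all crawlers whose pattern matches.
func MatchingCrawlers(userAgent string) []int {
	var hits []int
	for i, re := range regexps {
		if re.MatchString(userAgent) {
			hits = append(hits, i)
		}
	}
	return hits
}

func main() {
	ua := "Mozilla/5.0 (compatible; Googlebot/2.1)"
	fmt.Println(IsCrawler(ua))         // true
	fmt.Println(MatchingCrawlers(ua))  // [0]
	fmt.Println(IsCrawler("curl/8.0")) // false
}
```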

I also added a GitHub workflow that runs the tests and benchmarks of the Go package on each push.

To achieve the best performance possible in functions IsCrawler and MatchingCrawlers, install C++ RE2 into your system:

sudo apt-get install libre2-dev

and pass the -tags re2_cgo build tag.

They prevented parsing in Go.
Performance increase is huge!

Go regexp:                              0.05 MB/s
go-re2 in pure Go mode:                77.84 MB/s
go-re2 using C++ Re2 (-tags re2_cgo): 213.85 MB/s

To enable C++ RE2, install it:
sudo apt-get install libre2-dev
and pass the -tags re2_cgo build tag.
RE2 is fast on large regexps (faster than running each regexp
individually, including with Go's regexp). I used this fact to find
matching regexps with a tree whose nodes hold regexps concatenated from
the patterns of their subtrees; the individual matches are found by
descending from the root of the tree.
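The tree search described in the commit message can be sketched in a few lines. This is a toy illustration, not the package's actual data structures: each node stores a |-join of all patterns in its subtree (wrapped in non-capturing groups for safety), and the search prunes any subtree whose combined regexp does not match.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// node of a binary tree: each node stores one combined regexp covering
// all patterns in its subtree; leaves also keep the pattern's index.
type node struct {
	re          *regexp.Regexp
	left, right *node
	index       int // valid at leaves only, -1 otherwise
}

// build combines patterns[lo:hi] into a subtree. Each pattern is wrapped
// in (?:...) so top-level alternations cannot leak across the |-join.
func build(patterns []string, lo, hi int) *node {
	parts := make([]string, 0, hi-lo)
	for _, p := range patterns[lo:hi] {
		parts = append(parts, "(?:"+p+")")
	}
	n := &node{re: regexp.MustCompile(strings.Join(parts, "|")), index: -1}
	if hi-lo == 1 {
		n.index = lo
		return n
	}
	mid := (lo + hi) / 2
	n.left = build(patterns, lo, mid)
	n.right = build(patterns, mid, hi)
	return n
}

// matching descends only into subtrees whose combined regexp matches,
// so a whole non-matching half is rejected with a single regexp test.
func matching(n *node, ua string, out *[]int) {
	if n == nil || !n.re.MatchString(ua) {
		return
	}
	if n.index >= 0 {
		*out = append(*out, n.index)
		return
	}
	matching(n.left, ua, out)
	matching(n.right, ua, out)
}

func main() {
	patterns := []string{"Googlebot", "bingbot", "AhrefsBot", "DuckDuckBot"}
	root := build(patterns, 0, len(patterns))
	var hits []int
	matching(root, "Mozilla/5.0 (compatible; AhrefsBot/7.0)", &hits)
	fmt.Println(hits) // [2]
}
```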

Benchmark results BenchmarkMatchingCrawlers:
Before this commit (Re2 individually, pure Go):       0.32 MB/s
Before this commit (Re2 individually, -tags re2_cgo): 1.32 MB/s
If Go regexp is used individually:                    2.31 MB/s
With this commit (Re2, pure Go):                      5.90 MB/s
With this commit (Re2, -tags re2_cgo):               18.24 MB/s

Maybe it can be improved further with hyperscan, but I don't
want to bring in another dependency.
@starius (Contributor, Author) commented Apr 3, 2024

GitHub Actions workflow: https://github.com/starius/crawler-user-agents/actions/workflows/golang.yml
Please enable it in the repo.

@monperrus monperrus mentioned this pull request Apr 4, 2024
validate_test.go (outdated):
t.Run(crawler.URL, func(t *testing.T) {
// print pattern to console for quickcheck in CI
fmt.Print(crawler.Pattern)
@starius (Contributor, Author):

Use fmt.Println to print each pattern on a separate line.

Also, it may be better to use crawler.Pattern as the subtest name (the first argument of t.Run) and run go test -v, which prints each subtest name (i.e. each pattern).

@monperrus (Owner):

great idea, could you do it?

@starius (Contributor, Author):

Done. I pushed to the branch.

@monperrus (Owner):

Thanks a lot for the great contribution!

I've asked for PR approval by Go experts.

@starius (Contributor, Author) commented Apr 4, 2024

Added an example Go program and fixed a copy-paste error in the Go benchmark.

@monperrus (Owner):

Thanks a lot @starius, I really appreciate it.

We care a lot about software supply chain security for crawler-user-agents (cc/ @ericcornelissen @javierron), and we would like to keep external dependencies minimal.

In particular, I'd like to remove the dependencies on stretchr/testify and tetratelabs/wazero (an entire runtime).

If this means moving from wasilibs/go-re2 to the Go standard regexp, we probably have to do this.

What do you think?

See monperrus#348 (comment)

Also, it turned out to be faster to check the regexps individually
rather than as one large |-concatenation of all regexps. One regexp
check takes 66 microseconds on an Intel Core i7-7820HQ CPU @ 2.90GHz.
@starius (Contributor, Author) commented Apr 4, 2024

Thank you for the feedback!

I removed stretchr/testify; it was used only in tests.

I acknowledge the problems with wazero and re2; I just caught a crash in re2 related to wazero. I switched back to Go's standard regexp. It is not as bad as expected when the regexps are checked one by one instead of as one combined regexp for all patterns: one IsCrawler call takes 66 microseconds on an Intel Core i7-7820HQ CPU @ 2.90GHz.

@starius (Contributor, Author) commented Apr 5, 2024

I pushed another commit to check against false positives. It fixes #350

@monperrus monperrus merged commit 951462f into monperrus:master Apr 5, 2024
2 checks passed
@monperrus (Owner):

great, many thanks @starius

@monperrus (Owner):

Hi @starius

Afterthought from @javierron: the way the regex is written, we still need to do n regex matches when matching against the two depth=1 nodes (and then some more). Maybe a trie-based join approach would be better?

WDYT?

@starius (Contributor, Author) commented Apr 8, 2024

Hi @monperrus !

Using a trie looks good to me!

The only trie implementation in the Go standard library I am aware of is https://pkg.go.dev/strings#NewReplacer.
We can build a replacer that replaces every pattern with an empty string and run it over the user agent: if the string changes, something matched. For MatchingCrawlers we can replace each pattern with a unique prefix followed by the crawler ID and then extract it.

The problem is that some regexps are not plain search strings but use regexp syntax, e.g. "Ahrefs(Bot|SiteAudit)", "AdsBot-Google([^-]|$)", "S[eE][mM]rushBot", etc.
Some of them can be expanded into several plain strings, e.g. "Ahrefs(Bot|SiteAudit)" => "AhrefsBot", "AhrefsSiteAudit", and added to the trie as separate items. The small minority of complex patterns can still be checked as regexps.
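A sketch of the strings.NewReplacer idea under those assumptions (literal-only patterns, with alternations pre-expanded into separate entries). The helper names newMatcher and matchesAny are invented for illustration; strings.NewReplacer builds a trie of the patterns internally, so one pass over the user agent tests them all.

```go
package main

import (
	"fmt"
	"strings"
)

// newMatcher builds a strings.Replacer that deletes every literal
// pattern. NewReplacer indexes the patterns in a trie, so Replace
// scans the input once regardless of how many patterns there are.
func newMatcher(literals []string) *strings.Replacer {
	pairs := make([]string, 0, 2*len(literals))
	for _, l := range literals {
		pairs = append(pairs, l, "") // replace each pattern with ""
	}
	return strings.NewReplacer(pairs...)
}

// matchesAny reports whether any literal occurs in s: if the replacer
// changed the string, something matched.
func matchesAny(r *strings.Replacer, s string) bool {
	return r.Replace(s) != s
}

func main() {
	// "Ahrefs(Bot|SiteAudit)" pre-expanded into two literal entries.
	literals := []string{"Googlebot", "AhrefsBot", "AhrefsSiteAudit"}
	m := newMatcher(literals)
	fmt.Println(matchesAny(m, "Mozilla/5.0 (compatible; AhrefsSiteAudit/6.1)")) // true
	fmt.Println(matchesAny(m, "curl/8.0"))                                      // false
}
```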

@starius (Contributor, Author) commented Apr 9, 2024

@monperrus See #353
