
Add Golang package #348

Merged 23 commits on Apr 5, 2024
1 change: 1 addition & 0 deletions .github/workflows/ci-validation.yml
@@ -23,3 +23,4 @@ jobs:
- run: py.test -vv
- run: python3 validate.py
- run: php validate.php
- run: go test
6 changes: 5 additions & 1 deletion README.md
@@ -34,6 +34,11 @@ Each `pattern` is a regular expression. It should work out-of-the-box with your f
* JavaScript: `if (RegExp(entry.pattern).test(req.headers['user-agent']) { ... }`
* PHP: add a slash before and after the pattern: `if (preg_match('/'.$entry['pattern'].'/', $_SERVER['HTTP_USER_AGENT'])): ...`
* Python: `if re.search(entry['pattern'], ua): ...`
* Go: use [this package](https://pkg.go.dev/github.com/monperrus/crawler-user-agents).
It provides the global variable `Crawlers` (kept in sync with `crawler-user-agents.json`)
and the functions `IsCrawler` and `MatchingCrawlers`. For the best performance in
`IsCrawler` and `MatchingCrawlers`, install the C++ RE2 library (`sudo apt-get install libre2-dev`)
and build with `-tags re2_cgo`. A minimal usage sketch follows below.
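A minimal usage sketch (the import alias `agents` and the sample User-Agent string are illustrative assumptions, not prescribed by this PR):

```go
package main

import (
	"fmt"

	agents "github.com/monperrus/crawler-user-agents" // the package name is "agents"
)

func main() {
	// Illustrative User-Agent string; any HTTP User-Agent header can be checked.
	ua := "Googlebot/2.1 (+http://www.google.com/bot.html)"

	// Reports whether the string matches any known crawler pattern.
	fmt.Println(agents.IsCrawler(ua))

	// Indices into agents.Crawlers of every matching pattern.
	for _, i := range agents.MatchingCrawlers(ua) {
		fmt.Println(agents.Crawlers[i].URL)
	}
}
```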

## Contributing

@@ -66,7 +71,6 @@ There are a few wrapper libraries that use this data to detect bots:
* [Voight-Kampff](https://github.com/biola/Voight-Kampff) (Ruby)
* [isbot](https://github.com/Hentioe/isbot) (Ruby)
* [crawlers](https://github.com/Olical/crawlers) (Clojure)
* [crawlerflagger](https://godoc.org/go.kelfa.io/kelfa/pkg/crawlerflagger) (Go)
* [isBot](https://github.com/omrilotan/isbot) (Node.JS)

Other systems for spotting robots, crawlers, and spiders that you may want to consider are:
16 changes: 16 additions & 0 deletions go.mod
@@ -0,0 +1,16 @@
module github.com/monperrus/crawler-user-agents

go 1.19

require (
github.com/stretchr/testify v1.9.0
github.com/wasilibs/go-re2 v1.5.1
)

require (
github.com/davecgh/go-spew v1.1.1 // indirect
github.com/magefile/mage v1.14.0 // indirect
github.com/pmezard/go-difflib v1.0.0 // indirect
github.com/tetratelabs/wazero v1.7.0 // indirect
gopkg.in/yaml.v3 v3.0.1 // indirect
)
15 changes: 15 additions & 0 deletions go.sum
@@ -0,0 +1,15 @@
github.com/davecgh/go-spew v1.1.1 h1:vj9j/u1bqnvCEfJOwUhtlOARqs3+rkHYY13jYWTU97c=
github.com/davecgh/go-spew v1.1.1/go.mod h1:J7Y8YcW2NihsgmVo/mv3lAwl/skON4iLHjSsI+c5H38=
github.com/magefile/mage v1.14.0 h1:6QDX3g6z1YvJ4olPhT1wksUcSa/V0a1B+pJb73fBjyo=
github.com/magefile/mage v1.14.0/go.mod h1:z5UZb/iS3GoOSn0JgWuiw7dxlurVYTu+/jHXqQg881A=
github.com/pmezard/go-difflib v1.0.0 h1:4DBwDE0NGyQoBHbLQYPwSUPoCMWR5BEzIk/f1lZbAQM=
github.com/pmezard/go-difflib v1.0.0/go.mod h1:iKH77koFhYxTK1pcRnkKkqfTogsbg7gZNVY4sRDYZ/4=
github.com/stretchr/testify v1.9.0 h1:HtqpIVDClZ4nwg75+f6Lvsy/wHu+3BoSGCbBAcpTsTg=
github.com/stretchr/testify v1.9.0/go.mod h1:r2ic/lqez/lEtzL7wO/rwa5dbSLXVDPFyf8C91i36aY=
github.com/tetratelabs/wazero v1.7.0 h1:jg5qPydno59wqjpGrHph81lbtHzTrWzwwtD4cD88+hQ=
github.com/tetratelabs/wazero v1.7.0/go.mod h1:ytl6Zuh20R/eROuyDaGPkp82O9C/DJfXAwJfQ3X6/7Y=
github.com/wasilibs/go-re2 v1.5.1 h1:a+Gb1mx6Q7MmU4d+3BCnnN28U2/cnADmY1oRRanQi10=
github.com/wasilibs/go-re2 v1.5.1/go.mod h1:UqqxQ1O99boQUm1r61H/IYGiGQOS/P88K7hU5nLNkEg=
gopkg.in/check.v1 v0.0.0-20161208181325-20d25e280405/go.mod h1:Co6ibVJAznAaIkqp8huTwlJQCZ016jof/cbN4VW5Yz0=
gopkg.in/yaml.v3 v3.0.1 h1:fxVm/GzAzEWqLHuvctI91KS9hhNmmWOoWu0XTYJS7CA=
gopkg.in/yaml.v3 v3.0.1/go.mod h1:K4uyk7z7BCEPqu6E+C64Yfv1cQ7kz7rIZviUmN+EgEM=
183 changes: 183 additions & 0 deletions validate.go
@@ -0,0 +1,183 @@
package agents

import (
_ "embed"
"encoding/json"
"fmt"
"strings"
"time"

regexp "github.com/wasilibs/go-re2"
)

//go:embed crawler-user-agents.json
var crawlersJson []byte

// Crawler contains information about one crawler.
type Crawler struct {
// Regexp of User Agent of the crawler.
Pattern string `json:"pattern"`

// Discovery date.
AdditionDate time.Time `json:"addition_date"`

// Official url of the robot.
URL string `json:"url"`

// Examples of full User Agent strings.
Instances []string `json:"instances"`
}

// Private type needed to convert addition_date from/to the format used in JSON.
type jsonCrawler struct {
Pattern string `json:"pattern"`
AdditionDate string `json:"addition_date"`
URL string `json:"url"`
Instances []string `json:"instances"`
}

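// timeLayout is the Go reference-time layout used for addition_date values like "2024/04/05".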
const timeLayout = "2006/01/02"

func (c Crawler) MarshalJSON() ([]byte, error) {
jc := jsonCrawler{
Pattern: c.Pattern,
AdditionDate: c.AdditionDate.Format(timeLayout),
URL: c.URL,
Instances: c.Instances,
}
return json.Marshal(jc)
}

func (c *Crawler) UnmarshalJSON(b []byte) error {
var jc jsonCrawler
if err := json.Unmarshal(b, &jc); err != nil {
return err
}

c.Pattern = jc.Pattern
c.URL = jc.URL
c.Instances = jc.Instances

if c.Pattern == "" {
return fmt.Errorf("empty pattern in record %s", string(b))
}

if jc.AdditionDate != "" {
tim, err := time.ParseInLocation(timeLayout, jc.AdditionDate, time.UTC)
if err != nil {
return err
}
c.AdditionDate = tim
}

return nil
}

// Crawlers is the list of crawlers, built from the contents of crawler-user-agents.json.
var Crawlers = func() []Crawler {
var crawlers []Crawler
if err := json.Unmarshal(crawlersJson, &crawlers); err != nil {
panic(err)
}
return crawlers
}()

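// joinRes ORs the patterns of Crawlers[begin:end] into one regexp, wrapping each pattern in its own group.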
func joinRes(begin, end int) string {
regexps := make([]string, 0, len(Crawlers))
for _, crawler := range Crawlers[begin:end] {
regexps = append(regexps, "("+crawler.Pattern+")")
}
return strings.Join(regexps, "|")
}

var allRegexps = joinRes(0, len(Crawlers))

var allRegexpsRe = regexp.MustCompile(allRegexps)

// IsCrawler reports whether the User-Agent string matches any crawler pattern.
func IsCrawler(userAgent string) bool {
return allRegexpsRe.MatchString(userAgent)
}

// With RE2 it is fast to check the text against a large regexp.
// To find all matching regexps faster, build a binary tree of regexps:
// each internal node holds the alternation of all patterns in its subtree,
// so an entire subtree can be skipped when its combined regexp does not match.

type regexpNode struct {
re *regexp.Regexp
left *regexpNode
right *regexpNode
index int
}

var regexpsTree = func() *regexpNode {
nodes := make([]*regexpNode, len(Crawlers))
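	// starts[i] is the index in Crawlers of the first pattern covered by
	// nodes[i]; the extra trailing element marks the end of the last interval.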
starts := make([]int, len(Crawlers)+1)
for i, crawler := range Crawlers {
nodes[i] = &regexpNode{
re: regexp.MustCompile(crawler.Pattern),
index: i,
}
starts[i] = i
}
starts[len(Crawlers)] = len(Crawlers) // To get end of interval.

for len(nodes) > 1 {
// Join into pairs.
nodes2 := make([]*regexpNode, (len(nodes)+1)/2)
starts2 := make([]int, 0, len(nodes2)+1)
for i := 0; i < len(nodes)/2; i++ {
leftIndex := 2 * i
rightIndex := 2*i + 1
nodes2[i] = &regexpNode{
left: nodes[leftIndex],
right: nodes[rightIndex],
}
if len(nodes2) != 1 {
// Skip regexp for root node, it is not used.
joinedRe := joinRes(starts[leftIndex], starts[rightIndex+1])
nodes2[i].re = regexp.MustCompile(joinedRe)
}
starts2 = append(starts2, starts[leftIndex])
}
if len(nodes)%2 == 1 {
nodes2[len(nodes2)-1] = nodes[len(nodes)-1]
starts2 = append(starts2, starts[len(starts)-2])
}
starts2 = append(starts2, starts[len(starts)-1])

nodes = nodes2
starts = starts2
}

root := nodes[0]

if root.left == nil {
panic("the algoriths does not work with just one regexp")
}

return root
}()

// MatchingCrawlers finds all crawlers matching the User-Agent string and returns their indices in Crawlers.
func MatchingCrawlers(userAgent string) []int {
indices := []int{}

var visit func(node *regexpNode)
visit = func(node *regexpNode) {
if node.left != nil {
if node.left.re.MatchString(userAgent) {
visit(node.left)
}
if node.right.re.MatchString(userAgent) {
visit(node.right)
}
} else {
// Leaf.
indices = append(indices, node.index)
}
}

visit(regexpsTree)

return indices
}
45 changes: 45 additions & 0 deletions validate_test.go
@@ -0,0 +1,45 @@
package agents

import (
"fmt"
"testing"

"github.com/stretchr/testify/require"
)

func TestPatterns(t *testing.T) {
// loading all crawlers with go:embed
// some validation happens in UnmarshalJSON
allCrawlers := Crawlers

// there are at least 10 crawlers
require.GreaterOrEqual(t, len(allCrawlers), 10)

for i, crawler := range allCrawlers {
t.Run(crawler.URL, func(t *testing.T) {
// print the pattern to the console for a quick check in CI
fmt.Print(crawler.Pattern)
Contributor Author: Use fmt.Println to print each pattern on a separate line.
Also, it may be better to use crawler.Pattern as the subtest name (the first
argument of t.Run) and run with go test -v; that prints each subtest name,
which would be the pattern.

Owner: great idea, could you do it?

Contributor Author: Done. I pushed to the branch.

for _, instance := range crawler.Instances {
require.True(t, IsCrawler(instance), instance)
require.Contains(t, MatchingCrawlers(instance), i, instance)
}
})
}
}

func BenchmarkIsCrawler(b *testing.B) {
userAgent := "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36 Google-PageRenderer Google (+https://developers.google.com/+/web/snippet/)"
b.SetBytes(int64(len(userAgent)))
for n := 0; n < b.N; n++ {
IsCrawler(userAgent)
}
}

func BenchmarkMatchingCrawlers(b *testing.B) {
userAgent := "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36 Google-PageRenderer Google (+https://developers.google.com/+/web/snippet/)"
b.SetBytes(int64(len(userAgent)))
for n := 0; n < b.N; n++ {
MatchingCrawlers(userAgent)
}
}
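For reference, a sketch of how one might run these tests and benchmarks locally (standard `go test` invocations; `re2_cgo` is the build tag described in the README change above):

```sh
go test ./...                    # default WebAssembly-based RE2 backend
go test -bench=. -benchmem       # run the benchmarks
go test -bench=. -tags re2_cgo   # cgo RE2 backend; requires libre2-dev
```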