Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

golang: speedup using TRIE #353

Open
wants to merge 1 commit into
base: master
Choose a base branch
from
Open

golang: speedup using TRIE #353

wants to merge 1 commit into from

Conversation

starius
Copy link
Contributor

@starius starius commented Apr 9, 2024

Most patterns are simple search strings (not special Regexp symbols). Some utilize ^ and $, which can be emulated in plaintext search by appending these characters to the text itself for matching as regular characters. Additionally, some patterns involve (xx|yy) or [xY] structures, which expand to several plaintexts. Rare patterns require real regexp matching.

I've applied these simplifications and modifications. There are two tables: one replaces a pattern with a list of possible search strings, while the other matches rare patterns requiring regexp with specific strings indicating their possible presence in text. The specific string is needed to know when to run the regexp.

Search strings are substituted with a random hex string of length 16 (to prevent spontaneous or intentional matching with anything), followed by a label ("-" for simple search strings, "*" for rare cases requiring regexp, and a number encoded as "%05d" format).

All replacements are performed using strings.Replacer, which utilizes TRIE and is therefore very fast. The random hex string is searched within the output of the replacement. If it's not found, it indicates a mismatch. If found, it's either a match (for simple search string labels) or a potential match (for regexp patterns). In the latter case, the corresponding regexp is executed on the text to verify the match.

Benchmark comparison:

$ benchstat old.txt new.txt
goos: linux
goarch: amd64
pkg: github.com/monperrus/crawler-user-agents
cpu: Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
                           │    old.txt    │               new.txt               │
                           │    sec/op     │   sec/op     vs base                │
IsCrawlerPositive-2          71.384µ ±  7%   1.535µ ± 3%  -97.85% (p=0.000 n=10)
MatchingCrawlersPositive-2   70.597µ ±  2%   1.586µ ± 1%  -97.75% (p=0.000 n=10)
IsCrawlerNegative-2          71.072µ ± 11%   1.747µ ± 4%  -97.54% (p=0.000 n=10)
MatchingCrawlersNegative-2   67.978µ ±  1%   1.723µ ± 2%  -97.47% (p=0.000 n=10)
geomean                       70.24µ         1.645µ       -97.66%

                           │    old.txt    │                 new.txt                 │
                           │      B/s      │      B/s       vs base                  │
IsCrawlerPositive-2          2.112Mi ±  7%   98.205Mi ± 3%  +4548.98% (p=0.000 n=10)
MatchingCrawlersPositive-2   2.131Mi ±  2%   95.029Mi ± 1%  +4358.39% (p=0.000 n=10)
IsCrawlerNegative-2          2.055Mi ± 10%   83.528Mi ± 4%  +3964.27% (p=0.000 n=10)
MatchingCrawlersNegative-2   2.146Mi ±  1%   84.710Mi ± 2%  +3847.78% (p=0.000 n=10)
geomean                      2.111Mi          90.14Mi       +4170.39%

New implementation is 40 times faster!

@starius starius mentioned this pull request Apr 9, 2024
@monperrus
Copy link
Owner

thanks @starius

@javierron are you able to code-review this one?

Copy link

@javierron javierron left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@monperrus @starius
Very smart use of Replacer!

The main issue I see is that the list of patterns will need to be updated as more patterns are added to the file, which is not ideal.
Maybe the solution is to determine at initialization if a pattern is either literal or regex. I believe this would simplify analizePattern, and not have too much impact on performance, since most of the patterns are literals anyway.

Other than that, I leave only some general comments.

@@ -80,31 +84,198 @@ var Crawlers = func() []Crawler {
return crawlers
}()

var regexps = func() []*regexp.Regexp {
regexps := make([]*regexp.Regexp, len(Crawlers))
var pattern2literals = map[string][]string{

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This works well, but requires maintenance as more patterns are added to the file

validate.go Outdated
return []string{prefix}, nil
}

mainLiternal, has := pattern2mainLiteral[pattern]

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo mainLiternal -> mainLiteral

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed!

regexps []regexpPattern
}

var uniqueToken = hex.EncodeToString((&maphash.Hash{}).Sum(nil))

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can do the same with a short, deterministic string.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A short string may match spontaneously. A deterministic string may match intentionally: someone attacking the detector can pass a text with that string followed by wrong format in it, causing a panic in the current implementation.


for {
uniquePos := strings.Index(replaced, uniqueToken)
if uniquePos == -1 {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can move this check to the for declaration, it's a bit cleaner

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you elaborate on this, please?

for uniquePos := strings.Index(replaced, uniqueToken); uniquePos != -1; uniquePos = strings.Index(replaced, uniqueToken) {
}

Like this? I don't really like it, because it copy-pastes the strings.Index call.

for {
uniquePos := strings.Index(replaced, uniqueToken)
if uniquePos == -1 {
break

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can move this check to the for declaration, it's a bit cleaner

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See my comment above.

}

return false
}

// Finds all crawlers matching the User Agent and returns the list of their indices in Crawlers.
func MatchingCrawlers(userAgent string) []int {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of crashing, we can return error to allow the client to handle it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See my comment above.

}
oldnew = append(oldnew, oldnew2...)

regexps2 := make([]regexpPattern, len(regexps))

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this necessary?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Save memory. regexps is allocated in append calls and it is likely to have extra capacity, because append reallocates with reserves in capacity. regexps2 is allocated for exact size that is needed. We know the size only after finishing iterating over patterns, so we can not preallocate an array of the exact size in advance.

I added a comment in the code.

validate.go Outdated
})
}

replaceWith := fmt.Sprintf(" %s%c%05d ", uniqueToken, label, num)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The size of the index (5) should be configurable somehow, to avoid the magic 5 later

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. Moved to const numLen.

validate.go Outdated
}

start := uniquePos + len(uniqueToken) + 1
if start+5 >= len(replaced) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to recompute len(uniqueToken) every call

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. Moved to a const uniqueTokenLen.

validate.go Outdated
break
}

start := uniquePos + len(uniqueToken) + 1

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to recompute len(uniqueToken) every call

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. Moved to a const uniqueTokenLen.

Most patterns are simple search strings (not special Regexp symbols).
Some utilize ^ and $, which can be emulated in plaintext search by appending
these characters to the text itself for matching as regular characters.
Additionally, some patterns involve (xx|yy) or [xY] structures, which expand
to several plaintexts. Rare patterns require real regexp matching.

I've applied these simplifications and modifications. There are two tables:
one replaces a pattern with a list of possible search strings, while the other
matches rare patterns requiring regexp with specific strings indicating their
possible presence in text. The specific string is needed to know when to run
the regexp.

Search strings are substituted with a random hex string of length 16 (to prevent
spontaneous or intentional matching with anything), followed by a label ("-" for
simple search strings, "*" for rare cases requiring regexp, and a number encoded
as "%05d" format).

All replacements are performed using strings.Replacer, which utilizes TRIE and
is therefore very fast. The random hex string is searched within the output of
the replacement. If it's not found, it indicates a mismatch. If found, it's
either a match (for simple search string labels) or a potential match (for
regexp patterns). In the latter case, the corresponding regexp is executed on
the text to verify the match.

Benchmark comparison:

$ benchstat old.txt new.txt
goos: linux
goarch: amd64
pkg: github.com/monperrus/crawler-user-agents
cpu: Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
                           │    old.txt    │               new.txt               │
                           │    sec/op     │   sec/op     vs base                │
IsCrawlerPositive-2          71.384µ ±  7%   1.535µ ± 3%  -97.85% (p=0.000 n=10)
MatchingCrawlersPositive-2   70.597µ ±  2%   1.586µ ± 1%  -97.75% (p=0.000 n=10)
IsCrawlerNegative-2          71.072µ ± 11%   1.747µ ± 4%  -97.54% (p=0.000 n=10)
MatchingCrawlersNegative-2   67.978µ ±  1%   1.723µ ± 2%  -97.47% (p=0.000 n=10)
geomean                       70.24µ         1.645µ       -97.66%

                           │    old.txt    │                 new.txt                 │
                           │      B/s      │      B/s       vs base                  │
IsCrawlerPositive-2          2.112Mi ±  7%   98.205Mi ± 3%  +4548.98% (p=0.000 n=10)
MatchingCrawlersPositive-2   2.131Mi ±  2%   95.029Mi ± 1%  +4358.39% (p=0.000 n=10)
IsCrawlerNegative-2          2.055Mi ± 10%   83.528Mi ± 4%  +3964.27% (p=0.000 n=10)
MatchingCrawlersNegative-2   2.146Mi ±  1%   84.710Mi ± 2%  +3847.78% (p=0.000 n=10)
geomean                      2.111Mi          90.14Mi       +4170.39%

New implementation is 40 times faster!
@starius
Copy link
Contributor Author

starius commented Apr 13, 2024

Hi @javierron !

Thank you very much for the review!
I addressed the feedback, see my comments.

Maybe the solution is to determine at initialization if a pattern is either literal or regex. I believe this would simplify analizePattern

Can you elaborate on this, please? analizePattern already determines automatically if a pattern is a literal:

prefix, complete := re.LiteralPrefix()
if complete {
  return []string{prefix}, nil
}

(I put this code after checking presence in pattern2literals table, because it is faster than compiling the regexp, so regexp compilation is not done for cases from pattern2literals table.)

We still need both tables pattern2literals and pattern2mainLiteral, because we need to do something with non-literal regexps. Maybe pattern2literals could be generated automatically, but such a code would be quite complicated, so I think it is not worth it, to generate such a small table. And for pattern2mainLiteral I don't know how to automate its creation even in theory: we need to find a long enough literal in regexp which is present in any match. It requires in-depth analysis of the structure of regexp. And the hardcoded table is even smaller...

Do you have ideas how to do it in an elegant way?

@starius starius requested a review from javierron April 17, 2024 14:45
@monperrus
Copy link
Owner

@starius thanks for the updates

@javierron let me know how we should proceed thanks.

@javierron
Copy link

Hi @starius Thanks for the response.

I mean, we could have two sets: one set for literals, which are checked against using the trie solution; and another set for actual patterns, which are checked against by using a regex chain (regex1|regex2|...|regexn). This would avoid the issue of having to update the program every time a new crawler with a pattern is added.

@monperrus Does that make any sense?

@starius
Copy link
Contributor Author

starius commented May 16, 2024

Hi @javierron @monperrus !

I think it is possible to implement func analyzePattern without hardcoded tables. Go has package regexp/syntax which provides introspection of regular expressions. I think it is possible to use it to convert a pattern to the list of all possible matching strings (what pattern2literals does) and to extract longest common sub-string from a regexp (what pattern2mainLiteral does). I'll try to implement this in few days and update the PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants