golang: speedup using TRIE #353

starius · 2024-04-09T01:17:51Z

Most patterns are simple search strings (no special regexp symbols). Some use ^ and $, which can be emulated in a plaintext search by adding these characters to the text itself, so that they match as regular characters. Additionally, some patterns contain (xx|yy) or [xY] structures, which expand to several plaintexts. A few rare patterns require real regexp matching.

I've applied these simplifications and modifications. There are two tables: one maps a pattern to the list of its possible search strings, while the other maps each rare pattern that requires a regexp to a specific string whose presence in the text indicates a possible match. The specific string is needed to know when to run the regexp.

Search strings are replaced with a random hex string of length 16 (to prevent spontaneous or intentional matching with anything), followed by a label ("-" for simple search strings, "*" for rare cases requiring regexp) and a number encoded in "%05d" format.

All replacements are performed using strings.Replacer, which uses a trie and is therefore very fast. The output of the replacement is then searched for the random hex string. If it is not found, there is no match. If it is found, it is either a match (for the simple search string label) or a potential match (for the regexp label). In the latter case, the corresponding regexp is run on the text to verify the match.
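The flow described above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the `entry` type, the three sample entries, and the helpers `buildReplacer` and `matchingEntries` are hypothetical names chosen for the example. Only the token construction and the `" %s%c%05d "` replacement format are taken from the diff.

```go
package main

import (
	"encoding/hex"
	"fmt"
	"hash/maphash"
	"regexp"
	"strconv"
	"strings"
)

const numLen = 5 // width of the "%05d" number

// token is the random marker, built the same way as uniqueToken in the diff.
var token = hex.EncodeToString((&maphash.Hash{}).Sum(nil))

// entry is a hypothetical table row: a search string plus an optional
// regexp that must confirm the match (nil for plain literals).
type entry struct {
	literal string
	re      *regexp.Regexp
}

// Sample data, assumed for illustration only.
var entries = []entry{
	{literal: "Googlebot"},
	{literal: "bingbot"},
	{literal: "Fetcher/", re: regexp.MustCompile(`Fetcher/\d+`)},
}

// buildReplacer maps every search string to " <token><label><number> ".
func buildReplacer() *strings.Replacer {
	oldnew := make([]string, 0, 2*len(entries))
	for i, e := range entries {
		label := '-'
		if e.re != nil {
			label = '*'
		}
		oldnew = append(oldnew, e.literal, fmt.Sprintf(" %s%c%05d ", token, label, i))
	}
	return strings.NewReplacer(oldnew...)
}

// matchingEntries returns the indices of entries that match text.
func matchingEntries(r *strings.Replacer, text string) []int {
	replaced := r.Replace(text)
	var found []int
	for {
		pos := strings.Index(replaced, token)
		if pos == -1 {
			break // no marker left: nothing more can match
		}
		start := pos + len(token) + 1 // first digit after the label
		if start+numLen > len(replaced) {
			break // truncated marker: stop instead of panicking
		}
		label := replaced[start-1]
		num, err := strconv.Atoi(replaced[start : start+numLen])
		if err != nil || num < 0 || num >= len(entries) {
			break
		}
		e := entries[num]
		// "-" is a definite match, "*" only a candidate that the
		// regexp has to confirm on the original text.
		if label == '-' || (label == '*' && e.re != nil && e.re.MatchString(text)) {
			found = append(found, num)
		}
		replaced = replaced[start+numLen:]
	}
	return found
}

func main() {
	r := buildReplacer()
	fmt.Println(matchingEntries(r, "Mozilla/5.0 (compatible; Googlebot/2.1)"))
	fmt.Println(matchingEntries(r, "Fetcher/x"))
}
```

The regexp runs on the original text, not on the replaced one, so the marker never interferes with the pattern. A text that contains the same search string twice is reported twice in this sketch.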

Benchmark comparison:

$ benchstat old.txt new.txt
goos: linux
goarch: amd64
pkg: github.com/monperrus/crawler-user-agents
cpu: Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
                           │    old.txt    │               new.txt               │
                           │    sec/op     │   sec/op     vs base                │
IsCrawlerPositive-2          71.384µ ±  7%   1.535µ ± 3%  -97.85% (p=0.000 n=10)
MatchingCrawlersPositive-2   70.597µ ±  2%   1.586µ ± 1%  -97.75% (p=0.000 n=10)
IsCrawlerNegative-2          71.072µ ± 11%   1.747µ ± 4%  -97.54% (p=0.000 n=10)
MatchingCrawlersNegative-2   67.978µ ±  1%   1.723µ ± 2%  -97.47% (p=0.000 n=10)
geomean                       70.24µ         1.645µ       -97.66%

                           │    old.txt    │                 new.txt                 │
                           │      B/s      │      B/s       vs base                  │
IsCrawlerPositive-2          2.112Mi ±  7%   98.205Mi ± 3%  +4548.98% (p=0.000 n=10)
MatchingCrawlersPositive-2   2.131Mi ±  2%   95.029Mi ± 1%  +4358.39% (p=0.000 n=10)
IsCrawlerNegative-2          2.055Mi ± 10%   83.528Mi ± 4%  +3964.27% (p=0.000 n=10)
MatchingCrawlersNegative-2   2.146Mi ±  1%   84.710Mi ± 2%  +3847.78% (p=0.000 n=10)
geomean                      2.111Mi          90.14Mi       +4170.39%

The new implementation is about 40 times faster (geomean 70.24µ vs 1.645µ per op, a factor of roughly 42.7)!

monperrus · 2024-04-09T06:17:41Z

thanks @starius

@javierron are you able to code-review this one?

javierron

@monperrus @starius
Very smart use of Replacer!

The main issue I see is that the list of patterns will need to be updated as more patterns are added to the file, which is not ideal.
Maybe the solution is to determine at initialization if a pattern is either literal or regex. I believe this would simplify analizePattern, and not have too much impact on performance, since most of the patterns are literals anyway.

Other than that, I leave only some general comments.

javierron · 2024-04-11T15:08:39Z

validate.go

@@ -80,31 +84,198 @@ var Crawlers = func() []Crawler {
 	return crawlers
 }()

-var regexps = func() []*regexp.Regexp {
-	regexps := make([]*regexp.Regexp, len(Crawlers))
+var pattern2literals = map[string][]string{


This works well, but requires maintenance as more patterns are added to the file

javierron · 2024-04-11T15:08:48Z

validate.go

+		return []string{prefix}, nil
+	}
+
+	mainLiternal, has := pattern2mainLiteral[pattern]


typo mainLiternal -> mainLiteral

javierron · 2024-04-11T15:08:51Z

validate.go

+	regexps  []regexpPattern
+}
+
+var uniqueToken = hex.EncodeToString((&maphash.Hash{}).Sum(nil))


We can do the same with a short, deterministic string.

A short string may match spontaneously. A deterministic string may match intentionally: someone attacking the detector can pass a text with that string followed by wrong format in it, causing a panic in the current implementation.
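To illustrate the panic concern, here is a minimal sketch of decoding the label and number that follow the token while rejecting malformed input with an error. The function `parseMarker` and the error value `errBadMarker` are hypothetical names, not taken from the PR; the marker layout (one label byte, then five digits) follows the PR description.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
)

const numLen = 5 // width of the "%05d" number

var errBadMarker = errors.New("malformed marker after token")

// parseMarker decodes the label and number that follow the token.
// rest is the text immediately after the token.
func parseMarker(rest string) (label byte, num int, err error) {
	if len(rest) < 1+numLen {
		return 0, 0, errBadMarker // too short to hold label and digits
	}
	label = rest[0]
	if label != '-' && label != '*' {
		return 0, 0, errBadMarker
	}
	digits := rest[1 : 1+numLen]
	for i := 0; i < len(digits); i++ {
		if digits[i] < '0' || digits[i] > '9' {
			return 0, 0, errBadMarker // rejects signs, spaces, letters
		}
	}
	num, err = strconv.Atoi(digits)
	if err != nil {
		return 0, 0, errBadMarker
	}
	return label, num, nil
}

func main() {
	fmt.Println(parseMarker("-00042 tail"))
	fmt.Println(parseMarker("-12"))
}
```

With a random token such input should not occur in practice, so a check like this is only a second line of defence.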

javierron · 2024-04-11T15:10:33Z

validate.go

+
+	for {
+		uniquePos := strings.Index(replaced, uniqueToken)
+		if uniquePos == -1 {


We can move this check to the for declaration, it's a bit cleaner

Can you elaborate on this, please?

for uniquePos := strings.Index(replaced, uniqueToken); uniquePos != -1; uniquePos = strings.Index(replaced, uniqueToken) { }

Like this? I don't really like it, because it copy-pastes the strings.Index call.
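For comparison, the two loop shapes under discussion can be put side by side. This is only a sketch on a toy task (counting occurrences of a non-empty token); `countBreak` and `countHeader` are hypothetical names, and both functions are expected to return the same result.

```go
package main

import (
	"fmt"
	"strings"
)

// countBreak uses a bare for with an explicit break, as in the diff.
// token is assumed to be non-empty.
func countBreak(s, token string) int {
	n := 0
	for {
		pos := strings.Index(s, token)
		if pos == -1 {
			break
		}
		n++
		s = s[pos+len(token):]
	}
	return n
}

// countHeader moves the check into the for clause, at the cost of
// writing the strings.Index call twice.
func countHeader(s, token string) int {
	n := 0
	for pos := strings.Index(s, token); pos != -1; pos = strings.Index(s, token) {
		n++
		s = s[pos+len(token):]
	}
	return n
}

func main() {
	fmt.Println(countBreak("a##b##c", "##"), countHeader("a##b##c", "##"))
}
```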

javierron · 2024-04-11T15:11:10Z

validate.go

+	for {
+		uniquePos := strings.Index(replaced, uniqueToken)
+		if uniquePos == -1 {
+			break


We can move this check to the for declaration, it's a bit cleaner

See my comment above.

javierron · 2024-04-11T15:14:01Z

validate.go

 	}
+
 	return false
 }

 // Finds all crawlers matching the User Agent and returns the list of their indices in Crawlers.
 func MatchingCrawlers(userAgent string) []int {


Instead of crashing, we can return an error to allow the client to handle it.

See my comment above.

javierron · 2024-04-11T15:17:55Z

validate.go

+	}
+	oldnew = append(oldnew, oldnew2...)
+
+	regexps2 := make([]regexpPattern, len(regexps))


Why is this necessary?

To save memory. regexps grows through append calls, so it is likely to have extra capacity, because append over-allocates when it has to grow a slice. regexps2 is allocated with exactly the size that is needed. We know the size only after iterating over all patterns, so we cannot preallocate a slice of the exact size in advance.
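The effect can be seen in a small sketch. `exactCopy` is a hypothetical helper, not code from the PR; the capacity that append ends up with depends on the Go version, so only the copy's capacity is guaranteed.

```go
package main

import "fmt"

// exactCopy returns a copy of src whose capacity equals its length,
// so the over-allocated backing array of src can be garbage collected.
func exactCopy[T any](src []T) []T {
	dst := make([]T, len(src))
	copy(dst, src)
	return dst
}

func main() {
	var grown []int
	for i := 0; i < 100; i++ {
		grown = append(grown, i) // capacity grows in steps
	}
	trimmed := exactCopy(grown)
	fmt.Println("grown:  ", len(grown), cap(grown))
	fmt.Println("trimmed:", len(trimmed), cap(trimmed))
}
```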

I added a comment in the code.

javierron · 2024-04-11T15:20:04Z

validate.go

+			})
+		}
+
+		replaceWith := fmt.Sprintf(" %s%c%05d ", uniqueToken, label, num)


The size of the index (5) should be configurable somehow, to avoid the magic 5 later

Fixed. Moved to const numLen.

javierron · 2024-04-11T15:20:55Z

validate.go

+		}
+
+		start := uniquePos + len(uniqueToken) + 1
+		if start+5 >= len(replaced) {


No need to recompute len(uniqueToken) every call

Fixed. Moved to a const uniqueTokenLen.

javierron · 2024-04-11T15:21:07Z

validate.go

+			break
+		}
+
+		start := uniquePos + len(uniqueToken) + 1


No need to recompute len(uniqueToken) every call

Fixed. Moved to a const uniqueTokenLen.

starius · 2024-04-13T02:52:06Z

Hi @javierron !

Thank you very much for the review!
I addressed the feedback, see my comments.

Maybe the solution is to determine at initialization if a pattern is either literal or regex. I believe this would simplify analizePattern

Can you elaborate on this, please? analizePattern already determines automatically if a pattern is a literal:

prefix, complete := re.LiteralPrefix()
if complete {
  return []string{prefix}, nil
}

(I put this code after the lookup in the pattern2literals table, because the lookup is faster than compiling the regexp, so no regexp is compiled for patterns found in pattern2literals.)
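A small standalone sketch of how (*regexp.Regexp).LiteralPrefix separates plain search strings from real patterns; `isLiteral` is a hypothetical wrapper written for this example, not a function from the PR.

```go
package main

import (
	"fmt"
	"regexp"
)

// isLiteral reports whether pattern is a plain search string and, if
// so, returns that string.
func isLiteral(pattern string) (string, bool) {
	re, err := regexp.Compile(pattern)
	if err != nil {
		return "", false
	}
	// complete is true only when the literal prefix makes up the
	// entire regular expression.
	prefix, complete := re.LiteralPrefix()
	if !complete {
		return "", false
	}
	return prefix, true
}

func main() {
	for _, p := range []string{"Googlebot", `Googlebot/\d`, "(xx|yy)"} {
		lit, ok := isLiteral(p)
		fmt.Printf("%-14q literal=%q ok=%v\n", p, lit, ok)
	}
}
```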

We still need both tables, pattern2literals and pattern2mainLiteral, because we need to do something with non-literal regexps. Maybe pattern2literals could be generated automatically, but such code would be quite complicated, so I think it is not worth it for such a small table. For pattern2mainLiteral, I don't know how to automate its creation even in theory: we need to find a long enough literal in the regexp that is present in every match. That requires in-depth analysis of the structure of the regexp. And the hardcoded table is even smaller...

Do you have ideas how to do it in an elegant way?

monperrus · 2024-05-03T05:05:50Z

@starius thanks for the updates

@javierron let me know how we should proceed, thanks.

javierron · 2024-05-06T17:16:49Z

Hi @starius Thanks for the response.

I mean, we could have two sets: one set of literals, which are checked using the trie solution, and another set of actual patterns, which are checked using a regex chain (regex1|regex2|...|regexn). This would avoid the issue of having to update the program every time a new crawler with a pattern is added.
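One possible reading of this two-set proposal, as a rough sketch: `matcher`, `newMatcher` and `isCrawler` are hypothetical names, the split relies on LiteralPrefix, and strings.Contains stands in for the trie lookup to keep the example short.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// matcher holds the two sets: plain literals and one combined regexp.
type matcher struct {
	literals []string
	chain    *regexp.Regexp // nil when no pattern needs a regexp
}

// newMatcher classifies every pattern at initialization time.
func newMatcher(patterns []string) (*matcher, error) {
	m := &matcher{}
	var parts []string
	for _, p := range patterns {
		re, err := regexp.Compile(p)
		if err != nil {
			return nil, err
		}
		if prefix, complete := re.LiteralPrefix(); complete {
			m.literals = append(m.literals, prefix)
			continue
		}
		// Non-capturing group keeps each alternative self-contained.
		parts = append(parts, "(?:"+p+")")
	}
	if len(parts) > 0 {
		chain, err := regexp.Compile(strings.Join(parts, "|"))
		if err != nil {
			return nil, err
		}
		m.chain = chain
	}
	return m, nil
}

// isCrawler checks the literals first, then the regex chain.
func (m *matcher) isCrawler(userAgent string) bool {
	for _, lit := range m.literals {
		if strings.Contains(userAgent, lit) {
			return true
		}
	}
	return m.chain != nil && m.chain.MatchString(userAgent)
}

func main() {
	m, err := newMatcher([]string{"Googlebot", "bingbot", `Fetcher/\d+`, `[wW]eb[cC]rawler`})
	if err != nil {
		panic(err)
	}
	fmt.Println(len(m.literals), m.isCrawler("Fetcher/42"), m.isCrawler("curl/8.0"))
}
```

A chain like this answers whether any pattern matched, but not which one, so a function such as MatchingCrawlers would still need per-pattern checks.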

@monperrus Does that make any sense?

starius · 2024-05-16T11:57:25Z

Hi @javierron @monperrus !

I think it is possible to implement func analyzePattern without hardcoded tables. Go has the package regexp/syntax, which provides introspection of regular expressions. I think it can be used to convert a pattern to the list of all possible matching strings (what pattern2literals does) and to extract the longest common substring from a regexp (what pattern2mainLiteral does). I'll try to implement this in a few days and update the PR.
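As a rough idea of what the first half (listing all matching strings) might look like with regexp/syntax, here is a sketch under several assumptions: the helpers `expand` and `literals` and the limit `maxClassSize` are hypothetical, only literals, small character classes, groups, alternations and concatenations are handled, and case-insensitive literals are rejected. Extracting the longest common substring is not attempted.

```go
package main

import (
	"fmt"
	"regexp/syntax"
	"sort"
)

const maxClassSize = 16 // assumed cap on how many runes a class may expand to

// expand lists every string re can match, or reports false if re
// uses a construct this sketch does not handle.
func expand(re *syntax.Regexp) ([]string, bool) {
	switch re.Op {
	case syntax.OpEmptyMatch:
		return []string{""}, true
	case syntax.OpLiteral:
		if re.Flags&syntax.FoldCase != 0 {
			return nil, false // case-insensitive literal, e.g. from [wW]
		}
		return []string{string(re.Rune)}, true
	case syntax.OpCharClass:
		var out []string
		for i := 0; i+1 < len(re.Rune); i += 2 { // Rune holds lo, hi pairs
			for r := re.Rune[i]; r <= re.Rune[i+1]; r++ {
				if len(out) >= maxClassSize {
					return nil, false
				}
				out = append(out, string(r))
			}
		}
		return out, true
	case syntax.OpCapture:
		return expand(re.Sub[0])
	case syntax.OpAlternate:
		var out []string
		for _, sub := range re.Sub {
			s, ok := expand(sub)
			if !ok {
				return nil, false
			}
			out = append(out, s...)
		}
		return out, true
	case syntax.OpConcat:
		out := []string{""}
		for _, sub := range re.Sub {
			s, ok := expand(sub)
			if !ok {
				return nil, false
			}
			var next []string
			for _, prefix := range out {
				for _, suffix := range s {
					next = append(next, prefix+suffix)
				}
			}
			out = next
		}
		return out, true
	}
	return nil, false // repetition, wildcards, anchors, ...
}

// literals parses pattern and returns its sorted expansions.
func literals(pattern string) ([]string, bool) {
	re, err := syntax.Parse(pattern, syntax.Perl)
	if err != nil {
		return nil, false
	}
	out, ok := expand(re)
	if !ok {
		return nil, false
	}
	sort.Strings(out)
	return out, true
}

func main() {
	fmt.Println(literals(`Bot(xx|yy)[xY]`))
	fmt.Println(literals(`Bot/\d+`))
}
```

Patterns for which `literals` reports false would still need the regexp path.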

starius mentioned this pull request Apr 9, 2024

Add Golang package #348

Merged

javierron reviewed Apr 12, 2024

starius force-pushed the trie branch from 9f3efa9 to cf7b3da on April 13, 2024 02:38

starius requested a review from javierron April 17, 2024 14:45
