Most patterns are simple search strings with no special regexp symbols. Some use ^ and $, which can be emulated in a plaintext search by adding these characters to the text itself (^ at the start, $ at the end), so that they match as regular characters. Some patterns also contain (xx|yy) or [xY] structures, which expand into several plaintext strings. Only rare patterns require real regexp matching. I've applied these simplifications and modifications.

There are two tables. One replaces each pattern with a list of possible search strings. The other covers the rare patterns that require a regexp: it maps each of them to specific strings whose presence in the text indicates a possible match. These strings are needed to know when to run the regexp.

Each search string is replaced with a random hex string of length 16 (to prevent accidental or intentional matches with anything in the input), followed by a label: "-" for simple search strings or "*" for rare cases requiring a regexp, then a number formatted as "%05d". All replacements are performed with strings.Replacer, which uses a trie and is therefore very fast.

The random hex string is then searched for in the output of the replacement. If it is not found, the text does not match. If it is found, the result is either a match (for simple search string labels) or a potential match (for regexp labels). In the latter case, the corresponding regexp is run on the text to verify the match.
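The scheme described above can be sketched roughly as follows. This is a minimal illustration, not the code from this commit: the names matcher, rarePattern, newMatcher, isCrawler and demoMatcher, as well as the example patterns, are all hypothetical.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// rarePattern (hypothetical) is a pattern that needs real regexp matching,
// plus the plain strings whose presence means the regexp is worth running.
type rarePattern struct {
	re    *regexp.Regexp
	hints []string
}

type matcher struct {
	marker   string // random 16-character hex string
	replacer *strings.Replacer
	rare     []rarePattern
}

// newMatcher builds one strings.Replacer from both tables.
// simple[i] lists the plain search strings that pattern i expands to.
func newMatcher(simple [][]string, rare []rarePattern) *matcher {
	b := make([]byte, 8) // 8 random bytes encode to 16 hex characters
	if _, err := rand.Read(b); err != nil {
		panic(err)
	}
	m := &matcher{marker: hex.EncodeToString(b), rare: rare}

	var pairs []string
	for i, needles := range simple {
		for _, n := range needles {
			pairs = append(pairs, n, fmt.Sprintf("%s-%05d", m.marker, i))
		}
	}
	for i, r := range rare {
		for _, h := range r.hints {
			pairs = append(pairs, h, fmt.Sprintf("%s*%05d", m.marker, i))
		}
	}
	m.replacer = strings.NewReplacer(pairs...)
	return m
}

// isCrawler reports whether ua matches any pattern.
func (m *matcher) isCrawler(ua string) bool {
	// Wrap the text so that "^" and "$" inside search strings match literally.
	out := m.replacer.Replace("^" + ua + "$")
	for {
		i := strings.Index(out, m.marker)
		if i < 0 {
			return false // no marker left: mismatch
		}
		out = out[i+len(m.marker):]
		if len(out) < 6 {
			return false // not a well-formed label
		}
		label, num := out[0], out[1:6]
		out = out[6:]
		switch label {
		case '-': // simple search string: definite match
			return true
		case '*': // potential match: verify with the regexp
			idx, err := strconv.Atoi(num)
			if err == nil && idx < len(m.rare) && m.rare[idx].re.MatchString(ua) {
				return true
			}
		}
	}
}

// demoMatcher uses made-up patterns purely for illustration.
func demoMatcher() *matcher {
	simple := [][]string{
		{"Googlebot"},
		{"^curl"},              // anchored pattern kept as a plain string
		{"bingbot", "Bingbot"}, // expansion of [bB]ingbot
	}
	rare := []rarePattern{
		{re: regexp.MustCompile(`[wW]get/\d+`), hints: []string{"wget/", "Wget/"}},
	}
	return newMatcher(simple, rare)
}

func main() {
	m := demoMatcher()
	for _, ua := range []string{"curl/8.0", "libcurl/8.0", "Wget/1.21", "Wget/beta"} {
		fmt.Println(ua, m.isCrawler(ua))
	}
}
```

Note that strings.Replacer performs non-overlapping replacements, so a search string that overlaps another one can hide it. The sketch ignores this case.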
Benchmark comparison:

    $ benchstat old.txt new.txt
    goos: linux
    goarch: amd64
    pkg: github.com/monperrus/crawler-user-agents
    cpu: Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
                               │   old.txt    │              new.txt               │
                               │    sec/op    │   sec/op     vs base               │
    IsCrawlerPositive-2          71.384µ ±  7%   1.535µ ± 3%  -97.85% (p=0.000 n=10)
    MatchingCrawlersPositive-2   70.597µ ±  2%   1.586µ ± 1%  -97.75% (p=0.000 n=10)
    IsCrawlerNegative-2          71.072µ ± 11%   1.747µ ± 4%  -97.54% (p=0.000 n=10)
    MatchingCrawlersNegative-2   67.978µ ±  1%   1.723µ ± 2%  -97.47% (p=0.000 n=10)
    geomean                       70.24µ         1.645µ       -97.66%

                               │   old.txt    │               new.txt                │
                               │     B/s      │     B/s       vs base                │
    IsCrawlerPositive-2          2.112Mi ±  7%   98.205Mi ± 3%  +4548.98% (p=0.000 n=10)
    MatchingCrawlersPositive-2   2.131Mi ±  2%   95.029Mi ± 1%  +4358.39% (p=0.000 n=10)
    IsCrawlerNegative-2          2.055Mi ± 10%   83.528Mi ± 4%  +3964.27% (p=0.000 n=10)
    MatchingCrawlersNegative-2   2.146Mi ±  1%   84.710Mi ± 2%  +3847.78% (p=0.000 n=10)
    geomean                      2.111Mi         90.14Mi        +4170.39%

New implementation is 40 times faster!