Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Grok is slow #75

Open
fbacchella opened this issue Nov 24, 2017 · 8 comments
Open

Grok is slow #75

fbacchella opened this issue Nov 24, 2017 · 8 comments

Comments

@fbacchella
Copy link
Contributor

I'm using Code Tools: jmh to bench grok against java's regex.

The result for the following simple code:

    private io.thekraken.grok.api.Grok grok;
    private Pattern syslog;

    @Setup
    public void prepare() throws GrokException {
        grok = new io.thekraken.grok.api.Grok();
        grok.addPattern("NONNEGINT", "\\b(?:[0-9]+)\\b");
        grok.addPattern("GREEDYDATA", ".*");
        grok.compile("^<%{NONNEGINT:syslog_pri}>%{GREEDYDATA:message}", false);
        syslog = Pattern.compile("<(?<syslogpri>\\b(?:[0-9]+)\\b)>(?<message>.*)");
    }

    @Benchmark
    public Match grokSpeed() {
        Match gm = grok.match("<1>totor");
        gm.captures();
        assert gm.toMap().get("syslog_pri") != null;
        assert gm.toMap().get("message") != null;
        return gm;
    }

    @Benchmark
    public Matcher javaRegexSpeed() {
        Matcher m = syslog.matcher("<1>totor");
        m.matches();
        assert m.group("syslogpri") != null;
        assert m.group("message") != null;
        return m;
    }

returns, on a Intel Xeon E312xx:

GrokSpeed.grokSpeed       avgt    5  1.952 ± 0.053  us/op
GrokSpeed.javaRegexSpeed  avgt    5  0.165 ± 0.005  us/op

That's 11 time slower !

The full maven project for running tests is: fbacchella/grokspeed. It's run with mvn clean package && java -jar target/grokspeed.jar

@fbacchella
Copy link
Contributor Author

I improve the grok test:

    @Benchmark
    public Match grokSpeed() {
        Match gm = grok.match("<1>totor");
        gm.captures();
        Map<String, Object> mapped = gm.toMap();
        assert mapped.get("syslog_pri") != null;
        assert mapped.get("message") != null;
        return gm;
    }

But the results are still bad:

Benchmark                 Mode  Cnt  Score   Error  Units
GrokSpeed.grokSpeed       avgt    5  1.975 ± 0.012  us/op
GrokSpeed.javaRegexSpeed  avgt    5  0.255 ± 0.019  us/op

@javasoze
Copy link

@fbacchella Thanks for this benchmark tool.

We ran this with @keitaf's PR: https://github.com/paulwellnerbou/java-grok/pull/1 with our fork: https://github.com/dashbase/grokspeed/commit/7d0c7d25aaec426d2f9dd47edc4916140f22bc2a

And we see significant gains:

before:

Benchmark                   Mode  Cnt  Score   Error  Units
GrokSpeed.grokSpeed         avgt    5  1.724 ± 0.092  us/op
GrokSpeed.javaRegexSpeed    avgt    5  0.132 ± 0.007  us/op
GrokSpeed.notrealgrokSpeed  avgt    5  1.565 ± 0.094  us/op

after:

Benchmark                   Mode  Cnt  Score   Error  Units
GrokSpeed.grokSpeed         avgt    5  0.352 ± 0.012  us/op
GrokSpeed.javaRegexSpeed    avgt    5  0.137 ± 0.006  us/op
GrokSpeed.notrealgrokSpeed  avgt    5  0.341 ± 0.010  us/op

There are some API changes necc. for the optimization to work.

@fbacchella
Copy link
Contributor Author

Very nice.
The next step in Grok's performance would be able to change the regex engine used. Not all of them are created equals, as numerous benches, included mine, has shown that: https://github.com/fbacchella/RegexPerf. But the most import works has been done.

@javasoze
Copy link

Thanks @fbacchella
Any reason to choose joni over jregex? Jregex seems to be faster. Is the main reason that joni is operating on byte[] instead of String?

@fbacchella
Copy link
Contributor Author

jregex was unable to handle big regex and the last release is from 2002.

@keitaf
Copy link
Contributor

keitaf commented Mar 26, 2018

I put joni into my performance improvement branch and ran the performance test.
https://github.com/dashbase/java-grok/tree/oniguruma

java.regex

Benchmark                   Mode  Cnt  Score   Error  Units
GrokSpeed.grokSpeed         avgt    5  0.332 ± 0.010  us/op
GrokSpeed.javaRegexSpeed    avgt    5  0.125 ± 0.009  us/op
GrokSpeed.notrealgrokSpeed  avgt    5  0.325 ± 0.014  us/op

joni

Benchmark                   Mode  Cnt  Score   Error  Units
GrokSpeed.grokSpeed         avgt    5  0.525 ± 0.016  us/op
GrokSpeed.javaRegexSpeed    avgt    5  0.124 ± 0.016  us/op
GrokSpeed.notrealgrokSpeed  avgt    5  0.475 ± 0.017  us/op

Looks like joni doesn't give us performance boost, mostly because of String/UTF-16 <-> byte[]/UTF-8 conversion cost.

@fbacchella
Copy link
Contributor Author

Strange because my RegexPerf.org_joni test includes the conversion, so I should bet results similar.
But anyway, that's why I tested that too in RegexPerf: the fastest way to extract bytes from a String. Working with pure ascii if you can is much faster. I'm working with log parsing, which are mainly made from ascii, so it's intersting in this case, see https://github.com/fbacchella/LogHub/blob/master/src/main/java/loghub/processors/OnigurumaRegex.java. It tries using ascii but if it can't it tries UTF-8.
OnigurumaRegex also handle big search much better.
That's why I think a generic interface with a default implementation using java's regex would be helpful for peoples that process pure ascii or big strings.

@fbacchella
Copy link
Contributor Author

And after the updates:

# JMH version: 1.19
# VM version: JDK 1.8.0_162, VM 25.162-b12
...
Benchmark                   Mode  Cnt  Score   Error  Units
GrokSpeed.grokSpeed         avgt    5  0.357 ± 0.005  us/op
GrokSpeed.javaRegexSpeed    avgt    5  0.143 ± 0.004  us/op
GrokSpeed.notrealgrokSpeed  avgt    5  0.356 ± 0.004  us/op

Tested with Java 10:

# JMH version: 1.19
# VM version: JDK 10, VM 10+46
...
Benchmark                   Mode  Cnt  Score   Error  Units
GrokSpeed.grokSpeed         avgt    5  0.441 ± 0.008  us/op
GrokSpeed.javaRegexSpeed    avgt    5  0.130 ± 0.002  us/op
GrokSpeed.notrealgrokSpeed  avgt    5  0.428 ± 0.020  us/op

Java's regex improved, but grok decreased !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants