
Very poor performance of isAllowed()/isDisallowed() functions #77

Closed
ogolovanov opened this issue Sep 2, 2016 · 2 comments · Fixed by #89

Comments


ogolovanov commented Sep 2, 2016

I took an average robots.txt file (50-60 allow/disallow rules) and 10,000 URLs for a test.
Running the isAllowed() method for each URL took 37 seconds of CPU time in total.
Sorry, but this is very slow.

For example, look here.
50% (!!!) of all CPU time is spent on the $this->isInlineDirective($rule) call.
Why can't we make that call during robots.txt initialization and save 50% of the CPU time?


    protected function checkRuleSwitch($rule, $path)
    {
        switch ($this->isInlineDirective($rule)) {
            case self::DIRECTIVE_CLEAN_PARAM:
                if ($this->checkCleanParamRule($this->stripInlineDirective($rule), $path)) {
                    return true;
                }
                break;
            case self::DIRECTIVE_HOST:
                if ($this->checkHostRule($this->stripInlineDirective($rule))) {
                    return true;
                }
                break;
            default:
                if ($this->checkBasicRule($rule, $path)) {
                    return true;
                }
        }
        return false;
    }
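The proposed optimization, classifying each rule once at parse time instead of re-running isInlineDirective() on every isAllowed() call, can be sketched as follows. This is a minimal Python illustration of the idea, not the library's actual PHP code; the names (DIRECTIVE_*, classify_rule, RuleSet) are hypothetical, and the matching logic is deliberately simplified:

```python
# Hypothetical sketch: classify each rule once, at parse time, so per-URL
# checks dispatch on a precomputed tag instead of re-parsing the rule.

DIRECTIVE_CLEAN_PARAM = "clean-param"
DIRECTIVE_HOST = "host"
DIRECTIVE_BASIC = "basic"

def classify_rule(rule: str) -> tuple[str, str]:
    """Return (directive, payload) for a rule; runs once per rule."""
    for directive in (DIRECTIVE_CLEAN_PARAM, DIRECTIVE_HOST):
        prefix = directive + ":"
        if rule.lower().startswith(prefix):
            return directive, rule[len(prefix):].strip()
    return DIRECTIVE_BASIC, rule

class RuleSet:
    def __init__(self, rules: list[str]):
        # The per-rule classification happens exactly once, here.
        self.classified = [classify_rule(r) for r in rules]

    def check(self, path: str) -> bool:
        for directive, payload in self.classified:
            # Dispatch on the precomputed directive; no re-parsing per URL.
            if directive == DIRECTIVE_BASIC and path.startswith(payload):
                return True
        return False
```

With this shape, a loop over 10,000 URLs pays the classification cost once per rule rather than once per rule per URL.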

Or look here.
Why can't we make the call to $this->prepareRegexRule($this->encode_url($rule)) during initialization and save 30% of the CPU time?

    private function checkBasicRule($rule, $path)
    {
        $rule = $this->prepareRegexRule($this->encode_url($rule));
        // change @ to \@
        $escaped = strtr($rule, array('@' => '\@'));
        // match result
        if (preg_match('@' . $escaped . '@', $path)) {
            if (strpos($escaped, '$') !== false) {
                if (mb_strlen($escaped) - 1 == mb_strlen($path)) {
                    return true;
                }
            } else {
                $this->log[] = 'Rule match: Path';
                return true;
            }
        }
        return false;
    }
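The same precomputation idea applies here: build and compile the regex for each rule once at initialization, so the per-URL check is just a compiled-pattern match. Below is a hedged Python sketch; prepare_regex_rule is a simplified stand-in for the library's prepareRegexRule()/encode_url() pair, not its actual logic:

```python
# Hypothetical sketch: compile each rule's regex once at init time,
# instead of rebuilding it inside every checkBasicRule-style call.
import re

def prepare_regex_rule(rule: str) -> str:
    """Translate robots.txt wildcards to a regex (simplified stand-in)."""
    pattern = re.escape(rule).replace(r"\*", ".*")
    # A trailing "$" in robots.txt anchors the rule to the end of the path.
    if pattern.endswith(r"\$"):
        pattern = pattern[:-2] + "$"
    return "^" + pattern

class BasicRules:
    def __init__(self, rules: list[str]):
        # Compile every rule once; reused for all subsequent URL checks.
        self.compiled = [re.compile(prepare_regex_rule(r)) for r in rules]

    def is_disallowed(self, path: str) -> bool:
        return any(p.search(path) for p in self.compiled)
```

Since the parser is initialized once and queried many times, moving re.compile (or its PHP preg equivalent) out of the per-URL path is where the claimed 30% saving would come from.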

RobotsTxtParser is expected to be initialized once and then used many, many times.

@JanPetterMG
Collaborator

$this->prepareRegexRule():
It was done during initialization a month ago, but not anymore. See #67 (comment).
814737c#diff-a3addb78c633d1984e69a06f8ece166cL325

$this->checkRuleSwitch():
Then this repo has to be rewritten from the ground up...

There is also an initialization/parsing performance issue: #62

A PR is happily accepted, but I must admit this repo has more holes than Swiss cheese, so I'm not sure it's worth fixing. A rewrite is the way to go, at least for the long term...

@JanPetterMG
Collaborator

Btw, if you don't care about the rendering feature, version 0.2.2 should save you 30% of the CPU time (according to your calculations).

https://github.com/t1gor/Robots.txt-Parser-Class/releases/tag/v0.2.2

@t1gor t1gor mentioned this issue Aug 30, 2021
@t1gor t1gor linked a pull request Sep 1, 2021 that will close this issue
@t1gor t1gor closed this as completed in #89 Oct 19, 2021