I took an average robots.txt file (50-60 allow/disallow rules) and 10,000 URLs for the test.
Running the isAllowed() method for each URL took 37 seconds of CPU time in total.
Sorry, but this is very slow.
For example, look here: 50% (!!!) of all CPU time is spent on the $this->isInlineDirective($rule) call.
Why can't we make that call once during robots.txt initialization and save 50% of the CPU time? (A sketch of what I mean follows the code below.)
protected function checkRuleSwitch($rule, $path)
{
    switch ($this->isInlineDirective($rule)) {
        case self::DIRECTIVE_CLEAN_PARAM:
            if ($this->checkCleanParamRule($this->stripInlineDirective($rule), $path)) {
                return true;
            }
            break;
        case self::DIRECTIVE_HOST:
            if ($this->checkHostRule($this->stripInlineDirective($rule))) {
                return true;
            }
            break;
        default:
            if ($this->checkBasicRule($rule, $path)) {
                return true;
            }
    }
    return false;
}
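One way to do that (only a minimal sketch, not the library's actual API; the rules array layout and the addRule() helper are hypothetical) would be to classify each rule once at parse time and store the result, so the per-URL check only reads the cached directive type:

protected function addRule($rule)
{
    // Hypothetical parse-time step: classify the rule once and keep the
    // stripped form alongside it, instead of re-deriving both for every URL.
    $this->rules[] = array(
        'raw'       => $rule,
        'directive' => $this->isInlineDirective($rule),
        'stripped'  => $this->stripInlineDirective($rule),
    );
}

protected function checkRuleSwitch(array $preparedRule, $path)
{
    // The per-URL check now only switches on the precomputed type.
    switch ($preparedRule['directive']) {
        case self::DIRECTIVE_CLEAN_PARAM:
            return $this->checkCleanParamRule($preparedRule['stripped'], $path);
        case self::DIRECTIVE_HOST:
            return $this->checkHostRule($preparedRule['stripped']);
        default:
            return $this->checkBasicRule($preparedRule['raw'], $path);
    }
}

With something like that, isInlineDirective() and stripInlineDirective() run once per rule instead of once per rule per URL.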
Or look here.
Why can't we call $this->prepareRegexRule($this->encode_url($rule)) during initialization and save 30% of the CPU time? (Again, a sketch follows the code below.)
private function checkBasicRule($rule, $path)
{
    $rule = $this->prepareRegexRule($this->encode_url($rule));
    // change @ to \@
    $escaped = strtr($rule, array('@' => '\@'));
    // match result
    if (preg_match('@' . $escaped . '@', $path)) {
        if (strpos($escaped, '$') !== false) {
            if (mb_strlen($escaped) - 1 == mb_strlen($path)) {
                return true;
            }
        } else {
            $this->log[] = 'Rule match: Path';
            return true;
        }
    }
    return false;
}
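Again only a sketch, with a made-up $preparedRegexCache property: the prepared and escaped pattern could be memoized per rule the first time it is needed (or built up front while parsing), so the regex preparation no longer runs on every path check:

private $preparedRegexCache = array();

private function checkBasicRule($rule, $path)
{
    // Hypothetical cache: prepare and escape the pattern only once per rule.
    if (!isset($this->preparedRegexCache[$rule])) {
        $regex = $this->prepareRegexRule($this->encode_url($rule));
        // change @ to \@
        $this->preparedRegexCache[$rule] = strtr($regex, array('@' => '\@'));
    }
    $escaped = $this->preparedRegexCache[$rule];

    // match result (unchanged from the current code)
    if (preg_match('@' . $escaped . '@', $path)) {
        if (strpos($escaped, '$') !== false) {
            if (mb_strlen($escaped) - 1 == mb_strlen($path)) {
                return true;
            }
        } else {
            $this->log[] = 'Rule match: Path';
            return true;
        }
    }
    return false;
}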
RobotsTxtParser is expected to be initialized once and used many, many times.
$this->checkRuleSwitch():
Then this repo has to be rewritten from the ground up...
There is also the initialization/parsing performance issue #62.
A PR is happily accepted, but I must admit this repo has more holes than Swiss cheese, so I'm not sure it's worth fixing. A rewrite is the way to go, at least for the long term...