Features and usage examples
Igor Timoshenkov edited this page Jul 21, 2017 · 10 revisions
require_once('vendor/autoload.php');
$parser = new RobotsTxtParser(file_get_contents('http://example.com/robots.txt'));
$parser->setUserAgent('MySimpleBot');
if ($parser->isAllowed('/')) {
// Crawl of the frontpage is Allowed.
}
// or
if ($parser->isDisallowed('/path/to/page.html')) {
// Crawl of /path/to/page.html is Disallowed
}
If you know the HTTP status code of the robots.txt response, set it on the parser; otherwise, skip this step.
$parser->setHttpStatusCode(200);
if ($parser->isAllowed('/')) {
// Crawl of the frontpage is Allowed
// Do something
}
if ($parser->isDisallowed('/path/to/page.html')) {
// Crawl of /path/to/page.html is Disallowed
// Do something
}
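If the robots.txt file is fetched with PHP's HTTP stream wrapper, the status code has to be pulled out of the raw response header lines before it can be passed to `setHttpStatusCode()`. The sketch below is one way to do that; `statusCodeFromHeaders()` is a hypothetical helper, not part of the parser, and the header lines are made-up sample data.

```php
<?php
/**
 * Hypothetical helper (not part of the parser): extract the final HTTP
 * status code from a list of raw response header lines.
 */
function statusCodeFromHeaders(array $headers): ?int
{
    $code = null;
    foreach ($headers as $line) {
        // Status lines look like "HTTP/1.1 200 OK". Redirects produce
        // several of them, so keep overwriting until the last one.
        if (preg_match('#^HTTP/\d+(?:\.\d+)?\s+(\d{3})#', $line, $m)) {
            $code = (int) $m[1];
        }
    }
    return $code; // null when no status line was found
}

// Sample header lines, as they might look after one redirect.
$headers = [
    'HTTP/1.1 301 Moved Permanently',
    'Location: https://example.com/robots.txt',
    'HTTP/1.1 200 OK',
    'Content-Type: text/plain',
];
$status = statusCodeFromHeaders($headers); // 200
```

After a `file_get_contents()` call over HTTP, PHP exposes such lines in `$http_response_header`, so the result could be handed to `$parser->setHttpStatusCode($status)` whenever it is not null.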
Export all sitemap URLs (if any) to an array.
$array = $parser->getSitemaps();
Honoring the crawl delay is highly recommended for high-traffic crawlers.
$delay = $parser->getDelay();
Simplified usage example:
if ($parser->getDelay() > 0) {
// The host doesn't want to be crawled more often than once every X seconds
usleep((int) ($parser->getDelay() * 1000000));
// Tip: you probably want to log the timestamp of the last crawl
// then you'll know if you even have to sleep or not.
}
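The tip about logging the timestamp of the last crawl can be sketched as a small calculation: sleep only for whatever part of the delay has not already elapsed. `secondsToWait()` is a hypothetical helper, and the timestamps are made-up values chosen to keep the arithmetic easy to follow.

```php
<?php
/**
 * Hypothetical helper (not part of the parser): seconds that remain before
 * the next request is allowed, given the crawl delay and the timestamp of
 * the previous request to the same host.
 */
function secondsToWait(float $delay, ?float $lastCrawl, float $now): float
{
    if ($delay <= 0 || $lastCrawl === null) {
        return 0.0; // no delay requested, or nothing crawled yet
    }
    // Never return a negative wait when the delay has already elapsed.
    return max(0.0, $lastCrawl + $delay - $now);
}

$delay     = 5.0;    // e.g. the value returned by getDelay()
$lastCrawl = 1000.0; // timestamp logged after the previous request
$wait      = secondsToWait($delay, $lastCrawl, 1002.0); // 3.0
```

In a real crawler, `$now` would likely be `microtime(true)`, and a positive `$wait` would be passed on as `usleep((int) round($wait * 1000000))`.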
Some websites have mirrors. To deal with them, you'll probably want to use the Host directive (where available). That way you crawl only the main host and ignore duplicate content on the mirrors.
$mainHost = $parser->getHost();
Export an array of dynamic parameters that do not affect the content (e.g. session, user, or referrer identifiers).
$cleanParam = $parser->cleanParam();
You can then iterate over it like this:
foreach ($cleanParam as $path => $paramArray) {
foreach ($paramArray as $param) {
// $param - URL parameter
// $path - URL path prefix
}
}
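One possible use of that array is to normalize URLs before queueing them, so that pages differing only in ignorable parameters are crawled once. The sketch below assumes the `$path => $paramArray` shape shown above and treats `$path` as a plain prefix; `stripCleanParams()` is a hypothetical helper, and the sample rule and URL are made up.

```php
<?php
/**
 * Hypothetical helper (not part of the parser): drop query parameters listed
 * in a Clean-param style array ($pathPrefix => [$param, ...]) from a URL.
 * Any URL fragment is discarded.
 */
function stripCleanParams(string $url, array $cleanParam): string
{
    $parts = parse_url($url);
    if ($parts === false || !isset($parts['query'])) {
        return $url; // nothing to clean
    }
    $path = $parts['path'] ?? '/';
    parse_str($parts['query'], $query);

    foreach ($cleanParam as $prefix => $params) {
        $prefix = (string) $prefix;
        // Assumption: a rule applies when the URL path starts with its prefix.
        if (strncmp($path, $prefix, strlen($prefix)) !== 0) {
            continue;
        }
        foreach ($params as $param) {
            unset($query[$param]);
        }
    }

    // Reassemble the URL from the parts that were present.
    $result = '';
    if (isset($parts['scheme'], $parts['host'])) {
        $result = $parts['scheme'] . '://' . $parts['host'];
        if (isset($parts['port'])) {
            $result .= ':' . $parts['port'];
        }
    }
    $result .= $path;
    if ($query !== []) {
        $result .= '?' . http_build_query($query);
    }
    return $result;
}

$sampleRules = ['/forum/' => ['sid', 'ref']]; // made-up sample data
$clean = stripCleanParams(
    'http://example.com/forum/showthread.php?sid=abc&t=8243&ref=x',
    $sampleRules
);
// $clean is "http://example.com/forum/showthread.php?t=8243"
```

URLs whose path does not start with a listed prefix keep their parameters unchanged.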
If you ever want to export the rules as an array, here is how:
// Rules for all user agents
$rules = $parser->getRules();
// Rules for a specific user agent
$rules = $parser->getRules('mySimpleBot');
$content = $parser->getContent();
$log = $parser->getLog();