
Features and usage examples


Get started using this sample code:

require_once('vendor/autoload.php');

$parser = new RobotsTxtParser(file_get_contents('http://example.com/robots.txt'));
$parser->setUserAgent('MySimpleBot');

if ($parser->isAllowed('/')) {
	// Crawl of the frontpage is Allowed.
}
// or
if ($parser->isDisallowed('/path/to/page.html')) {
	// Crawl of /path/to/page.html is Disallowed
}
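
Note that file_get_contents() returns false if the fetch fails, so in real code you'll want to guard against that before handing the result to the parser. A minimal sketch:

$content = @file_get_contents('http://example.com/robots.txt');
if ($content === false) {
	// No robots.txt could be fetched; an empty string is typically treated as "everything allowed"
	$content = '';
}
$parser = new RobotsTxtParser($content);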

Basic features

HTTP status code

If you know the HTTP status code of the robots.txt response, pass it to the parser; otherwise simply skip this step.

$parser->setHttpStatusCode(200);
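
If you fetch robots.txt yourself, the status code is usually already at hand. Here is a sketch using cURL (the cURL fetch is our own plumbing, not part of this library):

$ch = curl_init('http://example.com/robots.txt');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$content = curl_exec($ch);
$statusCode = curl_getinfo($ch, CURLINFO_HTTP_CODE); // e.g. 200, 404 or 503
curl_close($ch);

$parser = new RobotsTxtParser($content ?: '');
$parser->setHttpStatusCode($statusCode);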

Check if we're allowed in

if ($parser->isAllowed('/')) {
	// Crawl of the frontpage is Allowed
	// Do something
}
if ($parser->isDisallowed('/path/to/page.html')) {
	// Crawl of /path/to/page.html is Disallowed
	// Do something
}

Sitemaps

Export all sitemap URLs (if any) to an array.

$array = $parser->getSitemaps();
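
A minimal usage sketch, assuming you want to queue every listed sitemap for fetching:

foreach ($parser->getSitemaps() as $sitemapUrl) {
	// The array is empty if the robots.txt declares no sitemaps
	echo $sitemapUrl . PHP_EOL;
}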

Advanced features

Crawl-delay

Honoring the crawl delay is highly recommended for high-traffic crawlers.

$delay = $parser->getDelay();

Simplified usage example

if ($parser->getDelay() > 0) {
	// The host doesn't want to be crawled more often than once every X seconds
	usleep($parser->getDelay() * 1000000);
	// Tip: you probably want to log the timestamp of the last crawl
	// then you'll know if you even have to sleep or not.
}
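
To flesh out that tip, here is a hedged sketch of the timestamp bookkeeping; $lastCrawlTime is a hypothetical value you would persist per host (e.g. in a database or cache), not something this library provides:

$delay = $parser->getDelay();
$elapsed = microtime(true) - $lastCrawlTime; // seconds since this host was last crawled
if ($elapsed < $delay) {
	// Sleep only for the remaining part of the delay
	usleep((int) (($delay - $elapsed) * 1000000));
}
$lastCrawlTime = microtime(true); // remember this crawl for next time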

Host directive

Some websites have mirrors. To deal with them, you'll probably want to use the Host directive (where available). That way you can make sure you crawl only the main host, while ignoring any duplicate content.

$mainHost = $parser->getHost();
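
As a sketch of how you might use the value (the comparison below is our own logic, and assumes getHost() returns a bare hostname; some robots.txt files include a scheme in the Host directive, so you may need to normalize it first):

$mainHost = $parser->getHost(); // e.g. 'example.com'
$candidateUrl = 'http://mirror.example.com/page.html';

if ($mainHost && parse_url($candidateUrl, PHP_URL_HOST) !== $mainHost) {
	// The URL points at a mirror of the main host; skip it to avoid duplicate content
}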

Clean-Param directive

Export an array of dynamic URL parameters that do not affect page content (e.g. session IDs, user IDs, referrers, etc.).

$cleanParam = $parser->cleanParam();

You can then iterate over the result:

foreach ($cleanParam as $path => $paramArray) {
	foreach ($paramArray as $param) {
		// $param - URL parameter
		// $path - URL path prefix
	}
}
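
As a hedged example of what to do with this data, the helper below (hypothetical, not part of the library) strips the listed parameters from a URL before you compare URLs for duplicates:

function stripCleanParams($url, array $cleanParam)
{
	$parts = parse_url($url);
	if (empty($parts['query']) || empty($parts['path'])) {
		return $url;
	}
	parse_str($parts['query'], $query);
	foreach ($cleanParam as $path => $paramArray) {
		// Only apply rules whose path prefix matches this URL
		if (strpos($parts['path'], $path) === 0) {
			foreach ($paramArray as $param) {
				unset($query[$param]);
			}
		}
	}
	$queryString = http_build_query($query);
	return $parts['scheme'] . '://' . $parts['host'] . $parts['path']
		. ($queryString !== '' ? '?' . $queryString : '');
}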

Export rules

If you ever want to export the rules as an array, here is how:

// Rules for all user agents
$rules = $parser->getRules();
// Rules for a specific user agent
$rules = $parser->getRules('MySimpleBot');
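
The exact layout of the returned array depends on the parsed file and the library version, so the quickest way to see what you get is to dump it:

print_r($parser->getRules('MySimpleBot'));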

Export the robots.txt content

$content = $parser->getContent();

Get the log

$log = $parser->getLog();
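
A minimal sketch for persisting it; the structure of the individual entries depends on the library version, so print_r() is used here to stay agnostic:

foreach ($parser->getLog() as $entry) {
	error_log(print_r($entry, true));
}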