The Crawling Process

The crawling process for a given link consists of the following steps (illustrated in the sketch after the list):

  1. Downloading the page
  2. Extracting links
  3. Parsing items
  4. Processing items
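
Steps 3 and 4 are driven by the parsers and item pipelines registered on the crawler. A minimal sketch, assuming roboto's parseField and pipeline hooks (the 'title' field and the saveItem function are illustrative placeholders):

var roboto = require('roboto');

var fooCrawler = new roboto.Crawler({
  startUrls: [
    "http://www.foonews.com/latest",
  ]
});

// Step 3: parse an item from each downloaded page by extracting fields.
fooCrawler.parseField('title', function(response, $) {
  return $('head title').text();
});

// Step 4: process each parsed item, e.g. hand it off to storage.
fooCrawler.pipeline(function(item) {
  saveItem(item); // placeholder for your own processing logic
});

fooCrawler.crawl();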

Link Extraction

By default, roboto will extract all links from a page and add them to the queue of pages to be crawled unless they (see the example configuration after this list):

  • Don't contain an href attribute.
  • Have rel="nofollow" or rel="noindex".
  • Don't belong to a domain listed in the crawler's allowedDomains list.
  • Match a rule on the crawler's blacklist.
  • Don't match a rule on the crawler's whitelist.
  • Have already been crawled.
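
The domain and pattern rules above correspond to options passed to the crawler constructor. An illustrative configuration (the domains and patterns shown are examples, not defaults):

var fooCrawler = new roboto.Crawler({
  startUrls: [
    "http://www.foonews.com/latest",
  ],
  // Links are only followed if they belong to one of these domains.
  allowedDomains: [
    "foonews.com",
  ],
  // Links matching any blacklist rule are skipped.
  blacklist: [
    /accounts/,
    /privacy/,
  ],
  // If a whitelist is present, links must match one of its rules to be followed.
  whitelist: [
    /news/,
  ],
});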

Also, pages will not be processed if the page's <head> contains a tag like:

  <meta name="robots" content="nofollow">
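
A check for this could look roughly like the following sketch, assuming a cheerio-style $ loaded with the page's markup:

// Returns false when the page opts out via <meta name="robots" content="nofollow">.
function pageAllowsProcessing($) {
  var content = $('head meta[name="robots"]').attr('content') || '';
  return content.toLowerCase().indexOf('nofollow') === -1;
}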

robots.txt

In addition to the rules outlined above, roboto will also obey directives contained in a domain's robots.txt file. Directives are parsed as outlined here.

If the robots.txt file specifies a Crawl-Delay directive, that will be given precedence over the requestDelay option passed to the crawler constructor.
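
For example, a robots.txt containing the directives below would cause roboto to skip urls under /private/ and to use the Crawl-Delay value in place of the crawler's requestDelay:

  User-agent: *
  Crawl-Delay: 10
  Disallow: /private/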

You can set the option obeyRobotsTxt to false in the constructor to disregard the rules in robots.txt files:

var fooCrawler = new roboto.Crawler({
  startUrls: [
    "http://www.foonews.com/latest",
  ],
  obeyRobotsTxt: false
});

Before roboto crawls a url, it will fetch the domain's robots.txt file, parse the directives, and skip the url if a directive disallows crawling it. The fetched robots.txt file is then cached.
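
Conceptually, the fetch-check-cache flow looks something like this simplified sketch (it only handles bare Disallow prefixes and takes a fetchRobotsTxt callback as a placeholder; it is not roboto's actual implementation):

var robotsCache = {}; // domain -> array of disallowed path prefixes

function isAllowed(domain, path, fetchRobotsTxt, done) {
  if (robotsCache[domain]) {
    return done(checkRules(robotsCache[domain], path));
  }
  // Fetch http://<domain>/robots.txt once, parse it, and cache the rules.
  fetchRobotsTxt(domain, function(err, body) {
    var rules = err ? [] : parseDisallows(body);
    robotsCache[domain] = rules;
    done(checkRules(rules, path));
  });
}

function parseDisallows(body) {
  return body.split('\n')
    .map(function(line) { return line.trim(); })
    .filter(function(line) { return /^Disallow:/i.test(line); })
    .map(function(line) { return line.replace(/^Disallow:\s*/i, ''); })
    .filter(function(prefix) { return prefix.length > 0; });
}

function checkRules(disallowedPrefixes, path) {
  return !disallowedPrefixes.some(function(prefix) {
    return path.indexOf(prefix) === 0;
  });
}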

Note that roboto will fetch the robots.txt of subdomains. For example, when crawling http://news.ycombinator.com, http://news.ycombinator.com/robots.txt will be fetched, not http://ycombinator.com/robots.txt.

Url Normalization

Also known as URL canonicalization

This is the process of reducing syntactically different urls to a common simplified form. This is useful while crawling to ensure that a page reachable through several different urls is only crawled once.

By default, roboto normalizes urls using the following procedure (a rough code sketch follows the list):

  • Unescaping url encoding /foo/%7Eexample => /foo/~example
  • Converting relative urls to absolute /foo.html => http://example.com/foo.html
  • Fully resolving paths /foo/../bar/baz.html => /bar/baz.html
  • Discarding fragments /foo.html#bar => /foo.html
  • Discarding query params /foo.html?baz=qux => /foo.html
  • Discarding directory indexes /foo/index.html => /foo
    • index.html, index.php, default.asp, default.aspx are all discarded.
  • Removing multiple occurrences of '/' /foo//bar///baz => /foo/bar/baz
  • Removing trailing '/' /foo/bar/ => /foo/bar
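
A rough approximation of these steps using Node's built-in url module (an illustration of the procedure, not roboto's actual code):

var url = require('url');

function normalizeUrl(href, pageUrl) {
  // Convert relative urls to absolute and fully resolve '..' segments.
  var resolved = url.resolve(pageUrl, href);
  var parsed = url.parse(resolved);

  // Discard fragments and query params.
  parsed.hash = null;
  parsed.search = null;
  parsed.query = null;

  // Unescape url encoding, e.g. %7E => ~.
  var pathname = decodeURIComponent(parsed.pathname || '/');

  // Discard directory indexes, collapse repeated '/', drop trailing '/'.
  pathname = pathname.replace(/\/(index\.(html|php)|default\.(asp|aspx))$/i, '');
  pathname = pathname.replace(/\/{2,}/g, '/');
  pathname = pathname.replace(/\/+$/, '') || '/';

  parsed.pathname = pathname;
  return url.format(parsed);
}

// normalizeUrl('/foo//bar/../baz/index.html?sort=asc#top', 'http://example.com/page')
//   => 'http://example.com/foo/baz'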

Discarding query params altogether isn't optimal. A planned enhancement is to sort query params and possibly to detect safe params to remove (sort, rows, etc.).
