The Crawling Process

The crawling process for a given link consists of the following steps (illustrated in the sketch after the list):

  1. Downloading the page
  2. Extracting links
  3. Parsing items
  4. Processing items
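
Steps 3 and 4 are driven by the parsers and item pipelines registered on the crawler. A minimal sketch, assuming roboto's parseField and pipeline hooks (the 'title' field and the saveItem function are illustrative placeholders):

var roboto = require('roboto');

var fooCrawler = new roboto.Crawler({
  startUrls: [
    "http://www.foonews.com/latest",
  ]
});

// Step 3: parse an item from each downloaded page by extracting fields.
fooCrawler.parseField('title', function(response, $) {
  return $('head title').text();
});

// Step 4: process each parsed item, e.g. hand it off to storage.
fooCrawler.pipeline(function(item) {
  saveItem(item); // placeholder for your own processing logic
});

fooCrawler.crawl();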

Link Extraction

By default, roboto will extract all links from a page and add them to the queue of pages to be crawled unless they (see the example configuration after this list):

  • Don't contain an href attribute.
  • Have rel="nofollow" or rel="noindex".
  • Don't belong to a domain listed in the crawler's allowedDomains list.
  • Match a rule on the crawler's blacklist.
  • Don't match a rule on the crawler's whitelist.
  • Have already been crawled.
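
The domain and pattern rules above correspond to options passed to the crawler constructor. An illustrative configuration (the domains and patterns shown are examples, not defaults):

var fooCrawler = new roboto.Crawler({
  startUrls: [
    "http://www.foonews.com/latest",
  ],
  // Links are only followed if they belong to one of these domains.
  allowedDomains: [
    "foonews.com",
  ],
  // Links matching any blacklist rule are skipped.
  blacklist: [
    /accounts/,
    /privacy/,
  ],
  // If a whitelist is present, links must match one of its rules to be followed.
  whitelist: [
    /news/,
  ],
});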

Also, pages will not be processed if the page's <head> contains a tag like:

  <meta name="robots" content="nofollow">
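
A check for this could look roughly like the following sketch, assuming a cheerio-style $ loaded with the page's markup:

// Returns false when the page opts out via <meta name="robots" content="nofollow">.
function pageAllowsProcessing($) {
  var content = $('head meta[name="robots"]').attr('content') || '';
  return content.toLowerCase().indexOf('nofollow') === -1;
}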

robots.txt

In addition to the rules outlined above, roboto will also obey directives contained in a domain's robots.txt file. Directives are parsed as outlined here.

If the robots.txt file specifies a Crawl-Delay directive, that will be given precedence over the requestDelay option passed to the crawler constructor.
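
For example, a robots.txt containing the directives below would cause roboto to skip urls under /private/ and to use the Crawl-Delay value in place of the crawler's requestDelay:

  User-agent: *
  Crawl-Delay: 10
  Disallow: /private/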

You can set the option obeyRobotsTxt to false in the constructor to disregard the rules in robots.txt files:

var fooCrawler = new roboto.Crawler({
  startUrls: [
    "http://www.foonews.com/latest",
  ],
  obeyRobotsTxt: false
});

Before roboto crawls a url, it will fetch the domain's robots.txt file, parse the directives, and skip the url if a directive disallows crawling it. The fetched robots.txt file is then cached.
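
Conceptually, the fetch-check-cache flow looks something like this simplified sketch (it only handles bare Disallow prefixes and takes a fetchRobotsTxt callback as a placeholder; it is not roboto's actual implementation):

var robotsCache = {}; // domain -> array of disallowed path prefixes

function isAllowed(domain, path, fetchRobotsTxt, done) {
  if (robotsCache[domain]) {
    return done(checkRules(robotsCache[domain], path));
  }
  // Fetch http://<domain>/robots.txt once, parse it, and cache the rules.
  fetchRobotsTxt(domain, function(err, body) {
    var rules = err ? [] : parseDisallows(body);
    robotsCache[domain] = rules;
    done(checkRules(rules, path));
  });
}

function parseDisallows(body) {
  return body.split('\n')
    .map(function(line) { return line.trim(); })
    .filter(function(line) { return /^Disallow:/i.test(line); })
    .map(function(line) { return line.replace(/^Disallow:\s*/i, ''); })
    .filter(function(prefix) { return prefix.length > 0; });
}

function checkRules(disallowedPrefixes, path) {
  return !disallowedPrefixes.some(function(prefix) {
    return path.indexOf(prefix) === 0;
  });
}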

Note that roboto will fetch the robots.txt of subdomains. For example, when crawling http://news.ycombinator.com, http://news.ycombinator.com/robots.txt will be fetched, not http://ycombinator.com/robots.txt.

Url Normalization

Also known as URL canonicalization

This is the process of reducing syntactically different urls to a common simplified form. This is useful while crawling to ensure that a page reachable through several different urls is only crawled once.

By default, roboto normalizes urls using the following procedure (a rough code sketch follows the list):

  • Unescaping url encoding /foo/%7Eexample => /foo/~example
  • Converting relative urls to absolute /foo.html => http://example.com/foo.html
  • Fully resolving paths /foo/../bar/baz.html => /bar/baz.html
  • Discarding fragments /foo.html#bar => /foo.html
  • Discarding query params /foo.html?baz=qux => /foo.html
  • Discarding directory indexes /foo/index.html => /foo
    • index.html, index.php, default.asp, default.aspx are all discarded.
  • Removing multiple occurrences of '/' /foo//bar///baz => /foo/bar/baz
  • Removing trailing '/' /foo/bar/ => /foo/bar
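
A rough approximation of these steps using Node's built-in url module (an illustration of the procedure, not roboto's actual code):

var url = require('url');

function normalizeUrl(href, pageUrl) {
  // Convert relative urls to absolute and fully resolve '..' segments.
  var resolved = url.resolve(pageUrl, href);
  var parsed = url.parse(resolved);

  // Discard fragments and query params.
  parsed.hash = null;
  parsed.search = null;
  parsed.query = null;

  // Unescape url encoding, e.g. %7E => ~.
  var pathname = decodeURIComponent(parsed.pathname || '/');

  // Discard directory indexes, collapse repeated '/', drop trailing '/'.
  pathname = pathname.replace(/\/(index\.(html|php)|default\.(asp|aspx))$/i, '');
  pathname = pathname.replace(/\/{2,}/g, '/');
  pathname = pathname.replace(/\/+$/, '') || '/';

  parsed.pathname = pathname;
  return url.format(parsed);
}

// normalizeUrl('/foo//bar/../baz/index.html?sort=asc#top', 'http://example.com/page')
//   => 'http://example.com/foo/baz'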

Discarding query params altogether isn't optimal. A planned enhancement is to sort query params and possibly to detect safe params to remove (sort, rows, etc.).
