The Crawling Process
The crawling process for a given link consists of the following steps (illustrated in the sketch after the list):
- Downloading the page
- Extracting links
- Parsing items
- Processing items
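As a quick illustration of how these stages fit together, here is a minimal sketch. It assumes roboto's `parseField` and `crawl` methods and a cheerio-style `$` selector passed to field parsers; the `foonews.com` URL is a placeholder.

```js
var roboto = require('roboto');

// Minimal sketch: pages from startUrls are downloaded, their links are
// extracted and queued, and each registered field parser runs against the
// downloaded page to build an item.
var fooCrawler = new roboto.Crawler({
  startUrls: [
    "http://www.foonews.com/latest",
  ],
});

// Parse an item field from each crawled page (assumes parseField receives
// the response and a cheerio-style $ selector).
fooCrawler.parseField('title', function(response, $) {
  return $('head title').text();
});

// Start the crawl.
fooCrawler.crawl();
```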
By default, roboto will extract all links from a page and add them to the queue of pages to be crawled unless they:
- Don't contain an `href` attribute.
- Have `rel="nofollow"` or `rel="noindex"`.
- Don't belong to a domain listed in the crawler's `allowedDomains` list.
- Match a rule on the crawler's `blacklist`.
- Don't match a rule on the crawler's `whitelist`.
- Have already been crawled.

Several of these filters correspond to options on the crawler constructor, as sketched below.
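This example is a hedged sketch: the domain and the patterns are placeholders, and it assumes `blacklist` and `whitelist` rules are given as regular expressions.

```js
var roboto = require('roboto');

var fooCrawler = new roboto.Crawler({
  startUrls: [
    "http://www.foonews.com/latest",
  ],
  // Links outside these domains are not queued.
  allowedDomains: [
    "foonews.com",
  ],
  // Links matching a blacklist rule are skipped.
  blacklist: [
    /\/login/,
  ],
  // If a whitelist is present, links must match one of its rules.
  whitelist: [
    /\/articles\//,
  ],
});
```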
Also, pages will not be processed if the page's `<head>` contains a tag like:

```html
<meta name="robots" content="nofollow">
```
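For illustration only (this is not roboto's internal code), a check like the following, using cheerio, could detect such a tag:

```js
var cheerio = require('cheerio');

// Returns false when a robots meta tag on the page asks crawlers not to
// follow its links.
function pageAllowsFollowing(html) {
  var $ = cheerio.load(html);
  var content = $('head meta[name="robots"]').attr('content') || '';
  return content.toLowerCase().indexOf('nofollow') === -1;
}
```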
In addition to the rules outlined above, roboto will also obey directives contained in a domain's `robots.txt` file. Directives are parsed as outlined here.

If the `robots.txt` file specifies a `Crawl-Delay` directive, that directive is given precedence over the `requestDelay` option passed to the crawler constructor.
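In other words, the effective delay could be computed roughly as follows. This is an illustration, not roboto's internal code, and it assumes `Crawl-Delay` is given in seconds and `requestDelay` in milliseconds:

```js
// Pick the delay to wait between requests to a host.
function effectiveDelayMs(crawlDelaySeconds, requestDelayMs) {
  if (typeof crawlDelaySeconds === 'number') {
    // A Crawl-Delay directive from robots.txt wins over requestDelay.
    return crawlDelaySeconds * 1000;
  }
  return requestDelayMs;
}
```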
You can set the option `obeyRobotsTxt` to `false` in the constructor to disregard the rules in `robots.txt` files:

```js
var roboto = require('roboto');

var fooCrawler = new roboto.Crawler({
  startUrls: [
    "http://www.foonews.com/latest",
  ],
  obeyRobotsTxt: false
});
```
Before roboto crawls a URL, it fetches the domain's `robots.txt` file, parses the directives, and skips crawling the URL if a directive disallows it. The fetched `robots.txt` file is then cached.
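A simplified picture of that caching step (illustrative only; the `fetchAndParse` helper is hypothetical):

```js
// Cache parsed robots.txt rules per host so each file is fetched only once.
var robotsCache = {};

function getRobotsRules(host, fetchAndParse, callback) {
  if (robotsCache[host]) {
    return callback(null, robotsCache[host]);
  }
  fetchAndParse('http://' + host + '/robots.txt', function(err, rules) {
    if (err) return callback(err);
    robotsCache[host] = rules; // reused for later urls on this host
    callback(null, rules);
  });
}
```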
Note that roboto will fetch the `robots.txt` of subdomains. For example, when crawling http://news.ycombinator.com, the file at http://news.ycombinator.com/robots.txt will be fetched, not http://ycombinator.com/robots.txt.
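For example, deriving the robots.txt location from a page URL with Node's built-in `url` module keeps the full host, subdomain included (illustrative only):

```js
var url = require('url');

// Build the robots.txt URL for the host that serves a given page.
function robotsTxtUrl(pageUrl) {
  var parsed = url.parse(pageUrl);
  return parsed.protocol + '//' + parsed.host + '/robots.txt';
}

robotsTxtUrl('http://news.ycombinator.com/newest');
// => 'http://news.ycombinator.com/robots.txt'
```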
URL Normalization
Also known as URL canonicalization, this is the process of reducing syntactically different URLs to a common simplified form. This is useful while crawling to ensure that multiple URLs pointing to the same page don't get crawled more than once.
By default, roboto normalizes URLs with the following procedure (a rough code sketch follows the list):
- Unescaping URL encoding: `/foo/%7Eexample => /foo/~example`
- Converting relative URLs to absolute: `/foo.html => http://example.com/foo.html`
- Fully resolving paths: `/foo/../bar/baz.html => /bar/baz.html`
- Discarding fragments: `/foo.html#bar => /foo.html`
- Discarding query params: `/foo.html?bar=baz => /foo.html`
- Discarding directory indexes: `/foo/index.html => /foo`
  - `index.html`, `index.php`, `default.asp`, and `default.aspx` are all discarded.
- Removing multiple occurrences of '/': `/foo//bar///baz => /foo/bar/baz`
- Removing trailing '/': `/foo/bar/ => /foo/bar`
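The sketch below is a rough approximation of this procedure using Node's built-in `url` module; it illustrates the steps above and is not roboto's actual implementation.

```js
var url = require('url');

// Approximate the normalization steps listed above.
function normalizeUrl(href, baseUrl) {
  // Relative => absolute, with '.' and '..' segments resolved.
  var absolute = url.resolve(baseUrl, href);
  var parsed = url.parse(absolute);

  // Unescape percent-encoding (e.g. %7E => ~).
  var pathname = decodeURIComponent(parsed.pathname || '/');

  // Discard directory indexes, collapse repeated '/', drop trailing '/'.
  pathname = pathname.replace(/\/(index\.html|index\.php|default\.aspx?)$/i, '');
  pathname = pathname.replace(/\/{2,}/g, '/');
  pathname = pathname.replace(/\/$/, '');

  // Fragments and query params are discarded entirely.
  return parsed.protocol + '//' + parsed.host + pathname;
}

normalizeUrl('/foo/../bar//index.html?baz=1#qux', 'http://example.com/');
// => 'http://example.com/bar'
```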
Discarding query params altogether isn't optimal. A planned enhancement is to sort query params, and possibly to detect safe params to remove (sort, rows, etc.).
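A sketch of what that enhancement might look like with Node's built-in `querystring` module (the list of params considered safe to drop is a placeholder):

```js
var querystring = require('querystring');

// Keep query params, but sort them and drop ones assumed not to affect
// page content, so equivalent URLs normalize to the same string.
function normalizeQuery(search) {
  var params = querystring.parse(search.replace(/^\?/, ''));
  var droppable = ['sort', 'rows'];
  var keys = Object.keys(params)
    .filter(function(key) { return droppable.indexOf(key) === -1; })
    .sort();
  var sorted = {};
  keys.forEach(function(key) { sorted[key] = params[key]; });
  var qs = querystring.stringify(sorted);
  return qs ? '?' + qs : '';
}

normalizeQuery('?rows=10&b=2&a=1');  // => '?a=1&b=2'
```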