
Scrapy implementation proposal



Overview

The crawler will be implemented in Scrapy. This page outlines the specification for such an implementation, which should cover the requirements outlined in Core#scrapy_implementation.

Note that this is only a proposal; as such, it may not be fully implemented or may become outdated. For more up-to-date information, please refer to:




Crawl tasks

The core will trigger crawl tasks, each of which translates to a spider run in Scrapy. Each crawl task (typically for a single web entity) will be submitted by the core along with a list of input parameters (a.k.a. spider arguments), and will generate some output after the crawl finishes.
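As an illustration, a crawl task could be expressed as a flat set of spider arguments received by the spider's constructor. The argument names used below (web_entity_id, start_urls, max_depth) are assumptions made for this sketch, not the definitive parameter list.

```python
# Sketch only: a crawl task expressed as spider arguments. Argument names are
# illustrative; scrapyd/Scrapy pass all spider arguments as strings.
import scrapy


class PagesSpider(scrapy.Spider):
    name = "pages"

    def __init__(self, web_entity_id=None, start_urls="", max_depth="3", **kwargs):
        super().__init__(**kwargs)
        self.web_entity_id = web_entity_id
        # compound values arrive as strings, e.g. comma-separated URLs
        self.start_urls = start_urls.split(",") if start_urls else []
        self.max_depth = int(max_depth)
```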

Spider behavior

The spider will start crawling at the start URLs and save each page as a Page item (defined below), following all links that meet certain conditions (to be specified in the crawler documentation).
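Continuing the sketch above, the Page item and the link-following callback might look as follows; the item fields and the should_follow condition are placeholders for the actual conditions to be specified in the crawler documentation.

```python
# Sketch of the Page item and link following; field names and the follow
# condition are placeholders for what the crawler documentation will specify.
import scrapy
from scrapy.linkextractors import LinkExtractor


class Page(scrapy.Item):
    url = scrapy.Field()
    body = scrapy.Field()        # full page body, kept only in the pagestore
    depth = scrapy.Field()
    links = scrapy.Field()


class PagesSpider(scrapy.Spider):  # continuation of the spider sketched above
    name = "pages"

    def parse(self, response):
        depth = response.meta.get("depth", 0)
        links = [link.url for link in LinkExtractor().extract_links(response)]
        yield Page(url=response.url, body=response.text,
                   depth=depth, links=links)
        for url in links:
            if self.should_follow(url, depth):       # placeholder condition
                yield scrapy.Request(url, callback=self.parse)

    def should_follow(self, url, depth):
        # Placeholder: the real conditions (URL prefixes, depth limit, etc.)
        # are to be defined in the crawler documentation.
        return depth < getattr(self, "max_depth", 3)
```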

Spider output

Scrapy will store the full page in a long-term page storage and inject a reduced item (without the body field) into a queue that is consumed by the core.
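One way to realize this split is an item pipeline that writes the full item to the pagestore and pushes a copy without the body field onto the output queue. The pagestore and queue objects below are placeholders for the Kyoto Cabinet structures described under "Storage and queues".

```python
# Sketch of an item pipeline: full item to the pagestore, reduced item
# (without body) to the output queue. The pagestore/queue backends are
# placeholders for the Kyoto Cabinet structures described below.
import json


class PageStoragePipeline:
    def __init__(self, pagestore, queue):
        self.pagestore = pagestore   # key-value store, keyed by URL (assumed)
        self.queue = queue           # per-job output queue (assumed interface)

    def process_item(self, item, spider):
        page = dict(item)
        self.pagestore.set(page["url"], json.dumps(page))        # full page
        reduced = {k: v for k, v in page.items() if k != "body"}
        self.queue.push(json.dumps(reduced))                     # reduced item
        return item
```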

Crawler web service (scrapyd)

  • the crawler will run on scrapyd
  • the core will interact with the crawler through the scrapyd web service
  • the scrapyd API is documented here
  • the scrapyd API will be extended to support cancelling jobs and querying for pending, running and completed jobs (a sketch of these calls follows this list)
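For reference, the interaction could look like the sketch below. schedule.json is part of the existing scrapyd API; listjobs.json and cancel.json stand in for the proposed extensions for cancelling jobs and listing pending, running and finished jobs, so their exact names and parameters are assumptions.

```python
# Sketch of core <-> scrapyd interaction over the JSON web service.
# schedule.json exists today; listjobs.json / cancel.json stand in for the
# proposed extensions, so their names and parameters are assumptions.
import json
import urllib.parse
import urllib.request

SCRAPYD = "http://localhost:6800"        # default scrapyd address (assumed)


def post(endpoint, **params):
    data = urllib.parse.urlencode(params).encode()
    with urllib.request.urlopen("%s/%s" % (SCRAPYD, endpoint), data=data) as resp:
        return json.loads(resp.read())


def get(endpoint, **params):
    query = urllib.parse.urlencode(params)
    with urllib.request.urlopen("%s/%s?%s" % (SCRAPYD, endpoint, query)) as resp:
        return json.loads(resp.read())


# Schedule a crawl task (project, spider and argument names are illustrative).
job = post("schedule.json", project="crawler", spider="pages",
           start_urls="http://example.com/")

# Proposed extensions: list pending/running/finished jobs, cancel a job.
jobs = get("listjobs.json", project="crawler")
post("cancel.json", project="crawler", job=job["jobid"])
```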

Storage and queues

  • The scraped items will be stored in a key-value store known as the "pagestore", serialized in JSON format
  • The scraped items will also be put in a queue (without the body), also serialized in JSON format
  • Both the queue and the pagestore will be implemented using Kyoto Cabinet, a modern, fast DBM that provides mechanisms for implementing queues (see this page for more info).
  • There will be one queue per crawl job, and a single global pagestore.
Using Kyoto Cabinet for both the queue and the pagestore simplifies the dependencies, and also provides the features required to use the queue directly for passing items, instead of storing them in a separate disk file and using the queue for notification only. This will simplify the implementation of both the consumer and the producer (the core and the crawler).
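A rough sketch of both structures on top of the kyotocabinet Python binding is given below: the pagestore as a hash database keyed by URL, and the per-job queue as a B+ tree database whose ordered keys give FIFO behaviour (the mechanism suggested in the Kyoto Cabinet documentation). The file names, key scheme and helper functions are assumptions made for this sketch.

```python
# Sketch of the pagestore and a per-job queue on top of Kyoto Cabinet.
# File names, key scheme and helper functions are illustrative only.
import json
import struct
from kyotocabinet import DB

# Pagestore: one global hash database (.kch), keyed by URL, JSON values.
pagestore = DB()
pagestore.open("pagestore.kch", DB.OWRITER | DB.OCREATE)
pagestore.set("http://example.com/",
              json.dumps({"url": "http://example.com/", "body": "<html>..."}))

# Queue: one B+ tree database (.kct) per crawl job; keys sort
# lexicographically, so fixed-width big-endian counters keep FIFO order.
queue = DB()
queue.open("queue-JOBID.kct", DB.OWRITER | DB.OCREATE)


def push(db, value):
    cur = db.cursor()
    last = struct.unpack(">q", cur.get_key())[0] if cur.jump_back() else -1
    db.set(struct.pack(">q", last + 1), value)


def pop(db):
    cur = db.cursor()
    if not cur.jump():              # oldest record, i.e. smallest key
        return None                 # queue is empty
    value = cur.get_value()
    cur.remove()                    # consume the record
    return value


push(queue, json.dumps({"url": "http://example.com/", "depth": 0}))
item = pop(queue)                   # the core would consume items like this
```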

Note that there won't be an input queue to the crawler. Scrapyd will manage its input queue internally; the core won't access it directly but will go through the scrapyd API instead (by calling schedule.json).
