Scrapy-rabbitmq is a tool that lets you feed and queue URLs from RabbitMQ via Scrapy spiders.
Inspired by and modeled after scrapy-redis.
Using pip, type in your command-line prompt:
pip install scrapy-rabbitmq
Or clone the repo and, inside the scrapy-rabbitmq directory, type:
python setup.py install
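To confirm the install worked, you can check that the package imports cleanly (a quick sanity check; the module name matches the imports used in the examples below):

python -c "import scrapy_rabbitmq"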
Step 1: In your Scrapy settings, add the following config values:

# Enables scheduling: requests are stored and scheduled in RabbitMQ.
SCHEDULER = "scrapy_rabbitmq.scheduler.Scheduler"
# Don't clean up RabbitMQ queues; this allows crawls to be paused and resumed.
SCHEDULER_PERSIST = True
# Schedule requests using a priority queue. (default)
SCHEDULER_QUEUE_CLASS = 'scrapy_rabbitmq.queue.SpiderQueue'
# RabbitMQ queue in which to store requests.
RABBITMQ_QUEUE_NAME = 'scrapy_queue'
# Host and port of the RabbitMQ daemon.
RABBITMQ_CONNECTION_PARAMETERS = {'host': 'localhost', 'port': 6666}
# Store scraped items in RabbitMQ for post-processing.
ITEM_PIPELINES = {
    'scrapy_rabbitmq.pipelines.RabbitMQPipeline': 1
}
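With the pipeline enabled, scraped items are published to RabbitMQ, so post-processing can happen in a separate process. Below is a minimal consumer sketch using pika's older (0.x) blocking API, the same API used in the Step 4 example. It assumes items land on the queue named by RABBITMQ_QUEUE_NAME; the pipeline may key its queue differently (for example by spider name), so check the pipeline source:

#!/usr/bin/env python
# Sketch: consume items published by RabbitMQPipeline for post-processing.
# Assumes pika 0.x and that items arrive on settings.RABBITMQ_QUEUE_NAME;
# verify the actual queue name against the pipeline source.
import pika
import settings

connection = pika.BlockingConnection(
    pika.ConnectionParameters(**settings.RABBITMQ_CONNECTION_PARAMETERS))
channel = connection.channel()

def on_item(channel, method, properties, body):
    # Post-process the serialized item here.
    print(body)
    channel.basic_ack(delivery_tag=method.delivery_tag)

channel.basic_consume(on_item, queue=settings.RABBITMQ_QUEUE_NAME)
channel.start_consuming()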
Step 2: Add RabbitMQMixin to your spider:

from scrapy.contrib.spiders import CrawlSpider
from scrapy_rabbitmq.spiders import RabbitMQMixin

class MultiDomainSpider(RabbitMQMixin, CrawlSpider):
    name = 'multidomain'

    def parse(self, response):
        # parse all the things
        pass
Step 3: Run the spider using the scrapy client:
scrapy runspider multidomain_spider.py
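If the spider lives inside a Scrapy project rather than a standalone file, it can equivalently be started by the name defined above with the crawl command:

scrapy crawl multidomain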
Step 4: Push URLs to RabbitMQ:

#!/usr/bin/env python
import pika
import settings

connection = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
channel = connection.channel()

channel.basic_publish(exchange='',
                      routing_key=settings.RABBITMQ_QUEUE_NAME,
                      body='<html>raw html contents<a href="http://twitter.com/roycehaynes">extract url</a></html>')

connection.close()
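To confirm the message was enqueued, you can ask RabbitMQ for the queue depth with a passive declare (a sketch, again assuming pika's 0.x blocking API):

#!/usr/bin/env python
# Sketch: passively declare the queue to read its current message count.
# A passive declare fails if the queue does not exist, so this also
# verifies that the queue was created.
import pika
import settings

connection = pika.BlockingConnection(pika.ConnectionParameters('localhost'))
channel = connection.channel()

result = channel.queue_declare(queue=settings.RABBITMQ_QUEUE_NAME, passive=True)
print('messages waiting: %d' % result.method.message_count)

connection.close()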
See the changelog for release details.
Version | Release Date
--- | ---
0.1.0 | 2014-11-14
0.1.1 | 2015-07-02
Copyright (c) 2015 Royce Haynes - Released under The MIT License.