docs(guides): add 'Running in web server' guide #2543

Merged 5 commits on Jun 24, 2024
60 changes: 60 additions & 0 deletions docs/guides/running-in-web-server/running-in-web-server.mdx
@@ -0,0 +1,60 @@
---
id: running-in-web-server-guide
title: Running in a web server
sidebar_label: Running in a web server
description: Run Crawlee in a web server using a request/response approach
---

import WebServerSource from '!!raw-loader!./web-server.mjs';

Most of the time, Crawlee jobs run as batch jobs: you have a list of URLs you want to scrape every week, or you might want to scrape a whole website once per day. After the scrape, you send the data to your warehouse for analytics. Batch jobs are efficient because they can use [Crawlee's built-in autoscaling](https://crawlee.dev/docs/guides/scaling-crawlers) to fully utilize the resources you have available. But sometimes you have a use case where you need to return scraped data as soon as possible. There might be a user waiting on the other end, so every millisecond counts. This is where running Crawlee in a web server comes in.

We will build a simple HTTP server that receives a page URL and returns the page title in the response. We will base this guide on the approach used in [Apify's Super Scraper API repository](https://github.com/apify/super-scraper), which maps incoming HTTP requests to Crawlee <ApiLink to="core/class/Request">Requests</ApiLink>.


## Setting up a web server

There are many popular web server frameworks for Node.js, such as Express, Koa, Fastify, and Hapi, but in this guide we will use the built-in `http` Node.js module to keep things simple.

This will be our core server setup:

```javascript
import { createServer } from 'http';
import { log } from 'crawlee';

const server = createServer(async (req, res) => {
    log.info(`Request received: ${req.method} ${req.url}`);

    res.writeHead(200, { 'Content-Type': 'text/plain' });
    // We will return the page title here later instead
    res.end('Hello World\n');
});

server.listen(3000, () => {
    log.info('Server is listening for user requests');
});
```
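
If you run this snippet with Node.js (as an ES module, e.g. a `.mjs` file), the server will answer every request with `Hello World`. A quick sanity check, assuming Node.js 18+ where `fetch` is available globally, can be run from a second terminal or REPL:

```javascript
// Quick check of the placeholder server above (run in a separate process)
const response = await fetch('http://localhost:3000/');
console.log(response.status); // 200
console.log(await response.text()); // Hello World
```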

## Creating the crawler

We will create a standard <ApiLink to="cheerio-crawler/class/CheerioCrawler">CheerioCrawler</ApiLink> and use the <ApiLink to="cheerio-crawler/interface/CheerioCrawlerOptions#keepAlive">`keepAlive: true`</ApiLink> option to keep the crawler running even if there are no requests currently in the <ApiLink to="core/class/RequestQueue">Request Queue</ApiLink>. This way it will always be waiting for new requests to come in.

```javascript
import { CheerioCrawler, log } from 'crawlee';

const crawler = new CheerioCrawler({
    keepAlive: true,
    requestHandler: async ({ request, $ }) => {
        const title = $('title').text();
        // We will send the response here later
        log.info(`Page title: ${title} on ${request.url}`);
    },
});
```
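
Note that with `keepAlive: true`, the promise returned by `crawler.run()` does not resolve once the queue drains; the crawler stays idle and waits. New work is later handed to it with `crawler.addRequests()`, which is exactly what the final program below does from its HTTP handler. A minimal sketch of that interaction:

```javascript
// Because of keepAlive, run() does not resolve when the queue drains,
// so we start the crawler without awaiting it right away.
const crawlerRun = crawler.run();

// At any later point (e.g. inside an HTTP handler) we can hand new work
// to the already running crawler:
await crawler.addRequests([{ url: 'https://example.com', uniqueKey: `${Math.random()}` }]);

// Awaiting the run keeps the process alive while the crawler waits for more requests.
await crawlerRun;
```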

## Gluing it together

Now we need to glue the server and the crawler together using the mapping of Crawlee Requests to HTTP responses described above. The whole program is actually quite simple. For a production-grade service, you would need to improve error handling, logging, and monitoring, but this is a good starting point.

<CodeBlock language="js" title="src/web-server.mjs">{WebServerSource}</CodeBlock>
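
As an example of the error handling mentioned above: if scraping a page fails, the waiting HTTP client should still get an answer instead of hanging forever. A minimal sketch of one way to do that, assuming the `requestsToResponses` map from the listing and using Crawlee's `failedRequestHandler` option, which is called after a request has exhausted its retries:

```javascript
const crawler = new CheerioCrawler({
    keepAlive: true,
    requestHandler: async ({ request, $ }) => {
        // ...same as in the listing above...
    },
    // Called once a request has used up all its retries. Without this,
    // the HTTP response for a failing URL would never be sent.
    failedRequestHandler: async ({ request }, error) => {
        const httpResponse = requestsToResponses.get(request.uniqueKey);
        if (!httpResponse) return;
        httpResponse.writeHead(500, { 'Content-Type': 'application/json' });
        httpResponse.end(JSON.stringify({ error: error.message }));
        requestsToResponses.delete(request.uniqueKey);
    },
});
```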

48 changes: 48 additions & 0 deletions docs/guides/running-in-web-server/web-server.mjs
@@ -0,0 +1,48 @@
import { CheerioCrawler, log } from 'crawlee';
import { createServer } from 'http';

// We will map each Request.uniqueKey to the HTTP response that should answer it
const requestsToResponses = new Map();

const crawler = new CheerioCrawler({
    keepAlive: true,
    requestHandler: async ({ request, $ }) => {
        const title = $('title').text();
        log.info(`Page title: ${title} on ${request.url}, sending response`);

        // We will pick the response from the map and send it to the user
        // We know the response is there with this uniqueKey
        const httpResponse = requestsToResponses.get(request.uniqueKey);
        httpResponse.writeHead(200, { 'Content-Type': 'application/json' });
        httpResponse.end(JSON.stringify({ title }));
        // We can delete the response from the map now to free up memory
        requestsToResponses.delete(request.uniqueKey);
    },
});

const server = createServer(async (req, res) => {
    // We parse the requested URL from the query parameters, e.g. localhost:3000/?url=https://example.com
    const urlObj = new URL(req.url, 'http://localhost:3000');
    const requestedUrl = urlObj.searchParams.get('url');

    log.info(`HTTP request received for ${requestedUrl}, adding to the queue`);
    if (!requestedUrl) {
        log.error('No URL provided as query parameter, returning 400');
        res.writeHead(400, { 'Content-Type': 'application/json' });
        res.end(JSON.stringify({ error: 'No URL provided as query parameter' }));
        return;
    }

    // We first store the response in the map and then enqueue the request to the crawler, which processes it immediately
    // uniqueKey must be random so that the crawler processes the same URL again instead of deduplicating it
    const crawleeRequest = { url: requestedUrl, uniqueKey: `${Math.random()}` };
    requestsToResponses.set(crawleeRequest.uniqueKey, res);
    await crawler.addRequests([crawleeRequest]);
});

// Now we start the server, the crawler and wait for incoming connections
server.listen(3000, () => {
    log.info('Server is listening for user requests');
});

await crawler.run();
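
With the full program above running, a client request could look like this (assuming Node.js 18+ with global `fetch`; the exact title depends on the scraped page):

```javascript
// Ask the server to scrape https://example.com and return its title
const response = await fetch('http://localhost:3000/?url=https://example.com');
console.log(await response.json()); // e.g. { title: 'Example Domain' }
```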