docs(guides): add 'Running in web server' guide (#2543)
I took the Parallel Scraping guide as a template so please check that it makes sense. The code/example are really bare bones and there might be some code smells. I checked that the example works and the text with Grammarly. --------- Co-authored-by: Jan Buchar <[email protected]>
1 parent a167ec3, commit 0d1f644
Showing 2 changed files with 111 additions and 0 deletions.
docs/guides/running-in-web-server/running-in-web-server.mdx (62 changes: 62 additions & 0 deletions)
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,62 @@ | ||
--- | ||
id: running-in-web-server-guide | ||
title: Running in web server Guide | ||
sidebar_label: Running in web server | ||
description: Run Crawlee in a web server using a request/response approach | ||
--- | ||
|
||
import CodeBlock from '@theme/CodeBlock'; | ||
import ApiLink from '@site/src/components/ApiLink'; | ||
|
||
import WebServerSource from '!!raw-loader!./web-server.mjs'; | ||
|
||
Most of the time, Crawlee jobs run as batch jobs. You have a list of URLs you want to scrape every week, or you want to scrape a whole website once per day. After the scrape, you send the data to your warehouse for analytics. Batch jobs are efficient because they can use [Crawlee's built-in autoscaling](https://crawlee.dev/docs/guides/scaling-crawlers) to fully utilize the resources you have available. But sometimes you need to return scraped data as soon as possible. There might be a user waiting on the other end, so every millisecond counts. This is where running Crawlee in a web server comes in. | ||
|
||
We will build a simple HTTP server that receives a page URL and returns the page title in the response. We will base this guide on the approach used in [Apify's Super Scraper API repository](https://github.com/apify/super-scraper), which maps incoming HTTP requests to Crawlee <ApiLink to="core/class/Request">Request</ApiLink> objects. | ||
|
||
## Set up a web server | ||
|
||
There are many popular web server frameworks for Node.js, such as Express, Koa, Fastify, and Hapi, but in this guide we will use the built-in `http` Node.js module to keep things simple. | ||
|
||
This will be our core server setup: | ||
|
||
```javascript | ||
import { createServer } from 'http'; | ||
import { log } from 'crawlee'; | ||
|
||
const server = createServer(async (req, res) => { | ||
log.info(`Request received: ${req.method} ${req.url}`); | ||
|
||
res.writeHead(200, { 'Content-Type': 'text/plain' }); | ||
// We will return the page title here later instead | ||
res.end('Hello World\n'); | ||
}); | ||
|
||
server.listen(3000, () => { | ||
log.info('Server is listening for user requests'); | ||
}); | ||
``` | ||
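The server in this guide receives the page URL from the client, and the full program in this commit reads it from a `url` query parameter. The sketch below shows one way that parsing step could look in isolation. `parseTargetUrl` is a hypothetical helper, not part of Crawlee or of the guide's source, and the restriction to `http:` and `https:` is an added assumption rather than something the guide requires.

```javascript
// Hypothetical helper (illustration only, not a Crawlee API).
// Returns the normalized target URL, or null if it is missing or unusable.
function parseTargetUrl(rawRequestUrl) {
    // req.url contains only the path and query string, so a base is needed.
    const { searchParams } = new URL(rawRequestUrl, 'http://localhost:3000');
    const candidate = searchParams.get('url');
    if (!candidate) return null;

    let parsed;
    try {
        // Throws a TypeError when the value is not an absolute URL.
        parsed = new URL(candidate);
    } catch {
        return null;
    }

    // Assumption: only URLs that an HTTP crawler can fetch are accepted.
    if (parsed.protocol !== 'http:' && parsed.protocol !== 'https:') return null;
    return parsed.href;
}
```

A request handler could call `parseTargetUrl(req.url)` and answer with a 400 status whenever it returns `null`, which keeps the validation separate from the crawling logic.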
|
||
## Create the Crawler | ||
|
||
We will create a standard <ApiLink to="cheerio-crawler/class/CheerioCrawler">CheerioCrawler</ApiLink> and use the <ApiLink to="cheerio-crawler/interface/CheerioCrawlerOptions#keepAlive">`keepAlive: true`</ApiLink> option to keep the crawler running even if there are no requests currently in the <ApiLink to="core/class/RequestQueue">Request Queue</ApiLink>. This way it will always be waiting for new requests to come in. | ||
|
||
```javascript | ||
import { CheerioCrawler, log } from 'crawlee'; | ||
|
||
const crawler = new CheerioCrawler({ | ||
keepAlive: true, | ||
requestHandler: async ({ request, $ }) => { | ||
const title = $('title').text(); | ||
// We will send the response here later | ||
log.info(`Page title: ${title} on ${request.url}`); | ||
}, | ||
}); | ||
``` | ||
|
||
## Glue it together | ||
|
||
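Now we need to glue the server and the crawler together by mapping each Crawlee Request back to the HTTP response that is waiting for it. We keep the pending responses in a `Map` keyed by the Request's `uniqueKey`, so the request handler can look up the right response and send the page title. The whole program is quite simple. For a production-grade service, you will need to improve error handling, logging, and monitoring, but this is a good starting point. | ||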
|
||
<CodeBlock language="js" title="src/web-server.mjs">{WebServerSource}</CodeBlock> | ||
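The program above replies only when the request handler succeeds, so a failed or stalled scrape would leave the client waiting. As one possible direction for the error handling mentioned earlier, the sketch below wraps the `uniqueKey`-to-response map in a small class that also arms a timeout. `PendingResponses`, its method names, the 504 status, and the default timeout are all hypothetical choices for illustration, not part of Crawlee or of the guide's source.

```javascript
// Hypothetical helper class (illustration only, not a Crawlee API).
class PendingResponses {
    constructor(timeoutMs = 30000) {
        this.timeoutMs = timeoutMs;
        // uniqueKey -> { res, timer }
        this.pending = new Map();
    }

    // Remember the HTTP response under the key and arm a timeout for it.
    add(uniqueKey, res) {
        const timer = setTimeout(() => {
            this.resolve(uniqueKey, 504, { error: 'Scrape timed out' });
        }, this.timeoutMs);
        this.pending.set(uniqueKey, { res, timer });
    }

    // Send a JSON reply at most once per key.
    // Returns false when the key is unknown or was already answered.
    resolve(uniqueKey, statusCode, body) {
        const entry = this.pending.get(uniqueKey);
        if (!entry) return false;
        clearTimeout(entry.timer);
        this.pending.delete(uniqueKey);
        entry.res.writeHead(statusCode, { 'Content-Type': 'application/json' });
        entry.res.end(JSON.stringify(body));
        return true;
    }
}
```

Under these assumptions, the request handler would call `resolve(request.uniqueKey, 200, { title })`, and a handler passed to Crawlee's `failedRequestHandler` option could call `resolve` with an error status so that the client always receives an answer.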
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,49 @@ | ||
import { randomUUID } from 'node:crypto'; | ||
import { CheerioCrawler, log } from 'crawlee'; | ||
import { createServer } from 'http'; | ||
|
||
// We will bind an HTTP response that we want to send to the Request.uniqueKey | ||
const requestsToResponses = new Map(); | ||
|
||
const crawler = new CheerioCrawler({ | ||
keepAlive: true, | ||
requestHandler: async ({ request, $ }) => { | ||
const title = $('title').text(); | ||
log.info(`Page title: ${title} on ${request.url}, sending response`); | ||
|
||
// We will pick the response from the map and send it to the user | ||
// We know the response is there with this uniqueKey | ||
const httpResponse = requestsToResponses.get(request.uniqueKey); | ||
httpResponse.writeHead(200, { 'Content-Type': 'application/json' }); | ||
httpResponse.end(JSON.stringify({ title })); | ||
// We can delete the response from the map now to free up memory | ||
requestsToResponses.delete(request.uniqueKey); | ||
}, | ||
}); | ||
|
||
const server = createServer(async (req, res) => { | ||
// We parse the requested URL from the query parameters, e.g. localhost:3000/?url=https://example.com | ||
const urlObj = new URL(req.url, 'http://localhost:3000'); | ||
const requestedUrl = urlObj.searchParams.get('url'); | ||
|
||
// Reject requests without a URL before logging that anything gets enqueued | ||
if (!requestedUrl) { | ||
log.error('No URL provided as query parameter, returning 400'); | ||
res.writeHead(400, { 'Content-Type': 'application/json' }); | ||
res.end(JSON.stringify({ error: 'No URL provided as query parameter' })); | ||
return; | ||
} | ||
log.info(`HTTP request received for ${requestedUrl}, adding to the queue`); | ||
|
||
// We will add it first to the map and then enqueue it to the crawler that immediately processes it | ||
// uniqueKey must be random so that the same URL can be processed again instead of being deduplicated | ||
const crawleeRequest = { url: requestedUrl, uniqueKey: randomUUID() }; | ||
requestsToResponses.set(crawleeRequest.uniqueKey, res); | ||
await crawler.addRequests([crawleeRequest]); | ||
}); | ||
|
||
// Now we start the server, the crawler and wait for incoming connections | ||
server.listen(3000, () => { | ||
log.info('Server is listening for user requests'); | ||
}); | ||
|
||
await crawler.run(); |