From ada099067878aec3a4016682641b8410e2dcbc74 Mon Sep 17 00:00:00 2001 From: Vlada Dusek Date: Thu, 22 Aug 2024 10:58:14 +0200 Subject: [PATCH] docs: add links to API doc in suitable places (#449) ### Description - doc: add links to API doc in suitable places ### Issues - Closes: #266 ### Testing - Doc website was rendered locally ### Checklist - [x] CI passed --- README.md | 6 +- docs/examples/add-data-to-dataset.mdx | 5 +- docs/examples/beautifulsoup-crawler.mdx | 4 +- .../capture-screenshot-using-playwright.mdx | 6 +- docs/examples/crawl-all-links-on-website.mdx | 3 +- docs/examples/crawl-multiple-urls.mdx | 1 + .../crawl-specific-links-on-website.mdx | 3 +- .../crawl-website-with-relative-links.mdx | 7 ++- .../export-entire-dataset-to-file.mdx | 5 +- docs/examples/parsel-crawler.mdx | 4 +- docs/examples/playwright-crawler.mdx | 6 +- docs/guides/http_clients.mdx | 6 +- docs/guides/proxy_management.mdx | 4 +- docs/introduction/01-setting-up.mdx | 56 ++++++++++++++----- docs/introduction/02-first-crawler.mdx | 34 +++++------ docs/introduction/03-adding-more-urls.mdx | 24 ++++---- docs/introduction/04-real-world-project.mdx | 6 +- docs/introduction/05-crawling.mdx | 14 +++-- docs/introduction/06-scraping.mdx | 2 + docs/introduction/07-saving-data.mdx | 16 +++--- docs/introduction/08-refactoring.mdx | 8 ++- docs/introduction/index.mdx | 2 + docs/quick-start/index.mdx | 11 +++- 23 files changed, 147 insertions(+), 86 deletions(-) diff --git a/README.md b/README.md index baf66e901..aea379d92 100644 --- a/README.md +++ b/README.md @@ -44,7 +44,7 @@ Crawlee is available as the [`crawlee`](https://pypi.org/project/crawlee/) PyPI pip install 'crawlee[all]' ``` -Then, install the Playwright dependencies: +Then, install the [Playwright](https://playwright.dev/) dependencies: ```sh playwright install @@ -84,7 +84,7 @@ Here are some practical examples to help you get started with different types of ### BeautifulSoupCrawler -The `BeautifulSoupCrawler` downloads web pages using an HTTP library and provides HTML-parsed content to the user. It uses [HTTPX](https://pypi.org/project/httpx/) for HTTP communication and [BeautifulSoup](https://pypi.org/project/beautifulsoup4/) for parsing HTML. It is ideal for projects that require efficient extraction of data from HTML content. This crawler has very good performance since it does not use a browser. However, if you need to execute client-side JavaScript, to get your content, this is not going to be enough and you will need to use PlaywrightCrawler. Also if you want to use this crawler, make sure you install `crawlee` with `beautifulsoup` extra. +The [`BeautifulSoupCrawler`](https://crawlee.dev/python/api/class/BeautifulSoupCrawler) downloads web pages using an HTTP library and provides HTML-parsed content to the user. By default it uses [`HttpxHttpClient`](https://crawlee.dev/python/api/class/HttpxHttpClient) for HTTP communication and [BeautifulSoup](https://pypi.org/project/beautifulsoup4/) for parsing HTML. It is ideal for projects that require efficient extraction of data from HTML content. This crawler has very good performance since it does not use a browser. However, if you need to execute client-side JavaScript, to get your content, this is not going to be enough and you will need to use [`PlaywrightCrawler`](https://crawlee.dev/python/api/class/PlaywrightCrawler). Also if you want to use this crawler, make sure you install `crawlee` with `beautifulsoup` extra. 
```python import asyncio @@ -124,7 +124,7 @@ if __name__ == '__main__': ### PlaywrightCrawler -The `PlaywrightCrawler` uses a headless browser to download web pages and provides an API for data extraction. It is built on [Playwright](https://playwright.dev/), an automation library designed for managing headless browsers. It excels at retrieving web pages that rely on client-side JavaScript for content generation, or tasks requiring interaction with JavaScript-driven content. For scenarios where JavaScript execution is unnecessary or higher performance is required, consider using the `BeautifulSoupCrawler`. Also if you want to use this crawler, make sure you install `crawlee` with `playwright` extra. +The [`PlaywrightCrawler`](https://crawlee.dev/python/api/class/PlaywrightCrawler) uses a headless browser to download web pages and provides an API for data extraction. It is built on [Playwright](https://playwright.dev/), an automation library designed for managing headless browsers. It excels at retrieving web pages that rely on client-side JavaScript for content generation, or tasks requiring interaction with JavaScript-driven content. For scenarios where JavaScript execution is unnecessary or higher performance is required, consider using the [`BeautifulSoupCrawler`](https://crawlee.dev/python/api/class/BeautifulSoupCrawler). Also if you want to use this crawler, make sure you install `crawlee` with `playwright` extra. ```python import asyncio diff --git a/docs/examples/add-data-to-dataset.mdx b/docs/examples/add-data-to-dataset.mdx index c55da9443..9f607a3ee 100644 --- a/docs/examples/add-data-to-dataset.mdx +++ b/docs/examples/add-data-to-dataset.mdx @@ -3,10 +3,11 @@ id: add-data-to-dataset title: Add data to dataset --- +import ApiLink from '@site/src/components/ApiLink'; import Tabs from '@theme/Tabs'; import TabItem from '@theme/TabItem'; -This example demonstrates how to store extracted data into datasets using the `context.push_data()` helper function. If the specified dataset does not already exist, it will be created automatically. Additionally, you can save data to custom datasets by providing `dataset_id` or `dataset_name` parameters to the `push_data` method. +This example demonstrates how to store extracted data into datasets using the `context.push_data` helper function. If the specified dataset does not already exist, it will be created automatically. Additionally, you can save data to custom datasets by providing `dataset_id` or `dataset_name` parameters to the `push_data` function. @@ -99,7 +100,7 @@ Each item in the dataset will be stored in its own file within the following dir {PROJECT_FOLDER}/storage/datasets/default/ ``` -For more control, you can also open a dataset manually using the asynchronous constructor `Dataset.open()` and interact with it directly: +For more control, you can also open a dataset manually using the asynchronous constructor `Dataset.open` ```python from crawlee.storages import Dataset diff --git a/docs/examples/beautifulsoup-crawler.mdx b/docs/examples/beautifulsoup-crawler.mdx index 0372ae751..ee8613991 100644 --- a/docs/examples/beautifulsoup-crawler.mdx +++ b/docs/examples/beautifulsoup-crawler.mdx @@ -3,7 +3,9 @@ id: beautifulsoup-crawler title: BeautifulSoup crawler --- -This example demonstrates how to use `BeautifulSoupCrawler` to crawl a list of URLs, load each URL using a plain HTTP request, parse the HTML using the [BeautifulSoup](https://pypi.org/project/beautifulsoup4/) library and extract some data from it - the page title and all `
<h1>`, `<h2>` and `<h3>
` tags. This setup is perfect for scraping specific elements from web pages. Thanks to the well-known BeautifulSoup, you can easily navigate the HTML structure and retrieve the data you need with minimal code. +import ApiLink from '@site/src/components/ApiLink'; + +This example demonstrates how to use `BeautifulSoupCrawler` to crawl a list of URLs, load each URL using a plain HTTP request, parse the HTML using the [BeautifulSoup](https://pypi.org/project/beautifulsoup4/) library and extract some data from it - the page title and all `
<h1>`, `<h2>` and `<h3>
` tags. This setup is perfect for scraping specific elements from web pages. Thanks to the well-known BeautifulSoup, you can easily navigate the HTML structure and retrieve the data you need with minimal code. ```python import asyncio diff --git a/docs/examples/capture-screenshot-using-playwright.mdx b/docs/examples/capture-screenshot-using-playwright.mdx index 93fcd74f4..b22698c46 100644 --- a/docs/examples/capture-screenshot-using-playwright.mdx +++ b/docs/examples/capture-screenshot-using-playwright.mdx @@ -3,9 +3,11 @@ id: capture-screenshots-using-playwright title: Capture screenshots using Playwright --- -This example demonstrates how to capture screenshots of web pages using `PlaywrightCrawler` and store them in the key-value store. +import ApiLink from '@site/src/components/ApiLink'; -The `PlaywrightCrawler` is configured to automate the browsing and interaction with web pages. It uses headless Chromium as the browser type to perform these tasks. Each web page specified in the initial list of URLs is visited sequentially, and a screenshot of the page is captured using Playwright's `page.screenshot()` method. +This example demonstrates how to capture screenshots of web pages using `PlaywrightCrawler` and store them in the key-value store. + +The `PlaywrightCrawler` is configured to automate the browsing and interaction with web pages. It uses headless Chromium as the browser type to perform these tasks. Each web page specified in the initial list of URLs is visited sequentially, and a screenshot of the page is captured using Playwright's `page.screenshot()` method. The captured screenshots are stored in the key-value store, which is suitable for managing and storing files in various formats. In this case, screenshots are stored as PNG images with a unique key generated from the URL of the page. diff --git a/docs/examples/crawl-all-links-on-website.mdx b/docs/examples/crawl-all-links-on-website.mdx index 22d4496c6..543f35b80 100644 --- a/docs/examples/crawl-all-links-on-website.mdx +++ b/docs/examples/crawl-all-links-on-website.mdx @@ -3,10 +3,11 @@ id: crawl-all-links-on-website title: Crawl all links on website --- +import ApiLink from '@site/src/components/ApiLink'; import Tabs from '@theme/Tabs'; import TabItem from '@theme/TabItem'; -This example uses the `enqueue_links()` helper to add new links to the `RequestQueue` as the crawler navigates from page to page. By automatically discovering and enqueuing all links on a given page, the crawler can systematically scrape an entire website. This approach is ideal for web scraping tasks where you need to collect data from multiple interconnected pages. +This example uses the `enqueue_links` helper to add new links to the `RequestQueue` as the crawler navigates from page to page. By automatically discovering and enqueuing all links on a given page, the crawler can systematically scrape an entire website. This approach is ideal for web scraping tasks where you need to collect data from multiple interconnected pages. 
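In practice this is a single helper call inside the request handler. Here is a minimal, illustrative sketch of the pattern with `BeautifulSoupCrawler` (the `PlaywrightCrawler` variant works the same way):

```python
import asyncio

from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext


async def main() -> None:
    # Cap the crawl so a test run stays small.
    crawler = BeautifulSoupCrawler(max_requests_per_crawl=10)

    @crawler.router.default_handler
    async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')
        # Find every link on the page and add it to the crawler's RequestQueue.
        await context.enqueue_links()

    await crawler.run(['https://crawlee.dev'])


if __name__ == '__main__':
    asyncio.run(main())
```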
:::tip diff --git a/docs/examples/crawl-multiple-urls.mdx b/docs/examples/crawl-multiple-urls.mdx index 21e832909..63f6bdafa 100644 --- a/docs/examples/crawl-multiple-urls.mdx +++ b/docs/examples/crawl-multiple-urls.mdx @@ -3,6 +3,7 @@ id: crawl-multiple-urls title: Crawl multiple URLs --- +import ApiLink from '@site/src/components/ApiLink'; import Tabs from '@theme/Tabs'; import TabItem from '@theme/TabItem'; diff --git a/docs/examples/crawl-specific-links-on-website.mdx b/docs/examples/crawl-specific-links-on-website.mdx index c368a7404..717779c84 100644 --- a/docs/examples/crawl-specific-links-on-website.mdx +++ b/docs/examples/crawl-specific-links-on-website.mdx @@ -3,10 +3,11 @@ id: crawl-specific-links-on-website title: Crawl specific links on website --- +import ApiLink from '@site/src/components/ApiLink'; import Tabs from '@theme/Tabs'; import TabItem from '@theme/TabItem'; -This example demonstrates how to crawl a website while targeting specific patterns of links. By utilizing the `enqueue_links()` helper, you can pass `include` or `exclude` parameters to improve your crawling strategy. This approach ensures that only the links matching the specified patterns are added to the `RequestQueue`. Both `include` and `exclude` support lists of globs or regular expressions. This functionality is great for focusing on relevant sections of a website and avoiding scraping unnecessary or irrelevant content. +This example demonstrates how to crawl a website while targeting specific patterns of links. By utilizing the `enqueue_links` helper, you can pass `include` or `exclude` parameters to improve your crawling strategy. This approach ensures that only the links matching the specified patterns are added to the `RequestQueue`. Both `include` and `exclude` support lists of globs or regular expressions. This functionality is great for focusing on relevant sections of a website and avoiding scraping unnecessary or irrelevant content. diff --git a/docs/examples/crawl-website-with-relative-links.mdx b/docs/examples/crawl-website-with-relative-links.mdx index 666c37310..5bc5b4560 100644 --- a/docs/examples/crawl-website-with-relative-links.mdx +++ b/docs/examples/crawl-website-with-relative-links.mdx @@ -3,18 +3,19 @@ id: crawl-website-with-relative-links title: Crawl website with relative links --- +import ApiLink from '@site/src/components/ApiLink'; import Tabs from '@theme/Tabs'; import TabItem from '@theme/TabItem'; -When crawling a website, you may encounter various types of links that you wish to include in your crawl. To facilitate this, we provide the `enqueue_links()` method on the crawler context, which will automatically find and add these links to the crawler's `RequestQueue`. This method simplifies the process of handling different types of links, including relative links, by automatically resolving them based on the page's context. +When crawling a website, you may encounter various types of links that you wish to include in your crawl. To facilitate this, we provide the `enqueue_links` method on the crawler context, which will automatically find and add these links to the crawler's `RequestQueue`. This method simplifies the process of handling different types of links, including relative links, by automatically resolving them based on the page's context. :::note -For these examples, we are using the `BeautifulSoupCrawler`. However, the same method is available for the `PlaywrightCrawler` as well. You can use it in exactly the same way. 
+For these examples, we are using the `BeautifulSoupCrawler`. However, the same method is available for the `PlaywrightCrawler` as well. You can use it in exactly the same way. ::: -We provide four distinct strategies for crawling relative links: +`EnqueueStrategy` enum provides four distinct strategies for crawling relative links: - `EnqueueStrategy.All` - Enqueues all links found, regardless of the domain they point to. This strategy is useful when you want to follow every link, including those that navigate to external websites. - `EnqueueStrategy.SAME_DOMAIN` - Enqueues all links found that share the same domain name, including any possible subdomains. This strategy ensures that all links within the same top-level and base domain are included. diff --git a/docs/examples/export-entire-dataset-to-file.mdx b/docs/examples/export-entire-dataset-to-file.mdx index 827be3e9c..fdc7c4575 100644 --- a/docs/examples/export-entire-dataset-to-file.mdx +++ b/docs/examples/export-entire-dataset-to-file.mdx @@ -3,14 +3,15 @@ id: export-entire-dataset-to-file title: Export entire dataset to file --- +import ApiLink from '@site/src/components/ApiLink'; import Tabs from '@theme/Tabs'; import TabItem from '@theme/TabItem'; -This example demonstrates how to use the `export_data()` method of the crawler to export the entire default dataset to a single file. This method supports exporting data in either CSV or JSON format. +This example demonstrates how to use the `BasicCrawler.export_data` method of the crawler to export the entire default dataset to a single file. This method supports exporting data in either CSV or JSON format. :::note -For these examples, we are using the `BeautifulSoupCrawler`. However, the same method is available for the `PlaywrightCrawler` as well. You can use it in exactly the same way. +For these examples, we are using the `BeautifulSoupCrawler`. However, the same method is available for the `PlaywrightCrawler` as well. You can use it in exactly the same way. ::: diff --git a/docs/examples/parsel-crawler.mdx b/docs/examples/parsel-crawler.mdx index 680f411ad..d102ab931 100644 --- a/docs/examples/parsel-crawler.mdx +++ b/docs/examples/parsel-crawler.mdx @@ -3,7 +3,9 @@ id: parsel-crawler title: Parsel crawler --- -This example shows how to use `ParselCrawler` to crawl a website or a list of URLs. Each URL is loaded using a plain HTTP request and the response is parsed using [Parsel](https://pypi.org/project/parsel/) library which supports CSS and XPath selectors for HTML responses and JMESPath for JSON responses. We can extract data from all kinds of complex HTML structures using XPath. In this example, we will use Parsel to crawl github.com and extract page title, URL and emails found in the webpage. The default handler will scrape data from the current webpage and enqueue all the links found in the webpage for continuous scraping. +import ApiLink from '@site/src/components/ApiLink'; + +This example shows how to use `ParselCrawler` to crawl a website or a list of URLs. Each URL is loaded using a plain HTTP request and the response is parsed using [Parsel](https://pypi.org/project/parsel/) library which supports CSS and XPath selectors for HTML responses and JMESPath for JSON responses. We can extract data from all kinds of complex HTML structures using XPath. In this example, we will use Parsel to crawl github.com and extract page title, URL and emails found in the webpage. 
The default handler will scrape data from the current webpage and enqueue all the links found in the webpage for continuous scraping. ```python diff --git a/docs/examples/playwright-crawler.mdx b/docs/examples/playwright-crawler.mdx index f43d472a9..7e0aeb58a 100644 --- a/docs/examples/playwright-crawler.mdx +++ b/docs/examples/playwright-crawler.mdx @@ -3,9 +3,11 @@ id: playwright-crawler title: Playwright crawler --- -This example demonstrates how to use `PlaywrightCrawler` to recursively scrape the Hacker news website using headless Chromium and Playwright. +import ApiLink from '@site/src/components/ApiLink'; -The `PlaywrightCrawler` manages the browser and page instances, simplifying the process of interacting with web pages. In the request handler, Playwright's API is used to extract data from each post on the page. Specifically, it retrieves the title, rank, and URL of each post. Additionally, the handler enqueues links to the next pages to ensure continuous scraping. This setup is ideal for scraping dynamic web pages where JavaScript execution is required to render the content. +This example demonstrates how to use `PlaywrightCrawler` to recursively scrape the Hacker news website using headless Chromium and Playwright. + +The `PlaywrightCrawler` manages the browser and page instances, simplifying the process of interacting with web pages. In the request handler, Playwright's API is used to extract data from each post on the page. Specifically, it retrieves the title, rank, and URL of each post. Additionally, the handler enqueues links to the next pages to ensure continuous scraping. This setup is ideal for scraping dynamic web pages where JavaScript execution is required to render the content. ```python import asyncio diff --git a/docs/guides/http_clients.mdx b/docs/guides/http_clients.mdx index b9ffe8c65..2e0b0066c 100644 --- a/docs/guides/http_clients.mdx +++ b/docs/guides/http_clients.mdx @@ -16,7 +16,7 @@ HTTP clients are utilized by the HTTP-based crawlers (e.g. `HttpxHttpClient`, which uses the `httpx` library, and `CurlImpersonateHttpClient`, which uses the `curl-cffi` library. You can switch between them by setting the `http_client` parameter in the Crawler class. The default HTTP client is `HttpxHttpClient`. Below are examples of how to set the HTTP client for the `BeautifulSoupCrawler`. +In Crawlee we currently have two HTTP clients: `HttpxHttpClient`, which uses the `httpx` library, and `CurlImpersonateHttpClient`, which uses the `curl-cffi` library. You can switch between them by setting the `http_client` parameter in the Crawler class. The default HTTP client is `HttpxHttpClient`. Below are examples of how to set the HTTP client for the `BeautifulSoupCrawler`. @@ -33,7 +33,7 @@ In Crawlee we currently have two HTTP clients: `HttpxHttpClient` is the default HTTP client, you don't need to install additional packages to use it. If you want to use `CurlImpersonateHttpClient`, you need to install `crawlee` with the `curl-impersonate` extra. ```sh pip install 'crawlee[curl-impersonate]' @@ -47,4 +47,4 @@ pip install 'crawlee[all]' ## How HTTP clients work -We provide an abstract base class, `BaseHttpClient`, which defines the necessary interface for all HTTP clients. HTTP clients are responsible for sending requests and receiving responses, as well as managing cookies, headers, and proxies. They provide methods that are called from crawlers. To implement your own HTTP client, inherit from the `BaseHttpClient` class and implement the required methods. 
+We provide an abstract base class, `BaseHttpClient`, which defines the necessary interface for all HTTP clients. HTTP clients are responsible for sending requests and receiving responses, as well as managing cookies, headers, and proxies. They provide methods that are called from crawlers. To implement your own HTTP client, inherit from the `BaseHttpClient` class and implement the required methods. diff --git a/docs/guides/proxy_management.mdx b/docs/guides/proxy_management.mdx index 87bcd2538..2d42df337 100644 --- a/docs/guides/proxy_management.mdx +++ b/docs/guides/proxy_management.mdx @@ -42,7 +42,7 @@ Examples of how to use our proxy URLs with crawlers are shown below in [Crawler ## Proxy configuration -All our proxy needs are managed by the `ProxyConfiguration` class. We create an instance using the `ProxyConfiguration` constructor function based on the provided options. +All our proxy needs are managed by the `ProxyConfiguration` class. We create an instance using the `ProxyConfiguration` constructor function based on the provided options. ### Crawler integration @@ -107,7 +107,7 @@ In an active tier, Crawlee will alternate between proxies in a round-robin fashi ## Inspecting current proxy in crawlers -The `BeautifulSoupCrawler` and `PlaywrightCrawler` provide access to information about the currently used proxy via the request handler using a `proxy_info` object. This object allows easy access to the proxy URL. +The `BeautifulSoupCrawler` and `PlaywrightCrawler` provide access to information about the currently used proxy via the request handler using a `proxy_info` object. This object allows easy access to the proxy URL. diff --git a/docs/introduction/01-setting-up.mdx b/docs/introduction/01-setting-up.mdx index eb7acf524..734ce78a1 100644 --- a/docs/introduction/01-setting-up.mdx +++ b/docs/introduction/01-setting-up.mdx @@ -3,6 +3,8 @@ id: setting-up title: Setting up --- +import ApiLink from '@site/src/components/ApiLink'; + To run Crawlee on your computer, ensure you meet the following requirements: 1. [Python](https://www.python.org/) 3.9 or higher installed, @@ -20,52 +22,76 @@ pip --version ## Installation -Crawlee is available as the [`crawlee`](https://pypi.org/project/crawlee/) PyPI package. +Crawlee is available as the [`crawlee`](https://pypi.org/project/crawlee/) PyPI package. To install the core package, use: ```sh pip install crawlee ``` -Additional, optional dependencies unlocking more features are shipped as package extras. +After installation, verify that Crawlee is installed correctly by checking its version: + +```sh +python -c 'import crawlee; print(crawlee.__version__)' +``` + +Crawlee offers several optional features through package extras. You can choose to install only the dependencies you need or install everything if you don't mind the package size. 
+ +### Install all features -If you plan to parse HTML and use CSS selectors, install `crawlee` with either the `beautifulsoup` or `parsel` extra: +If you do not care about the package size, install Crawlee with all features: ```sh -pip install 'crawlee[beautifulsoup]' +pip install 'crawlee[all]' ``` +### Installing only specific extras + +Depending on your use case, you may want to install specific extras to enable additional functionality: + +#### BeautifulSoup + +For using the `BeautifulSoupCrawler`, install the `beautifulsoup` extra: + ```sh -pip install 'crawlee[parsel]' +pip install 'crawlee[beautifulsoup]' ``` -If you plan to use a (headless) browser, install `crawlee` with the `playwright` extra: +#### Parsel + +For using the `ParselCrawler`, install the `parsel` extra: ```sh -pip install 'crawlee[playwright]' +pip install 'crawlee[parsel]' ``` -Then, install the Playwright dependencies: +#### Curl impersonate + +For using the `CurlImpersonateHttpClient`, install the `curl-impersonate` extra: ```sh -playwright install +pip install 'crawlee[curl-impersonate]' ``` -You can install multiple extras at once by using a comma as a separator: +#### Playwright + +If you plan to use a (headless) browser with `PlaywrightCrawler`, install Crawlee with the `playwright` extra: ```sh -pip install 'crawlee[beautifulsoup,playwright]' +pip install 'crawlee[playwright]' ``` -Or if you do not care about the package size, you can install everything: +After installing the playwright extra, install the necessary Playwright dependencies: ```sh -pip install 'crawlee[all]' +playwright install ``` -Verify that Crawlee is successfully installed: +### Installing multiple extras + +You can install multiple extras at once by using a comma as a separator: ```sh -python -c 'import crawlee; print(crawlee.__version__)' +pip install 'crawlee[beautifulsoup,curl-impersonate]' ``` ## With Crawlee CLI diff --git a/docs/introduction/02-first-crawler.mdx b/docs/introduction/02-first-crawler.mdx index 9a4538f15..c27421637 100644 --- a/docs/introduction/02-first-crawler.mdx +++ b/docs/introduction/02-first-crawler.mdx @@ -3,15 +3,17 @@ id: first-crawler title: First crawler --- +import ApiLink from '@site/src/components/ApiLink'; + Now, you will build your first crawler. But before you do, let's briefly introduce the Crawlee classes involved in the process. ## How Crawlee works There are 3 main crawler classes available for use in Crawlee. -- `BeautifulSoupCrawler` -- `ParselCrawler` -- `PlaywrightCrawler` +- `BeautifulSoupCrawler` +- `ParselCrawler` +- `PlaywrightCrawler` We'll talk about their differences later. Now, let's talk about what they have in common. @@ -19,15 +21,15 @@ The general idea of each crawler is to go to a web page, open it, do some stuff ### The where - `Request` and `RequestQueue` -All crawlers use instances of the `Request` class to determine where they need to go. Each request may hold a lot of information, but at the very least, it must hold a URL - a web page to open. But having only one URL would not make sense for crawling. Sometimes you have a pre-existing list of your own URLs that you wish to visit, perhaps a thousand. Other times you need to build this list dynamically as you crawl, adding more and more URLs to the list as you progress. Most of the time, you will use both options. +All crawlers use instances of the `Request` class to determine where they need to go. Each request may hold a lot of information, but at the very least, it must hold a URL - a web page to open. 
But having only one URL would not make sense for crawling. Sometimes you have a pre-existing list of your own URLs that you wish to visit, perhaps a thousand. Other times you need to build this list dynamically as you crawl, adding more and more URLs to the list as you progress. Most of the time, you will use both options. -The requests are stored in a `RequestQueue`, a dynamic queue of `Request` instances. You can seed it with start URLs and also add more requests while the crawler is running. This allows the crawler to open one page, extract interesting data, such as links to other pages on the same domain, add them to the queue (called _enqueuing_) and repeat this process to build a queue of virtually unlimited number of URLs. +The requests are stored in a `RequestQueue`, a dynamic queue of `Request` instances. You can seed it with start URLs and also add more requests while the crawler is running. This allows the crawler to open one page, extract interesting data, such as links to other pages on the same domain, add them to the queue (called _enqueuing_) and repeat this process to build a queue of virtually unlimited number of URLs. ### The what - request handler In the request handler you tell the crawler what to do at each and every page it visits. You can use it to handle extraction of data from the page, processing the data, saving it, calling APIs, doing calculations and so on. -The request handler is a user-defined function, invoked automatically by the crawler for each `Request` from the `RequestQueue`. It always receives a single argument - `CrawlingContext`. Its properties change depending on the crawler class used, but it always includes the `request` property, which represents the currently crawled URL and related metadata. +The request handler is a user-defined function, invoked automatically by the crawler for each `Request` from the `RequestQueue`. It always receives a single argument - `BasicCrawlingContext` (or its descendants). Its properties change depending on the crawler class used, but it always includes the `request` property, which represents the currently crawled URL and related metadata. ## Building a crawler @@ -52,21 +54,21 @@ if __name__ == '__main__': asyncio.run(main()) ``` -The `RequestQueue.add_request()` method automatically converts the object with URL string to a `Request` instance. So now you have a `RequestQueue` that holds one request which points to `https://crawlee.dev`. +The `RequestQueue.add_request` method automatically converts the object with URL string to a `Request` instance. So now you have a `RequestQueue` that holds one request which points to `https://crawlee.dev`. :::tip Bulk add requests -The code above is for illustration of the request queue concept. Soon you'll learn about the `Crawler.add_requests()` method which allows you to skip this initialization code, and it also supports adding a large number of requests without blocking. +The code above is for illustration of the request queue concept. Soon you'll learn about the `BasicCrawler.add_requests` method which allows you to skip this initialization code, and it also supports adding a large number of requests without blocking. ::: ### Building a BeautifulSoupCrawler -Crawlee comes with thre main crawler classes: `BeautifulSoupCrawler`, `ParselCrawler`, and `PlaywrightCrawler`. You can read their short descriptions in the [Quick start](../quick-start) lesson. +Crawlee comes with thre main crawler classes: `BeautifulSoupCrawler`, `ParselCrawler`, and `PlaywrightCrawler`. 
You can read their short descriptions in the [Quick start](../quick-start) lesson. -Unless you have a good reason to start with a different one, you should try building a `BeautifulSoupCrawler` first. It is an HTTP crawler with HTTP2 support, anti-blocking features and integrated HTML parser - [BeautifulSoup](https://pypi.org/project/beautifulsoup4/). It's fast, simple, cheap to run and does not require complicated dependencies. The only downside is that it won't work out of the box for websites which require JavaScript rendering. But you might not need JavaScript rendering at all, because many modern websites use server-side rendering. +Unless you have a good reason to start with a different one, you should try building a `BeautifulSoupCrawler` first. It is an HTTP crawler with HTTP2 support, anti-blocking features and integrated HTML parser - [BeautifulSoup](https://pypi.org/project/beautifulsoup4/). It's fast, simple, cheap to run and does not require complicated dependencies. The only downside is that it won't work out of the box for websites which require JavaScript rendering. But you might not need JavaScript rendering at all, because many modern websites use server-side rendering. -Let's continue with the earlier `RequestQueue` example. +Let's continue with the earlier `RequestQueue` example. ```python import asyncio @@ -98,7 +100,7 @@ if __name__ == '__main__': asyncio.run(main()) ``` -When you run the example, you will see the title of https://crawlee.dev printed to the log. What really happens is that `BeautifulSoupCrawler` first makes an HTTP request to `https://crawlee.dev`, then parses the received HTML with BeautifulSoup and makes it available as the `context` argument of the request handler. +When you run the example, you will see the title of https://crawlee.dev printed to the log. What really happens is that `BeautifulSoupCrawler` first makes an HTTP request to `https://crawlee.dev`, then parses the received HTML with BeautifulSoup and makes it available as the `context` argument of the request handler. ```log [__main__] INFO The title of "https://crawlee.dev" is "Crawlee ยท Build reliable crawlers. Fast. | Crawlee". @@ -106,7 +108,7 @@ When you run the example, you will see the title of https://crawlee.dev printed ### Add requests faster -Earlier we mentioned that you'll learn how to use the `Crawler.add_requests()` method to skip the request queue initialization. It's simple. Every crawler has an implicit `RequestQueue` instance, and you can add requests to it with the `Crawler.add_requests()` method. In fact, you can go even further and just use the first parameter of `crawler.run()`! +Earlier we mentioned that you'll learn how to use the `BasicCrawler.add_requests` method to skip the request queue initialization. It's simple. Every crawler has an implicit `RequestQueue` instance, and you can add requests to it with the `BasicCrawler.add_requests` method. In fact, you can go even further and just use the first parameter of `crawler.run()`! ```python import asyncio @@ -129,14 +131,14 @@ if __name__ == '__main__': asyncio.run(main()) ``` -When you run this code, you'll see exactly the same output as with the earlier, longer example. The `RequestQueue` is still there, it's just managed by the crawler automatically. +When you run this code, you'll see exactly the same output as with the earlier, longer example. The `RequestQueue` is still there, it's just managed by the crawler automatically. 
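If you prefer to keep the queue handling explicit, the same run can be sketched with `add_requests` (reusing the `crawler` instance from the example above):

```python
# Add the start URL to the crawler's implicit RequestQueue up front...
await crawler.add_requests(['https://crawlee.dev'])

# ...and then run() simply processes whatever the queue already contains.
await crawler.run()
```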
:::info -This method not only makes the code shorter, it will help with performance too! Internally it calls `RequestQueue.add_requests_batched()` method. It will wait only for the initial batch of 1000 requests to be added to the queue before resolving, which means the processing will start almost instantly. After that, it will continue adding the rest of the requests in the background (again, in batches of 1000 items, once every second). +This method not only makes the code shorter, it will help with performance too! Internally it calls `RequestQueue.add_requests_batched` method. It will wait only for the initial batch of 1000 requests to be added to the queue before resolving, which means the processing will start almost instantly. After that, it will continue adding the rest of the requests in the background (again, in batches of 1000 items, once every second). ::: ## Next steps -Next, you'll learn about crawling links. That means finding new URLs on the pages you crawl and adding them to the `RequestQueue` for the crawler to visit. +Next, you'll learn about crawling links. That means finding new URLs on the pages you crawl and adding them to the `RequestQueue` for the crawler to visit. diff --git a/docs/introduction/03-adding-more-urls.mdx b/docs/introduction/03-adding-more-urls.mdx index cbdc167eb..8f52b40a0 100644 --- a/docs/introduction/03-adding-more-urls.mdx +++ b/docs/introduction/03-adding-more-urls.mdx @@ -3,6 +3,8 @@ id: adding-more-urls title: Adding more URLs --- +import ApiLink from '@site/src/components/ApiLink'; + Previously you've built a very simple crawler that downloads HTML of a single page, reads its title and prints it to the console. This is the original source code: ```python @@ -24,7 +26,7 @@ if __name__ == '__main__': asyncio.run(main()) ``` -Now you'll use the example from the previous section and improve on it. You'll add more URLs to the queue and thanks to that the crawler will keep going, finding new links, enqueuing them into the `RequestQueue` and then scraping them. +Now you'll use the example from the previous section and improve on it. You'll add more URLs to the queue and thanks to that the crawler will keep going, finding new links, enqueuing them into the `RequestQueue` and then scraping them. ## How crawling works @@ -32,21 +34,21 @@ The process is simple: 1. Find new links on the page. 2. Filter only those pointing to the same domain, in this case [crawlee.dev](https://crawlee.dev/). -3. Enqueue (add) them to the `RequestQueue`. +3. Enqueue (add) them to the `RequestQueue`. 4. Visit the newly enqueued links. 5. Repeat the process. -In the following paragraphs you will learn about the `enqueue_links` function which simplifies crawling to a single function call. +In the following paragraphs you will learn about the `enqueue_links` function which simplifies crawling to a single function call. :::tip context awareness -The `enqueue_links` function is context aware. It means that it will read the information about the currently crawled page from the context, and you don't need to explicitly provide any arguments. However, you can specify filtering criteria or an enqueuing strategy if desired. It will find the links and automatically add the links to the running crawler's `RequestQueue`. +The `enqueue_links` function is context aware. It means that it will read the information about the currently crawled page from the context, and you don't need to explicitly provide any arguments. 
However, you can specify filtering criteria or an enqueuing strategy if desired. It will find the links and automatically add the links to the running crawler's `RequestQueue`. ::: ## Limit your crawls -When you're just testing your code or when your crawler could potentially find millions of links, it's very useful to set a maximum limit of crawled pages. The option is called `max_requests_per_crawl`, is available in all crawlers, and you can set it like this: +When you're just testing your code or when your crawler could potentially find millions of links, it's very useful to set a maximum limit of crawled pages. The option is called `max_requests_per_crawl`, is available in all crawlers, and you can set it like this: ```python crawler = BeautifulSoupCrawler(max_requests_per_crawl=20) @@ -62,7 +64,7 @@ There are numerous approaches to finding links to follow when crawling the web. This is a link to Crawlee introduction ``` -Since this is the most common case, it is also the `enqueue_links` default. +Since this is the most common case, it is also the `enqueue_links` default. ```python import asyncio @@ -88,7 +90,7 @@ if __name__ == '__main__': asyncio.run(main()) ``` -If you need to override the default selection of elements in `enqueue_links`, you can use the `selector` argument. +If you need to override the default selection of elements in `enqueue_links`, you can use the `selector` argument. ```python await context.enqueue_links(selector='a.article-link') @@ -104,7 +106,7 @@ Websites typically contain a lot of links that lead away from the original page. await context.enqueue_links() ``` -The default behavior of `enqueue_links` is to stay on the same hostname. This **does not include subdomains**. To include subdomains in your crawl, use the `strategy` argument. +The default behavior of `enqueue_links` is to stay on the same hostname. This **does not include subdomains**. To include subdomains in your crawl, use the `strategy` argument. ```python # See the EnqueueStrategy object for more strategy options. @@ -115,11 +117,11 @@ When you run the code, you will see the crawler log the **title** of the first p ## Skipping duplicate URLs -Skipping of duplicate URLs is critical, because visiting the same page multiple times would lead to duplicate results. This is automatically handled by the `RequestQueue` which deduplicates requests using their `unique_key`. This `unique_key` is automatically generated from the request's URL by lowercasing the URL, lexically ordering query parameters, removing fragments and a few other tweaks that ensure the queue only includes unique URLs. +Skipping of duplicate URLs is critical, because visiting the same page multiple times would lead to duplicate results. This is automatically handled by the `RequestQueue` which deduplicates requests using their `unique_key`. This `unique_key` is automatically generated from the request's URL by lowercasing the URL, lexically ordering query parameters, removing fragments and a few other tweaks that ensure the queue only includes unique URLs. ## Advanced filtering arguments -While the defaults for `enqueue_links` can be often exactly what you need, it also gives you fine-grained control over which URLs should be enqueued. One way we already mentioned above. It is using the `EnqueueStrategy`. You can use the `all` strategy if you want to follow every single link, regardless of its domain, or you can enqueue links that target the same domain name with the `same_domain` strategy. 
+While the defaults for `enqueue_links` can be often exactly what you need, it also gives you fine-grained control over which URLs should be enqueued. One way we already mentioned above. It is using the `EnqueueStrategy`. You can use the `all` strategy if you want to follow every single link, regardless of its domain, or you can enqueue links that target the same domain name with the `same_domain` strategy. ```python # Wanders the internet. @@ -128,7 +130,7 @@ await enqueue_links(strategy='all') ### Filter URLs with patterns -For even more control, you can use the `include` or `exclude` parameters, either as glob patterns or regular expressions, to filter the URLs. Refer to the API documentation for `enqueue_links` for detailed information on these and other available options. +For even more control, you can use the `include` or `exclude` parameters, either as glob patterns or regular expressions, to filter the URLs. Refer to the API documentation for `enqueue_links` for detailed information on these and other available options. ```python from crawlee import Glob diff --git a/docs/introduction/04-real-world-project.mdx b/docs/introduction/04-real-world-project.mdx index fb9c2a9d4..f36026dae 100644 --- a/docs/introduction/04-real-world-project.mdx +++ b/docs/introduction/04-real-world-project.mdx @@ -3,6 +3,8 @@ id: real-world-project title: Real-world project --- +import ApiLink from '@site/src/components/ApiLink'; + > _Hey, guys, you know, it's cool that we can scrape the `` elements of web pages, but that's not very useful. Can we finally scrape some real data and save it somewhere in a machine-readable format? Because that's why I started reading this tutorial in the first place!_ We hear you, young padawan! First, learn how to crawl, you must. Only then, walk through data, you can! @@ -24,12 +26,12 @@ If you're not interested in crawling theory, feel free to [skip to the next chap Sometimes scraping is really straightforward, but most of the time, it really pays off to do a bit of research first and try to answer some of these questions: - How is the website structured? -- Can I scrape it only with HTTP requests (read "with `BeautifulSoupCrawler`")? +- Can I scrape it only with HTTP requests (read "with some <ApiLink to="class/HttpCrawler">`HttpCrawler`</ApiLink>, e.g. <ApiLink to="class/BeautifulSoupCrawler">`BeautifulSoupCrawler`</ApiLink>")? - Do I need a headless browser for something? - Are there any anti-scraping protections in place? - Do I need to parse the HTML or can I get the data otherwise, such as directly from the website's API? -For the purposes of this tutorial, let's assume that the website cannot be scraped with `BeautifulSoupCrawler`. It actually can, but we would have to dive a bit deeper than this introductory guide allows. So for now we will make things easier for you, scrape it with `PlaywrightCrawler`, and you'll learn about headless browsers in the process. +For the purposes of this tutorial, let's assume that the website cannot be scraped with <ApiLink to="class/HttpCrawler">`HttpCrawler`</ApiLink>. It actually can, but we would have to dive a bit deeper than this introductory guide allows. So for now we will make things easier for you, scrape it with <ApiLink to="class/PlaywrightCrawler">`PlaywrightCrawler`</ApiLink>, and you'll learn about headless browsers in the process. 
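As a preview, a minimal `PlaywrightCrawler` scaffold for this project might look like the following sketch; the next chapters fill in the request handler step by step:

```python
import asyncio

from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext


async def main() -> None:
    # Keep the crawl small while experimenting.
    crawler = PlaywrightCrawler(max_requests_per_crawl=50)

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        # Playwright has already rendered the page, client-side JavaScript included.
        context.log.info(f'Processing {context.request.url} ...')

    await crawler.run(['https://warehouse-theme-metal.myshopify.com/collections'])


if __name__ == '__main__':
    asyncio.run(main())
```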
## Choosing the data you need diff --git a/docs/introduction/05-crawling.mdx b/docs/introduction/05-crawling.mdx index 7491dbacb..37aceda1c 100644 --- a/docs/introduction/05-crawling.mdx +++ b/docs/introduction/05-crawling.mdx @@ -3,17 +3,19 @@ id: crawling title: Crawling --- +import ApiLink from '@site/src/components/ApiLink'; + To crawl the whole [Warehouse store example](https://warehouse-theme-metal.myshopify.com/collections) and find all the data, you first need to visit all the pages with products - going through all categories available and also all the product detail pages. ## Crawling the listing pages -In previous lessons, you used the `enqueue_links()` function like this: +In previous lessons, you used the <ApiLink to="class/EnqueueLinksFunction">`enqueue_links`</ApiLink> function like this: ```python await enqueue_links() ``` -While useful in that scenario, you need something different now. Instead of finding all the `<a href="..">` elements with links to the same hostname, you need to find only the specific ones that will take your crawler to the next page of results. Otherwise, the crawler will visit a lot of other pages that you're not interested in. Using the power of DevTools and yet another `enqueue_links()` parameter, this becomes fairly easy. +While useful in that scenario, you need something different now. Instead of finding all the `<a href="..">` elements with links to the same hostname, you need to find only the specific ones that will take your crawler to the next page of results. Otherwise, the crawler will visit a lot of other pages that you're not interested in. Using the power of DevTools and yet another <ApiLink to="class/EnqueueLinksFunction">`enqueue_links`</ApiLink> parameter, this becomes fairly easy. ```python import asyncio @@ -48,13 +50,13 @@ if __name__ == '__main__': The code should look pretty familiar to you. It's a very simple request handler where we log the currently processed URL to the console and enqueue more links. But there are also a few new, interesting additions. Let's break it down. -### The `selector` parameter of `enqueue_links()` +### The `selector` parameter of `enqueue_links` -When you previously used `enqueue_links()`, you were not providing any `selector` parameter, and it was fine, because you wanted to use the default value, which is `a` - finds all `<a>` elements. But now, you need to be more specific. There are multiple `<a>` links on the `Categories` page, and you're only interested in those that will take your crawler to the available list of results. Using the DevTools, you'll find that you can select the links you need using the `.collection-block-item` selector, which selects all the elements that have the `class=collection-block-item` attribute. +When you previously used <ApiLink to="class/EnqueueLinksFunction">`enqueue_links`</ApiLink>, you were not providing any `selector` parameter, and it was fine, because you wanted to use the default value, which is `a` - finds all `<a>` elements. But now, you need to be more specific. There are multiple `<a>` links on the `Categories` page, and you're only interested in those that will take your crawler to the available list of results. Using the DevTools, you'll find that you can select the links you need using the `.collection-block-item` selector, which selects all the elements that have the `class=collection-block-item` attribute. 
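In code, that selector is passed straight to the helper - for example (a sketch; the `label` argument is explained in the next section):

```python
# Enqueue only the category links matched by the CSS selector and tag them
# so the router can recognize these requests later.
await context.enqueue_links(
    selector='.collection-block-item',
    label='CATEGORY',
)
```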
-### The `label` of `enqueue_links()` +### The `label` of `enqueue_links` -You will see `label` used often throughout Crawlee, as it's a convenient way of labelling a `Request` instance for quick identification later. You can access it with `request.label` and it's a `string`. You can name your requests any way you want. Here, we used the label `CATEGORY` to note that we're enqueueing pages that represent a category of products. The `enqueue_links()` function will add this label to all requests before enqueueing them to the `RequestQueue`. Why this is useful will become obvious in a minute. +You will see `label` used often throughout Crawlee, as it's a convenient way of labelling a <ApiLink to="class/Request">`Request`</ApiLink> instance for quick identification later. You can access it with `request.label` and it's a `string`. You can name your requests any way you want. Here, we used the label `CATEGORY` to note that we're enqueueing pages that represent a category of products. The <ApiLink to="class/EnqueueLinksFunction">`enqueue_links`</ApiLink> function will add this label to all requests before enqueueing them to the <ApiLink to="class/RequestQueue">`RequestQueue`</ApiLink>. Why this is useful will become obvious in a minute. ## Crawling the detail pages diff --git a/docs/introduction/06-scraping.mdx b/docs/introduction/06-scraping.mdx index 93cd64fdd..c37b0d48d 100644 --- a/docs/introduction/06-scraping.mdx +++ b/docs/introduction/06-scraping.mdx @@ -3,6 +3,8 @@ id: scraping title: Scraping --- +import ApiLink from '@site/src/components/ApiLink'; + In the [Real-world project](./real-world-project#choosing-the-data-you-need) chapter, you've created a list of the information you wanted to collect about the products in the example Warehouse store. Let's review that and figure out ways to access the data. - URL diff --git a/docs/introduction/07-saving-data.mdx b/docs/introduction/07-saving-data.mdx index d81b567cd..ca16ca735 100644 --- a/docs/introduction/07-saving-data.mdx +++ b/docs/introduction/07-saving-data.mdx @@ -3,11 +3,13 @@ id: saving-data title: Saving data --- +import ApiLink from '@site/src/components/ApiLink'; + A data extraction job would not be complete without saving the data for later use and processing. You've come to the final and most difficult part of this tutorial so make sure to pay attention very carefully! ## Save data to the dataset -Crawlee provides a `Dataset` class, which acts as an abstraction over tabular storage, making it useful for storing scraping results. First, add the following import to the top of your file: +Crawlee provides a <ApiLink to="class/Dataset">`Dataset`</ApiLink> class, which acts as an abstraction over tabular storage, making it useful for storing scraping results. First, add the following import to the top of your file: ```python from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext @@ -16,7 +18,7 @@ from crawlee.storages.dataset import Dataset # ... ``` -Next, under the section where you create an instance of your crawler, create an instance of the dataset using the asynchronous constructor `open()`: +Next, under the section where you create an instance of your crawler, create an instance of the dataset using the asynchronous constructor <ApiLink to="class/Dataset#open">`Dataset.open`</ApiLink>: ```python # ... 
@@ -53,7 +55,7 @@ Finally, instead of logging the extracted data to stdout, we can export them to ### Using a context helper -Instead of importing a new class and manually creating an instance of the dataset, you can use the context helper `push_data`. Remove the dataset import and instantiation, and replace `dataset.push_data` with the following: +Instead of importing a new class and manually creating an instance of the dataset, you can use the context helper <ApiLink to="class/PushDataFunction">`context.push_data`</ApiLink>. Remove the dataset import and instantiation, and replace `dataset.push_data` with the following: ```python # ... @@ -178,11 +180,11 @@ if __name__ == '__main__': ## What `push_data` does? -A helper `context.push_data()` saves data to the default dataset. You can provide additional arguments there like `id` or `name` to open a different dataset. Dataset is a storage designed to hold data in a format similar to a table. Each time you call `context.push_data()` or direct `Dataset.push_data()` a new row in the table is created, with the property names serving as column titles. In the default configuration, the rows are represented as JSON files saved on your file system, but other backend storage systems can be plugged into Crawlee as well. More on that later. +A helper <ApiLink to="class/PushDataFunction">`context.push_data`</ApiLink> saves data to the default dataset. You can provide additional arguments there like `id` or `name` to open a different dataset. Dataset is a storage designed to hold data in a format similar to a table. Each time you call <ApiLink to="class/PushDataFunction">`context.push_data`</ApiLink> or direct <ApiLink to="class/Dataset#push_data">`Dataset.push_data`</ApiLink> a new row in the table is created, with the property names serving as column titles. In the default configuration, the rows are represented as JSON files saved on your file system, but other backend storage systems can be plugged into Crawlee as well. More on that later. :::info Automatic dataset initialization -Each time you start Crawlee a default `Dataset` is automatically created, so there's no need to initialize it or create an instance first. You can create as many datasets as you want and even give them names. For more details see the `Dataset.open()` function. +Each time you start Crawlee a default <ApiLink to="class/Dataset">`Dataset`</ApiLink> is automatically created, so there's no need to initialize it or create an instance first. You can create as many datasets as you want and even give them names. For more details see the <ApiLink to="class/Dataset#open">`Dataset.open`</ApiLink> function. ::: @@ -190,7 +192,7 @@ Each time you start Crawlee a default `Dataset` is automatically created, so the :::info Automatic dataset initialization -Each time you start Crawlee a default `Dataset` is automatically created, so there's no need to initialize it or create an instance first. You can create as many datasets as you want and even give them names. For more details see the [Result storage guide](../guides/result-storage#dataset) and the `Dataset.open()` function. +Each time you start Crawlee a default <ApiLink to="class/Dataset">`Dataset`</ApiLink> is automatically created, so there's no need to initialize it or create an instance first. You can create as many datasets as you want and even give them names. For more details see the [Result storage guide](../guides/result-storage#dataset) and the `Dataset.open()` function. 
::: */} @@ -203,7 +205,7 @@ Unless you changed the configuration that Crawlee uses locally, which would sugg {PROJECT_FOLDER}/storage/datasets/default/ ``` -The above folder will hold all your saved data in numbered files, as they were pushed into the dataset. Each file represents one invocation of `Dataset.push_data()` or one table row. +The above folder will hold all your saved data in numbered files, as they were pushed into the dataset. Each file represents one invocation of <ApiLink to="class/Dataset#push_data">`Dataset.push_data`</ApiLink> or one table row. {/* TODO: add mention of "Result storage guide" once it's ready: diff --git a/docs/introduction/08-refactoring.mdx b/docs/introduction/08-refactoring.mdx index 1cea554d3..1835c5ddf 100644 --- a/docs/introduction/08-refactoring.mdx +++ b/docs/introduction/08-refactoring.mdx @@ -3,6 +3,8 @@ id: refactoring title: Refactoring --- +import ApiLink from '@site/src/components/ApiLink'; + It may seem that the data is extracted and the crawler is done, but honestly, this is just the beginning. For the sake of brevity, we've completely omitted error handling, proxies, logging, architecture, tests, documentation and other stuff that a reliable software should have. The good thing is, error handling is mostly done by Crawlee itself, so no worries on that front, unless you need some custom magic. :::info Navigating automatic bot-protextion avoidance @@ -18,14 +20,14 @@ However, the default configuration, while powerful, may not cover every scenario If you want to learn more, browse the [Avoid getting blocked](../guides/avoid-blocking), [Proxy management](../guides/proxy-management) and [Session management](../guides/session-management) guides. */} -To promote good coding practices, let's look at how you can use a `Router` class to better structure your crawler code. +To promote good coding practices, let's look at how you can use a <ApiLink to="class/Router">`Router`</ApiLink> class to better structure your crawler code. ## Request routing In the following code, we've made several changes: - Split the code into multiple files. -- Added custom instance of `Router` to make our routing cleaner, without if clauses. +- Added custom instance of <ApiLink to="class/Router">`Router`</ApiLink> to make our routing cleaner, without if clauses. - Moved route definitions to a separate `routes.py` file. - Simplified the `main.py` file to focus on the general structure of the crawler. @@ -135,7 +137,7 @@ if __name__ == '__main__': asyncio.run(main()) ``` -By structuring your code this way, you achieve better separation of concerns, making the code easier to read, manage and extend. The `Router` class keeps your routing logic clean and modular, replacing if clauses with function decorators. +By structuring your code this way, you achieve better separation of concerns, making the code easier to read, manage and extend. The <ApiLink to="class/Router">`Router`</ApiLink> class keeps your routing logic clean and modular, replacing if clauses with function decorators. ## Summary diff --git a/docs/introduction/index.mdx b/docs/introduction/index.mdx index f57fc5347..dd322cca5 100644 --- a/docs/introduction/index.mdx +++ b/docs/introduction/index.mdx @@ -3,6 +3,8 @@ id: introduction title: Introduction --- +import ApiLink from '@site/src/components/ApiLink'; + Crawlee covers your crawling and scraping end-to-end and helps you **build reliable scrapers. 
Fast.** Your crawlers will appear human-like and fly under the radar of modern bot protections even with the default configuration. Crawlee gives you the tools to crawl the web for links, scrape data and persistently store it in machine-readable formats, without having to worry about the technical details. And thanks to rich configuration options, you can tweak almost any aspect of Crawlee to suit your project's needs if the default settings don't cut it. diff --git a/docs/quick-start/index.mdx b/docs/quick-start/index.mdx index 862d43e3a..28811b0fb 100644 --- a/docs/quick-start/index.mdx +++ b/docs/quick-start/index.mdx @@ -3,6 +3,7 @@ id: quick-start title: Quick start --- +import ApiLink from '@site/src/components/ApiLink'; import Tabs from '@theme/Tabs'; import TabItem from '@theme/TabItem'; @@ -10,15 +11,19 @@ This short tutorial will help you start scraping with Crawlee in just a minute o ## Choose your crawler -Crawlee offers two main crawler classes: `BeautifulSoupCrawler`, and `PlaywrightCrawler`. All crawlers share the same interface, providing maximum flexibility when switching between them. +Crawlee offers the following main crawler classes: <ApiLink to="class/BeautifulSoupCrawler">`BeautifulSoupCrawler`</ApiLink>, <ApiLink to="class/ParselCrawler">`ParselCrawler`</ApiLink>, and <ApiLink to="class/PlaywrightCrawler">`PlaywrightCrawler`</ApiLink>. All crawlers share the same interface, providing maximum flexibility when switching between them. ### BeautifulSoupCrawler -The `BeautifulSoupCrawler` is a plain HTTP crawler that parses HTML using the well-known [BeautifulSoup](https://pypi.org/project/beautifulsoup4/) library. It crawls the web using an HTTP client that mimics a browser. This crawler is very fast and efficient but cannot handle JavaScript rendering. +The <ApiLink to="class/BeautifulSoupCrawler">`BeautifulSoupCrawler`</ApiLink> is a plain HTTP crawler that parses HTML using the well-known [BeautifulSoup](https://pypi.org/project/beautifulsoup4/) library. It crawls the web using an HTTP client that mimics a browser. This crawler is very fast and efficient but cannot handle JavaScript rendering. + +### ParselCrawler + +The <ApiLink to="class/ParselCrawler">`ParselCrawler`</ApiLink> is similar to the <ApiLink to="class/BeautifulSoupCrawler">`BeautifulSoupCrawler`</ApiLink> but uses the [Parsel](https://pypi.org/project/parsel/) library for HTML parsing. Parsel is a lightweight library that provides a CSS selector-based API for extracting data from HTML documents. If you are familiar with the [Scrapy](https://scrapy.org/) framework, you will feel right at home with Parsel. As with the <ApiLink to="class/BeautifulSoupCrawler">`BeautifulSoupCrawler`</ApiLink>, the <ApiLink to="class/ParselCrawler">`ParselCrawler`</ApiLink> cannot handle JavaScript rendering. ### PlaywrightCrawler -The `PlaywrightCrawler` uses a headless browser controlled by the [Playwright](https://playwright.dev/) library. It can manage Chromium, Firefox, Webkit, and other browsers. Playwright is the successor to the [Puppeteer](https://pptr.dev/) library and is becoming the de facto standard in headless browser automation. If you need a headless browser, choose Playwright. +The <ApiLink to="class/PlaywrightCrawler">`PlaywrightCrawler`</ApiLink> uses a headless browser controlled by the [Playwright](https://playwright.dev/) library. It can manage Chromium, Firefox, Webkit, and other browsers. 
Playwright is the successor to the [Puppeteer](https://pptr.dev/) library and is becoming the de facto standard in headless browser automation. If you need a headless browser, choose Playwright. :::caution before you start