docs: add links to API doc in suitable places (#449)
### Description

- doc: add links to API doc in suitable places

### Issues

- Closes: #266

### Testing

- Doc website was rendered locally

### Checklist

- [x] CI passed
vdusek authored Aug 22, 2024
1 parent ecfe491 commit ada0990
Showing 23 changed files with 147 additions and 86 deletions.
6 changes: 3 additions & 3 deletions README.md
@@ -44,7 +44,7 @@ Crawlee is available as the [`crawlee`](https://pypi.org/project/crawlee/) PyPI
pip install 'crawlee[all]'
```

Then, install the Playwright dependencies:
Then, install the [Playwright](https://playwright.dev/) dependencies:

```sh
playwright install
@@ -84,7 +84,7 @@ Here are some practical examples to help you get started with different types of

### BeautifulSoupCrawler

The `BeautifulSoupCrawler` downloads web pages using an HTTP library and provides HTML-parsed content to the user. It uses [HTTPX](https://pypi.org/project/httpx/) for HTTP communication and [BeautifulSoup](https://pypi.org/project/beautifulsoup4/) for parsing HTML. It is ideal for projects that require efficient extraction of data from HTML content. This crawler has very good performance since it does not use a browser. However, if you need to execute client-side JavaScript, to get your content, this is not going to be enough and you will need to use PlaywrightCrawler. Also if you want to use this crawler, make sure you install `crawlee` with `beautifulsoup` extra.
The [`BeautifulSoupCrawler`](https://crawlee.dev/python/api/class/BeautifulSoupCrawler) downloads web pages using an HTTP library and provides HTML-parsed content to the user. By default it uses [`HttpxHttpClient`](https://crawlee.dev/python/api/class/HttpxHttpClient) for HTTP communication and [BeautifulSoup](https://pypi.org/project/beautifulsoup4/) for parsing HTML. It is ideal for projects that require efficient extraction of data from HTML content. This crawler has very good performance since it does not use a browser. However, if you need to execute client-side JavaScript to get your content, this crawler will not be enough and you will need to use [`PlaywrightCrawler`](https://crawlee.dev/python/api/class/PlaywrightCrawler). Also, if you want to use this crawler, make sure you install `crawlee` with the `beautifulsoup` extra.

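A minimal sketch of such a crawler, assuming the `crawlee.beautifulsoup_crawler` module layout (import paths and constructor parameters may differ between versions):

```python
import asyncio

from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext


async def main() -> None:
    # Keep the example small; remove the limit for a full crawl.
    crawler = BeautifulSoupCrawler(max_requests_per_crawl=10)

    # The default handler is invoked for every crawled page.
    @crawler.router.default_handler
    async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')
        # Extract data from the parsed HTML and store it in the dataset.
        await context.push_data({
            'url': context.request.url,
            'title': context.soup.title.string if context.soup.title else None,
        })

    await crawler.run(['https://crawlee.dev'])


if __name__ == '__main__':
    asyncio.run(main())
```
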
```python
import asyncio
@@ -124,7 +124,7 @@ if __name__ == '__main__':

### PlaywrightCrawler

The `PlaywrightCrawler` uses a headless browser to download web pages and provides an API for data extraction. It is built on [Playwright](https://playwright.dev/), an automation library designed for managing headless browsers. It excels at retrieving web pages that rely on client-side JavaScript for content generation, or tasks requiring interaction with JavaScript-driven content. For scenarios where JavaScript execution is unnecessary or higher performance is required, consider using the `BeautifulSoupCrawler`. Also if you want to use this crawler, make sure you install `crawlee` with `playwright` extra.
The [`PlaywrightCrawler`](https://crawlee.dev/python/api/class/PlaywrightCrawler) uses a headless browser to download web pages and provides an API for data extraction. It is built on [Playwright](https://playwright.dev/), an automation library designed for managing headless browsers. It excels at retrieving web pages that rely on client-side JavaScript for content generation, or tasks requiring interaction with JavaScript-driven content. For scenarios where JavaScript execution is unnecessary or higher performance is required, consider using the [`BeautifulSoupCrawler`](https://crawlee.dev/python/api/class/BeautifulSoupCrawler). Also, if you want to use this crawler, make sure you install `crawlee` with the `playwright` extra.

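A comparable sketch for the browser-based crawler, under the same assumptions about import paths (and assuming the Playwright browsers were installed via `playwright install`):

```python
import asyncio

from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext


async def main() -> None:
    # Headless Chromium is used under the hood.
    crawler = PlaywrightCrawler(max_requests_per_crawl=10, headless=True)

    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')
        # The Playwright page is available for JavaScript-rendered content.
        await context.push_data({
            'url': context.request.url,
            'title': await context.page.title(),
        })

    await crawler.run(['https://crawlee.dev'])


if __name__ == '__main__':
    asyncio.run(main())
```
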
```python
import asyncio
5 changes: 3 additions & 2 deletions docs/examples/add-data-to-dataset.mdx
@@ -3,10 +3,11 @@ id: add-data-to-dataset
title: Add data to dataset
---

import ApiLink from '@site/src/components/ApiLink';
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';

This example demonstrates how to store extracted data into datasets using the `context.push_data()` helper function. If the specified dataset does not already exist, it will be created automatically. Additionally, you can save data to custom datasets by providing `dataset_id` or `dataset_name` parameters to the `push_data` method.
This example demonstrates how to store extracted data into datasets using the <ApiLink to="class/PushDataFunction#open">`context.push_data`</ApiLink> helper function. If the specified dataset does not already exist, it will be created automatically. Additionally, you can save data to custom datasets by providing `dataset_id` or `dataset_name` parameters to the <ApiLink to="class/PushDataFunction#open">`push_data`</ApiLink> function.

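For orientation, here is a sketch of a request handler that pushes the same record to the default dataset and to a named one; it assumes the `dataset_name` keyword described above and the BeautifulSoup-based context type:

```python
from crawlee.beautifulsoup_crawler import BeautifulSoupCrawlingContext


async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
    data = {
        'url': context.request.url,
        'title': context.soup.title.string if context.soup.title else None,
    }
    # Stored in the default dataset (created automatically if missing).
    await context.push_data(data)
    # Stored in a custom dataset identified by its name.
    await context.push_data(data, dataset_name='my-dataset')
```
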
<Tabs groupId="main">
<TabItem value="BeautifulSoupCrawler" label="BeautifulSoupCrawler">
@@ -99,7 +100,7 @@ Each item in the dataset will be stored in its own file within the following dir
{PROJECT_FOLDER}/storage/datasets/default/
```

For more control, you can also open a dataset manually using the asynchronous constructor `Dataset.open()` and interact with it directly:
For more control, you can also open a dataset manually using the asynchronous constructor <ApiLink to="class/Dataset#open">`Dataset.open`</ApiLink> and interact with it directly:

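A brief sketch of the manual approach, assuming a named dataset (the exact keyword arguments may vary slightly between versions):

```python
import asyncio

from crawlee.storages import Dataset


async def main() -> None:
    # Open (or lazily create) a named dataset and write to it directly.
    dataset = await Dataset.open(name='my-dataset')
    await dataset.push_data({'title': 'Example', 'url': 'https://crawlee.dev'})


if __name__ == '__main__':
    asyncio.run(main())
```
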
```python
from crawlee.storages import Dataset
4 changes: 3 additions & 1 deletion docs/examples/beautifulsoup-crawler.mdx
@@ -3,7 +3,9 @@ id: beautifulsoup-crawler
title: BeautifulSoup crawler
---

This example demonstrates how to use `BeautifulSoupCrawler` to crawl a list of URLs, load each URL using a plain HTTP request, parse the HTML using the [BeautifulSoup](https://pypi.org/project/beautifulsoup4/) library and extract some data from it - the page title and all `<h1>`, `<h2>` and `<h3>` tags. This setup is perfect for scraping specific elements from web pages. Thanks to the well-known BeautifulSoup, you can easily navigate the HTML structure and retrieve the data you need with minimal code.
import ApiLink from '@site/src/components/ApiLink';

This example demonstrates how to use <ApiLink to="class/BeautifulSoupCrawler">`BeautifulSoupCrawler`</ApiLink> to crawl a list of URLs, load each URL using a plain HTTP request, parse the HTML using the [BeautifulSoup](https://pypi.org/project/beautifulsoup4/) library and extract some data from it: the page title and all `<h1>`, `<h2>` and `<h3>` tags. This setup is perfect for scraping specific elements from web pages. Thanks to the well-known BeautifulSoup library, you can easily navigate the HTML structure and retrieve the data you need with minimal code.

```python
import asyncio
6 changes: 4 additions & 2 deletions docs/examples/capture-screenshot-using-playwright.mdx
@@ -3,9 +3,11 @@ id: capture-screenshots-using-playwright
title: Capture screenshots using Playwright
---

This example demonstrates how to capture screenshots of web pages using `PlaywrightCrawler` and store them in the key-value store.
import ApiLink from '@site/src/components/ApiLink';

The `PlaywrightCrawler` is configured to automate the browsing and interaction with web pages. It uses headless Chromium as the browser type to perform these tasks. Each web page specified in the initial list of URLs is visited sequentially, and a screenshot of the page is captured using Playwright's `page.screenshot()` method.
This example demonstrates how to capture screenshots of web pages using <ApiLink to="class/PlaywrightCrawler">`PlaywrightCrawler`</ApiLink> and store them in the key-value store.

The <ApiLink to="class/PlaywrightCrawler">`PlaywrightCrawler`</ApiLink> is configured to automate the browsing and interaction with web pages. It uses headless Chromium as the browser type to perform these tasks. Each web page specified in the initial list of URLs is visited sequentially, and a screenshot of the page is captured using Playwright's `page.screenshot()` method.

The captured screenshots are stored in the key-value store, which is suitable for managing and storing files in various formats. In this case, screenshots are stored as PNG images with a unique key generated from the URL of the page.

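A sketch of the core of such a request handler is below; the key-derivation scheme is illustrative only, and the exact keyword arguments of `set_value` may differ between versions:

```python
from crawlee.playwright_crawler import PlaywrightCrawlingContext
from crawlee.storages import KeyValueStore


async def request_handler(context: PlaywrightCrawlingContext) -> None:
    # Capture the rendered page as a PNG using Playwright.
    screenshot = await context.page.screenshot(full_page=True)

    # Derive a store key from the URL (illustrative scheme).
    key = context.request.url.replace('https://', '').replace('/', '_')

    # Persist the image in the default key-value store.
    kvs = await KeyValueStore.open()
    await kvs.set_value(key=key, value=screenshot, content_type='image/png')
```
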
3 changes: 2 additions & 1 deletion docs/examples/crawl-all-links-on-website.mdx
@@ -3,10 +3,11 @@ id: crawl-all-links-on-website
title: Crawl all links on website
---

import ApiLink from '@site/src/components/ApiLink';
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';

This example uses the `enqueue_links()` helper to add new links to the `RequestQueue` as the crawler navigates from page to page. By automatically discovering and enqueuing all links on a given page, the crawler can systematically scrape an entire website. This approach is ideal for web scraping tasks where you need to collect data from multiple interconnected pages.
This example uses the <ApiLink to="class/EnqueueLinksFunction">`enqueue_links`</ApiLink> helper to add new links to the <ApiLink to="class/RequestQueue">`RequestQueue`</ApiLink> as the crawler navigates from page to page. By automatically discovering and enqueuing all links on a given page, the crawler can systematically scrape an entire website. This approach is ideal for web scraping tasks where you need to collect data from multiple interconnected pages.

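A sketch of the pattern, shown with the BeautifulSoup-based crawler (the Playwright variant is analogous; import paths may differ between versions):

```python
import asyncio

from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext


async def main() -> None:
    crawler = BeautifulSoupCrawler(max_requests_per_crawl=50)

    @crawler.router.default_handler
    async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')
        # Discover all links on the current page and enqueue them.
        await context.enqueue_links()

    await crawler.run(['https://crawlee.dev'])


if __name__ == '__main__':
    asyncio.run(main())
```
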
:::tip

1 change: 1 addition & 0 deletions docs/examples/crawl-multiple-urls.mdx
@@ -3,6 +3,7 @@ id: crawl-multiple-urls
title: Crawl multiple URLs
---

import ApiLink from '@site/src/components/ApiLink';
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';

3 changes: 2 additions & 1 deletion docs/examples/crawl-specific-links-on-website.mdx
@@ -3,10 +3,11 @@ id: crawl-specific-links-on-website
title: Crawl specific links on website
---

import ApiLink from '@site/src/components/ApiLink';
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';

This example demonstrates how to crawl a website while targeting specific patterns of links. By utilizing the `enqueue_links()` helper, you can pass `include` or `exclude` parameters to improve your crawling strategy. This approach ensures that only the links matching the specified patterns are added to the `RequestQueue`. Both `include` and `exclude` support lists of globs or regular expressions. This functionality is great for focusing on relevant sections of a website and avoiding scraping unnecessary or irrelevant content.
This example demonstrates how to crawl a website while targeting specific patterns of links. By utilizing the <ApiLink to="class/EnqueueLinksFunction">`enqueue_links`</ApiLink> helper, you can pass `include` or `exclude` parameters to improve your crawling strategy. This approach ensures that only the links matching the specified patterns are added to the <ApiLink to="class/RequestQueue">`RequestQueue`</ApiLink>. Both `include` and `exclude` support lists of globs or regular expressions. This functionality is great for focusing on relevant sections of a website and avoiding scraping unnecessary or irrelevant content.

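A sketch of a handler passing glob patterns; the concrete patterns are illustrative only, and `Glob` is assumed to be importable from the package root:

```python
from crawlee import Glob
from crawlee.beautifulsoup_crawler import BeautifulSoupCrawlingContext


async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
    # Follow only documentation links, and skip API-reference pages.
    await context.enqueue_links(
        include=[Glob('https://crawlee.dev/docs/**')],
        exclude=[Glob('https://crawlee.dev/**/api/**')],
    )
```
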
<Tabs groupId="main">
<TabItem value="BeautifulSoupCrawler" label="BeautifulSoupCrawler">
7 changes: 4 additions & 3 deletions docs/examples/crawl-website-with-relative-links.mdx
@@ -3,18 +3,19 @@ id: crawl-website-with-relative-links
title: Crawl website with relative links
---

import ApiLink from '@site/src/components/ApiLink';
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';

When crawling a website, you may encounter various types of links that you wish to include in your crawl. To facilitate this, we provide the `enqueue_links()` method on the crawler context, which will automatically find and add these links to the crawler's `RequestQueue`. This method simplifies the process of handling different types of links, including relative links, by automatically resolving them based on the page's context.
When crawling a website, you may encounter various types of links that you wish to include in your crawl. To facilitate this, we provide the <ApiLink to="class/EnqueueLinksFunction">`enqueue_links`</ApiLink> method on the crawler context, which will automatically find and add these links to the crawler's <ApiLink to="class/RequestQueue">`RequestQueue`</ApiLink>. This method simplifies the process of handling different types of links, including relative links, by automatically resolving them based on the page's context.

:::note

For these examples, we are using the `BeautifulSoupCrawler`. However, the same method is available for the `PlaywrightCrawler` as well. You can use it in exactly the same way.
For these examples, we are using the <ApiLink to="class/BeautifulSoupCrawler">`BeautifulSoupCrawler`</ApiLink>. However, the same method is available for the <ApiLink to="class/PlaywrightCrawler">`PlaywrightCrawler`</ApiLink> as well. You can use it in exactly the same way.

:::

We provide four distinct strategies for crawling relative links:
The <ApiLink to="enum/EnqueueStrategy">`EnqueueStrategy`</ApiLink> enum provides four distinct strategies for crawling relative links (a usage sketch follows the list below):

- `EnqueueStrategy.ALL` - Enqueues all links found, regardless of the domain they point to. This strategy is useful when you want to follow every link, including those that navigate to external websites.
- `EnqueueStrategy.SAME_DOMAIN` - Enqueues all links found that share the same domain name, including any possible subdomains. This strategy ensures that all links within the same top-level and base domain are included.
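A sketch of selecting one of these strategies when enqueuing; the enum is assumed to be importable from the package root, which may differ between versions:

```python
from crawlee import EnqueueStrategy
from crawlee.beautifulsoup_crawler import BeautifulSoupCrawlingContext


async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
    # Follow only links that stay on the same domain, subdomains included.
    await context.enqueue_links(strategy=EnqueueStrategy.SAME_DOMAIN)
```
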
5 changes: 3 additions & 2 deletions docs/examples/export-entire-dataset-to-file.mdx
@@ -3,14 +3,15 @@ id: export-entire-dataset-to-file
title: Export entire dataset to file
---

import ApiLink from '@site/src/components/ApiLink';
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';

This example demonstrates how to use the `export_data()` method of the crawler to export the entire default dataset to a single file. This method supports exporting data in either CSV or JSON format.
This example demonstrates how to use the <ApiLink to="class/BasicCrawler#export_data">`BasicCrawler.export_data`</ApiLink> method of the crawler to export the entire default dataset to a single file. This method supports exporting data in either CSV or JSON format.

:::note

For these examples, we are using the `BeautifulSoupCrawler`. However, the same method is available for the `PlaywrightCrawler` as well. You can use it in exactly the same way.
For these examples, we are using the <ApiLink to="class/BeautifulSoupCrawler">`BeautifulSoupCrawler`</ApiLink>. However, the same method is available for the <ApiLink to="class/PlaywrightCrawler">`PlaywrightCrawler`</ApiLink> as well. You can use it in exactly the same way.

:::
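
A sketch of the call after a crawl finishes, assuming the output format is inferred from the file extension:

```python
import asyncio

from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler


async def main() -> None:
    crawler = BeautifulSoupCrawler()
    # ... register request handlers here ...
    await crawler.run(['https://crawlee.dev'])

    # Export the entire default dataset to a single file.
    await crawler.export_data('results.json')
    await crawler.export_data('results.csv')


if __name__ == '__main__':
    asyncio.run(main())
```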

4 changes: 3 additions & 1 deletion docs/examples/parsel-crawler.mdx
@@ -3,7 +3,9 @@ id: parsel-crawler
title: Parsel crawler
---

This example shows how to use `ParselCrawler` to crawl a website or a list of URLs. Each URL is loaded using a plain HTTP request and the response is parsed using [Parsel](https://pypi.org/project/parsel/) library which supports CSS and XPath selectors for HTML responses and JMESPath for JSON responses. We can extract data from all kinds of complex HTML structures using XPath. In this example, we will use Parsel to crawl github.com and extract page title, URL and emails found in the webpage. The default handler will scrape data from the current webpage and enqueue all the links found in the webpage for continuous scraping.
import ApiLink from '@site/src/components/ApiLink';

This example shows how to use <ApiLink to="class/ParselCrawler">`ParselCrawler`</ApiLink> to crawl a website or a list of URLs. Each URL is loaded using a plain HTTP request and the response is parsed using the [Parsel](https://pypi.org/project/parsel/) library, which supports CSS and XPath selectors for HTML responses and JMESPath for JSON responses. We can extract data from all kinds of complex HTML structures using XPath. In this example, we will use Parsel to crawl github.com and extract the page title, URL and emails found on the webpage. The default handler will scrape data from the current webpage and enqueue all the links found there for continuous scraping.


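A sketch of a handler using Parsel selectors; the XPath expressions are illustrative only:

```python
from crawlee.parsel_crawler import ParselCrawlingContext


async def request_handler(context: ParselCrawlingContext) -> None:
    # Query the parsed response with XPath (CSS selectors work as well).
    title = context.selector.xpath('//title/text()').get()
    emails = context.selector.xpath('//a[starts-with(@href, "mailto:")]/text()').getall()

    await context.push_data({
        'url': context.request.url,
        'title': title,
        'emails': emails,
    })

    # Keep crawling the links discovered on the page.
    await context.enqueue_links()
```
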
```python
6 changes: 4 additions & 2 deletions docs/examples/playwright-crawler.mdx
@@ -3,9 +3,11 @@ id: playwright-crawler
title: Playwright crawler
---

This example demonstrates how to use `PlaywrightCrawler` to recursively scrape the Hacker news website using headless Chromium and Playwright.
import ApiLink from '@site/src/components/ApiLink';

The `PlaywrightCrawler` manages the browser and page instances, simplifying the process of interacting with web pages. In the request handler, Playwright's API is used to extract data from each post on the page. Specifically, it retrieves the title, rank, and URL of each post. Additionally, the handler enqueues links to the next pages to ensure continuous scraping. This setup is ideal for scraping dynamic web pages where JavaScript execution is required to render the content.
This example demonstrates how to use <ApiLink to="class/PlaywrightCrawler">`PlaywrightCrawler`</ApiLink> to recursively scrape the Hacker News website using headless Chromium and Playwright.

The <ApiLink to="class/PlaywrightCrawler">`PlaywrightCrawler`</ApiLink> manages the browser and page instances, simplifying the process of interacting with web pages. In the request handler, Playwright's API is used to extract data from each post on the page. Specifically, it retrieves the title, rank, and URL of each post. Additionally, the handler enqueues links to the next pages to ensure continuous scraping. This setup is ideal for scraping dynamic web pages where JavaScript execution is required to render the content.

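A sketch of the handler portion; the CSS selectors for Hacker News are illustrative and may need adjusting:

```python
from crawlee.playwright_crawler import PlaywrightCrawlingContext


async def request_handler(context: PlaywrightCrawlingContext) -> None:
    # Each post on Hacker News is rendered as a row with the 'athing' class.
    posts = await context.page.query_selector_all('.athing')
    for post in posts:
        title_element = await post.query_selector('.titleline > a')
        rank_element = await post.query_selector('.rank')
        if title_element is None:
            continue  # Skip rows without a title link.
        await context.push_data({
            'title': await title_element.inner_text(),
            'url': await title_element.get_attribute('href'),
            'rank': await rank_element.inner_text() if rank_element else None,
        })

    # Enqueue the link to the next page to keep scraping.
    await context.enqueue_links(selector='.morelink')
```
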
```python
import asyncio
6 changes: 3 additions & 3 deletions docs/guides/http_clients.mdx
@@ -16,7 +16,7 @@ HTTP clients are utilized by the HTTP-based crawlers (e.g. <ApiLink to="class/Be

## How to switch between HTTP clients

In Crawlee we currently have two HTTP clients: <ApiLink to="class/HttpxHttpClient">`HttpxHttpClient`</ApiLink>, which uses the `httpx` library, and <ApiLink to="class/CurlImpersonateHttpClient">`CurlImpersonateHttpClient`</ApiLink>, which uses the `curl-cffi` library. You can switch between them by setting the `http_client` parameter in the Crawler class. The default HTTP client is `HttpxHttpClient`. Below are examples of how to set the HTTP client for the `BeautifulSoupCrawler`.
In Crawlee we currently have two HTTP clients: <ApiLink to="class/HttpxHttpClient">`HttpxHttpClient`</ApiLink>, which uses the `httpx` library, and <ApiLink to="class/CurlImpersonateHttpClient">`CurlImpersonateHttpClient`</ApiLink>, which uses the `curl-cffi` library. You can switch between them by setting the `http_client` parameter in the Crawler class. The default HTTP client is <ApiLink to="class/HttpxHttpClient">`HttpxHttpClient`</ApiLink>. Below are examples of how to set the HTTP client for the <ApiLink to="class/BeautifulSoupCrawler">`BeautifulSoupCrawler`</ApiLink>.

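A sketch of the switch; the import paths below are assumptions and may differ between versions:

```python
from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler
from crawlee.http_clients import HttpxHttpClient

# Explicitly pass the default client ...
crawler = BeautifulSoupCrawler(http_client=HttpxHttpClient())

# ... or swap in the curl-impersonate based client instead
# (requires the 'curl-impersonate' extra to be installed):
# from crawlee.http_clients.curl_impersonate import CurlImpersonateHttpClient
# crawler = BeautifulSoupCrawler(http_client=CurlImpersonateHttpClient())
```
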
<Tabs>
<TabItem value="BeautifulSoupHttpxExample" label="BeautifulSoupCrawler with HTTPX">
@@ -33,7 +33,7 @@ In Crawlee we currently have two HTTP clients: <ApiLink to="class/HttpxHttpClien

### Installation

Since `HttpxHttpClient` is the default HTTP client, you don't need to install additional packages to use it. If you want to use `CurlImpersonateHttpClient`, you need to install `crawlee` with the `curl-impersonate` extra.
Since <ApiLink to="class/HttpxHttpClient">`HttpxHttpClient`</ApiLink> is the default HTTP client, you don't need to install additional packages to use it. If you want to use <ApiLink to="class/CurlImpersonateHttpClient">`CurlImpersonateHttpClient`</ApiLink>, you need to install `crawlee` with the `curl-impersonate` extra.

```sh
pip install 'crawlee[curl-impersonate]'
@@ -47,4 +47,4 @@ pip install 'crawlee[all]'

## How HTTP clients work

We provide an abstract base class, <ApiLink to="class/BaseHttpClient">`BaseHttpClient`</ApiLink>, which defines the necessary interface for all HTTP clients. HTTP clients are responsible for sending requests and receiving responses, as well as managing cookies, headers, and proxies. They provide methods that are called from crawlers. To implement your own HTTP client, inherit from the `BaseHttpClient` class and implement the required methods.
We provide an abstract base class, <ApiLink to="class/BaseHttpClient">`BaseHttpClient`</ApiLink>, which defines the necessary interface for all HTTP clients. HTTP clients are responsible for sending requests and receiving responses, as well as managing cookies, headers, and proxies. They provide methods that are called from crawlers. To implement your own HTTP client, inherit from the <ApiLink to="class/BaseHttpClient">`BaseHttpClient`</ApiLink> class and implement the required methods.