docs: move code samples to separate files (#457)
### Description

- Move code samples to separate files (so they can be refactored, linted, type-checked, etc.)
- Until now this was done only for the guides section; this PR extends it to all doc files.

### Issues

- N/A

### Testing

- Docs were rendered locally, working.

### Checklist

- [x] CI passed
Showing 78 changed files with 1,440 additions and 1,356 deletions.
@@ -0,0 +1,40 @@

---
id: add-data-to-dataset
title: Add data to dataset
---

import ApiLink from '@site/src/components/ApiLink';
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';
import CodeBlock from '@theme/CodeBlock';

import BeautifulSoupExample from '!!raw-loader!./code/add_data_to_dataset_bs.py';
import PlaywrightExample from '!!raw-loader!./code/add_data_to_dataset_pw.py';
import DatasetExample from '!!raw-loader!./code/add_data_to_dataset_dataset.py';

This example demonstrates how to store extracted data in datasets using the <ApiLink to="class/PushDataFunction#open">`context.push_data`</ApiLink> helper function. If the specified dataset does not already exist, it will be created automatically. You can also save data to custom datasets by passing the `dataset_id` or `dataset_name` parameter to <ApiLink to="class/PushDataFunction#open">`push_data`</ApiLink>.

<Tabs groupId="main">
    <TabItem value="BeautifulSoupCrawler" label="BeautifulSoupCrawler">
        <CodeBlock className="language-python">
            {BeautifulSoupExample}
        </CodeBlock>
    </TabItem>
    <TabItem value="PlaywrightCrawler" label="PlaywrightCrawler">
        <CodeBlock className="language-python">
            {PlaywrightExample}
        </CodeBlock>
    </TabItem>
</Tabs>

Each item in the dataset will be stored in its own file within the following directory:

```text
{PROJECT_FOLDER}/storage/datasets/default/
```

For more control, you can also open a dataset manually using the asynchronous constructor <ApiLink to="class/Dataset#open">`Dataset.open`</ApiLink>:

<CodeBlock className="language-python">
    {DatasetExample}
</CodeBlock>
@@ -0,0 +1,15 @@

---
id: beautifulsoup-crawler
title: BeautifulSoup crawler
---

import ApiLink from '@site/src/components/ApiLink';
import CodeBlock from '@theme/CodeBlock';

import BeautifulSoupExample from '!!raw-loader!./code/beautifulsoup_crawler.py';

This example demonstrates how to use <ApiLink to="class/BeautifulSoupCrawler">`BeautifulSoupCrawler`</ApiLink> to crawl a list of URLs, load each URL with a plain HTTP request, parse the HTML using the [BeautifulSoup](https://pypi.org/project/beautifulsoup4/) library, and extract some data from it: the page title and all `<h1>`, `<h2>` and `<h3>` tags. This setup is perfect for scraping specific elements from web pages. Thanks to the well-known BeautifulSoup library, you can easily navigate the HTML structure and retrieve the data you need with minimal code.

<CodeBlock className="language-python">
    {BeautifulSoupExample}
</CodeBlock>
@@ -0,0 +1,19 @@

---
id: capture-screenshots-using-playwright
title: Capture screenshots using Playwright
---

import ApiLink from '@site/src/components/ApiLink';
import CodeBlock from '@theme/CodeBlock';

import CaptureScreenshotExample from '!!raw-loader!./code/capture_screenshot_using_playwright.py';

This example demonstrates how to capture screenshots of web pages using <ApiLink to="class/PlaywrightCrawler">`PlaywrightCrawler`</ApiLink> and store them in the key-value store.

The <ApiLink to="class/PlaywrightCrawler">`PlaywrightCrawler`</ApiLink> is configured to automate browsing and interaction with web pages. It uses headless Chromium as the browser type to perform these tasks. Each web page in the initial list of URLs is visited sequentially, and a screenshot of the page is captured using Playwright's `page.screenshot()` method.

The captured screenshots are stored in the key-value store, which is well suited for managing and storing files in various formats. In this case, screenshots are stored as PNG images with a unique key generated from the URL of the page.

<CodeBlock className="language-python">
    {CaptureScreenshotExample}
</CodeBlock>
@@ -0,0 +1,35 @@

```python
import asyncio

from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext


async def main() -> None:
    crawler = BeautifulSoupCrawler()

    # Define the default request handler, which will be called for every request.
    @crawler.router.default_handler
    async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')

        # Extract data from the page.
        data = {
            'url': context.request.url,
            'title': context.soup.title.string if context.soup.title else None,
            'html': str(context.soup)[:1000],
        }

        # Push the extracted data to the default dataset.
        await context.push_data(data)

    # Run the crawler with the initial list of requests.
    await crawler.run(
        [
            'https://crawlee.dev',
            'https://apify.com',
            'https://example.com',
        ]
    )


if __name__ == '__main__':
    asyncio.run(main())
```
@@ -0,0 +1,9 @@

```python
from crawlee.storages import Dataset


async def main() -> None:
    # Open the dataset manually using the asynchronous constructor open().
    dataset = await Dataset.open()

    # Interact with the dataset directly.
    await dataset.push_data({'key': 'value'})
```
@@ -0,0 +1,35 @@

```python
import asyncio

from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext


async def main() -> None:
    crawler = PlaywrightCrawler()

    # Define the default request handler, which will be called for every request.
    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')

        # Extract data from the page.
        data = {
            'url': context.request.url,
            'title': await context.page.title(),
            'html': str(await context.page.content())[:1000],
        }

        # Push the extracted data to the default dataset.
        await context.push_data(data)

    # Run the crawler with the initial list of requests.
    await crawler.run(
        [
            'https://crawlee.dev',
            'https://apify.com',
            'https://example.com',
        ]
    )


if __name__ == '__main__':
    asyncio.run(main())
```
@@ -0,0 +1,25 @@

```python
import asyncio

from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext


async def main() -> None:
    crawler = BeautifulSoupCrawler(
        # Limit the crawl to a maximum number of requests.
        # Remove or increase this limit to crawl all links.
        max_requests_per_crawl=10,
    )

    # Define the default request handler, which will be called for every request.
    @crawler.router.default_handler
    async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')

        # Enqueue all links found on the page.
        await context.enqueue_links()

    # Run the crawler with the initial list of requests.
    await crawler.run(['https://crawlee.dev'])


if __name__ == '__main__':
    asyncio.run(main())
```
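Conceptually, `enqueue_links` finds the anchors on the current page, resolves relative URLs against the page URL, deduplicates them, and adds them to the request queue (the real helper also supports filtering strategies and respects limits such as `max_requests_per_crawl`). A rough, stdlib-only sketch of that core idea, under those stated assumptions:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect href attribute values from <a> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            for name, value in attrs:
                if name == 'href' and value:
                    self.hrefs.append(value)


def extract_links(base_url: str, html: str) -> list[str]:
    """Resolve and deduplicate all links on a page, preserving order."""
    collector = LinkCollector()
    collector.feed(html)
    seen = set()
    links = []
    for href in collector.hrefs:
        # Resolve relative hrefs against the URL of the current page.
        absolute = urljoin(base_url, href)
        if absolute not in seen:
            seen.add(absolute)
            links.append(absolute)
    return links


page = '<a href="/docs">Docs</a> <a href="https://apify.com">Apify</a> <a href="/docs">Again</a>'
print(extract_links('https://crawlee.dev', page))
# ['https://crawlee.dev/docs', 'https://apify.com']
```

In the actual crawler, the deduplicated URLs would be fed back into the request queue, which is how a single start URL fans out into a whole crawl.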
@@ -0,0 +1,25 @@

```python
import asyncio

from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext


async def main() -> None:
    crawler = PlaywrightCrawler(
        # Limit the crawl to a maximum number of requests.
        # Remove or increase this limit to crawl all links.
        max_requests_per_crawl=10,
    )

    # Define the default request handler, which will be called for every request.
    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')

        # Enqueue all links found on the page.
        await context.enqueue_links()

    # Run the crawler with the initial list of requests.
    await crawler.run(['https://crawlee.dev'])


if __name__ == '__main__':
    asyncio.run(main())
```