docs: Deployment guide (#482)
Added a doc page about running crawlers on Apify using the SDK. I tested
it; fixes will be required before publishing this.
janbuchar authored Sep 11, 2024
1 parent 6581700 commit 66f5253
Showing 15 changed files with 1,143 additions and 25 deletions.
252 changes: 252 additions & 0 deletions docs/deployment/apify_platform.mdx
---
id: apify-platform
title: Apify platform
description: Apify platform - large-scale and high-performance web scraping
---

import ApiLink from '@site/src/components/ApiLink';

import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';
import CodeBlock from '@theme/CodeBlock';

import MainSource from '!!raw-loader!./code/apify_platform_main.py';
import GetPublicUrlSource from '!!raw-loader!./code/apify_platform_get_public_url.py';

Apify is a [platform](https://apify.com) built to serve large-scale and high-performance web scraping and automation needs. It provides easy access to [compute instances (Actors)](#what-is-an-actor), convenient request and result storages, [proxies](../guides/proxy-management), scheduling, webhooks and [more](https://docs.apify.com/), accessible through a [web interface](https://console.apify.com) or an [API](https://docs.apify.com/api).

While we think that the Apify platform is super cool, and it's definitely worth signing up for a [free account](https://console.apify.com/sign-up), **Crawlee is and will always be open source**, runnable locally or on any cloud infrastructure.

:::note

We do not test Crawlee in other cloud environments such as Lambda or on specific architectures such as Raspberry Pi. We strive to make it work, but there are no guarantees.

:::

## Logging into Apify platform from Crawlee

To access your [Apify account](https://console.apify.com/sign-up) from Crawlee, you must provide credentials - your [API token](https://console.apify.com/account?tab=integrations). You can do that either by utilizing [Apify CLI](https://github.com/apify/apify-cli) or with environment variables.

Once you provide credentials to your Apify CLI installation, you will be able to use all the Apify platform features, such as calling Actors, saving to cloud storages, using Apify proxies, setting up webhooks and so on.

### Log in with CLI

Apify CLI allows you to log in to your Apify account on your computer. If you then run your crawler using the CLI, your credentials will automatically be added.

```bash
npm install -g apify-cli
apify login -t YOUR_API_TOKEN
```

### Log in with environment variables

Alternatively, you can always provide credentials to your Actor by setting the [`APIFY_TOKEN`](#apify_token) environment variable to your API token.

> There's also the [`APIFY_PROXY_PASSWORD`](#apify_proxy_password)
> environment variable. The Actor automatically infers it from your token, but it can be useful
> when you need to access proxies from a different account than the one your token represents.
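
For example, when running your crawler locally, you could export the token in your shell before starting it (the token value below is just a placeholder):

```bash
export APIFY_TOKEN=your_apify_token
apify run
```
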
### Log in with Configuration

Another option is to use the [`Configuration`](https://docs.apify.com/sdk/python/reference/class/Configuration) instance and set your API token there.

```python
from apify import Actor, Configuration

actor = Actor(Configuration(token='your_apify_token'))
```

## What is an Actor

When you deploy your script to the Apify platform, it becomes an [Actor](https://apify.com/actors). An Actor is a serverless microservice that accepts an input and produces an output. It can run for a few seconds, hours or even infinitely. An Actor can perform anything from a simple action such as filling out a web form or sending an email, to complex operations such as crawling an entire website and removing duplicates from a large dataset.

Actors can be shared in the [Apify Store](https://apify.com/store) so that other people can use them. But don't worry, if you share your Actor in the store and somebody uses it, it runs under their account, not yours.

**Related links**

- [Store of existing Actors](https://apify.com/store)
- [Documentation](https://docs.apify.com/actors)
- [View Actors in Apify Console](https://console.apify.com/actors)
- [API reference](https://apify.com/docs/api/v2#/reference/actors)

## Running an Actor locally

First, let's create a boilerplate for the new Actor. You can use the Apify CLI and just run:

```bash
apify create my-hello-world
```

The CLI will prompt you to select a project boilerplate template - let's pick "Crawlee + BeautifulSoup". The tool will create a directory called `my-hello-world` with Python project files. You can run the Actor as follows:

```bash
cd my-hello-world
apify run
```

## Running Crawlee code as an Actor

To run Crawlee code as an Actor on the [Apify platform](https://apify.com/actors), you need to wrap the body of the main function of your crawler with `async with Actor`.

:::info NOTE
Adding `async with Actor` is the only change needed to run your code on the Apify platform as an Actor. It is needed to initialize your Actor (e.g. to set the correct storage implementation) and to correctly handle exiting the process.
:::

Let's look at the `BeautifulSoupCrawler` example from the [Quick Start](../quick-start) guide:

<CodeBlock language="python">
{MainSource}
</CodeBlock>

Note that you can also run your Actor (that uses Crawlee) locally with the Apify CLI. You can start it by running the following command in your project folder:

```bash
apify run
```

## Deploying an Actor to Apify platform

Now (assuming you are already logged in to your Apify account) you can easily deploy your code to the Apify platform by running:

```bash
apify push
```

Your script will be uploaded to and built on the Apify platform so that it can be run there. For more information, see the
[Apify CLI](https://docs.apify.com/cli) documentation.

## Usage on Apify platform

You can also develop your Actor in an online code editor directly on the platform (you'll need an Apify account). Go to the [Actors](https://console.apify.com/actors) page in the Apify Console, click *Create new*, then open the *Source* tab and start writing the code, or paste one of the examples from the [Examples](../examples) section.

## Storages

There are several things worth mentioning about storages when running on the Apify platform.

### Helper functions for default Key-Value Store and Dataset

To simplify access to the _default_ storages, you can use the following `Actor` helper methods instead of the methods of the respective storage classes (see the sketch after the list):
- [`Actor.set_value()`](https://docs.apify.com/sdk/python/reference/class/Actor#set_value), [`Actor.get_value()`](https://docs.apify.com/sdk/python/reference/class/Actor#get_value), [`Actor.get_input()`](https://docs.apify.com/sdk/python/reference/class/Actor#get_input) for [`Key-Value Store`](https://docs.apify.com/sdk/python/reference/class/KeyValueStore)
- [`Actor.push_data()`](https://docs.apify.com/sdk/python/reference/class/Actor#push_data) for [`Dataset`](https://docs.apify.com/sdk/python/reference/class/Dataset)
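
Below is a minimal sketch of how these helpers might be used together inside an Actor; the key `'last-input'` and the pushed record are placeholders, not part of the original example:

```python
from apify import Actor


async def main() -> None:
    async with Actor:
        # Read the Actor input from the default key-value store.
        actor_input = await Actor.get_input() or {}

        # Store a value in the default key-value store and read it back.
        await Actor.set_value('last-input', actor_input)
        last_input = await Actor.get_value('last-input')
        Actor.log.info(f'Last input: {last_input}')

        # Append a record to the default dataset.
        await Actor.push_data({'status': 'ok'})
```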

### Using platform storage in a local Actor

When you plan to use the platform storage while developing and running your Actor locally, you should use [`Actor.open_key_value_store()`](https://docs.apify.com/sdk/python/reference/class/Actor#open_key_value_store), [`Actor.open_dataset()`](https://docs.apify.com/sdk/python/reference/class/Actor#open_dataset) and [`Actor.open_request_queue()`](https://docs.apify.com/sdk/python/reference/class/Actor#open_request_queue) to open the respective storage.

Each of these methods accepts a `force_cloud` keyword argument. If it is set to `True`, cloud storage will be used instead of the folder on the local disk.
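
As a minimal sketch, assuming `APIFY_TOKEN` is set in your environment, opening a dataset on the platform from a locally running Actor might look like this (the dataset name is a placeholder):

```python
from apify import Actor


async def main() -> None:
    async with Actor:
        # Open a dataset on the Apify platform even though the Actor runs locally.
        dataset = await Actor.open_dataset(name='my-results', force_cloud=True)
        await dataset.push_data({'url': 'https://crawlee.dev'})
```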

:::note
If you don't plan to force usage of the platform storages when running the Actor locally, there is no need to use the [`Actor`](https://docs.apify.com/sdk/python/reference/class/Actor) class for it. The Crawlee variants <ApiLink to="class/KeyValueStore#open">`KeyValueStore.open()`</ApiLink>, <ApiLink to="class/Dataset#open">`Dataset.open()`</ApiLink> and <ApiLink to="class/RequestQueue#open">`RequestQueue.open()`</ApiLink> will work the same.
:::

{/*
### Getting public url of an item in the platform storage
If you need to share a link to some file stored in a [Key-Value Store](https://docs.apify.com/sdk/python/reference/class/KeyValueStore) on the Apify platform, you can use the [`get_public_url()`](https://docs.apify.com/sdk/python/reference/class/KeyValueStore#get_public_url) method. It accepts only one parameter: `key` - the key of the item you want to share.
<CodeBlock language="python">
{GetPublicUrlSource}
</CodeBlock>
*/}

### Exporting dataset data

When the <ApiLink to="class/Dataset">`Dataset`</ApiLink> is stored on the [Apify platform](https://apify.com/actors), you can export its data to the following formats: HTML, JSON, CSV, Excel, XML and RSS. The datasets are displayed on the Actor run details page and in the [Storage](https://console.apify.com/storage) section in the Apify Console. The actual data is exported using the [Get dataset items](https://apify.com/docs/api/v2#/reference/datasets/item-collection/get-items) Apify API endpoint. This way you can easily share the crawling results.
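
As a sketch, assuming you know the dataset ID (the `DATASET_ID` below is a placeholder) and have `APIFY_TOKEN` set, you could download the items in CSV format with a single call to that endpoint:

```bash
curl "https://api.apify.com/v2/datasets/DATASET_ID/items?format=csv&token=$APIFY_TOKEN"
```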

**Related links**

- [Apify platform storage documentation](https://docs.apify.com/storage)
- [View storage in Apify Console](https://console.apify.com/storage)
- [Key-value stores API reference](https://apify.com/docs/api/v2#/reference/key-value-stores)
- [Datasets API reference](https://docs.apify.com/api/v2#/reference/datasets)
- [Request queues API reference](https://docs.apify.com/api/v2#/reference/request-queues)

## Environment variables

The following describes select environment variables set by the Apify platform. For a complete list, see the [Environment variables](https://docs.apify.com/platform/actors/development/programming-interface/environment-variables) section in the Apify platform documentation.

:::note

It's important to note that `CRAWLEE_` environment variables don't need to be replaced with their `APIFY_` equivalents. Crawlee understands the `APIFY_` environment variables as well.

:::

### `APIFY_TOKEN`

The API token for your Apify account. It is used to access the Apify API, e.g. to access cloud storage
or to run an Actor on the Apify platform. You can find your API token on the
[Account Settings / Integrations](https://console.apify.com/account?tab=integrations) page.

### Combinations of `APIFY_TOKEN` and `CRAWLEE_STORAGE_DIR`

By combining the env vars in various ways, you can greatly influence the Actor's behavior.

| Env Vars | API | Storages |
| --------------------------------------- | --- | ---------------- |
| none OR `CRAWLEE_STORAGE_DIR` | no | local |
| `APIFY_TOKEN` | yes | Apify platform |
| `APIFY_TOKEN` AND `CRAWLEE_STORAGE_DIR` | yes | local + platform |

When using both `APIFY_TOKEN` and `CRAWLEE_STORAGE_DIR`, you can use all the Apify platform
features and your data will be stored locally by default. If you want to access platform storages,
you can use the `force_cloud=True` option in their respective functions.
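
For example, a local setup that can call the platform API but keeps data on disk by default might look like this (the values are placeholders):

```bash
export APIFY_TOKEN=your_apify_token
export CRAWLEE_STORAGE_DIR=./storage
apify run
```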

### `APIFY_PROXY_PASSWORD`

Optional password to [Apify Proxy](https://docs.apify.com/proxy) for IP address rotation.
Assuming you have already created an Apify account, you can find the password on the [Proxy page](https://console.apify.com/proxy)
in the Apify Console. The password is automatically inferred using the `APIFY_TOKEN` env var,
so in most cases, you don't need to set it manually. You should use it when, for some reason,
you need access to Apify Proxy but not to the Apify API, or when you need to access proxies
from a different account than the one your token represents.

## Proxy management

In addition to your own proxy servers and proxy servers acquired from
third-party providers used together with Crawlee, you can also rely on [Apify Proxy](https://apify.com/proxy)
for your scraping needs.

### Apify Proxy

If you are already subscribed to Apify Proxy, you can start using it immediately in only a few lines of code (for local usage, you should first be [logged in](#logging-into-apify-platform-from-crawlee) to your Apify account).

```python
from apify import Actor


async def main() -> None:
    async with Actor:
        proxy_configuration = await Actor.create_proxy_configuration()
        proxy_url = await proxy_configuration.new_url()
```

Note that, unlike when using your own proxies in Crawlee, you shouldn't use the constructor to create <ApiLink to="class/ProxyConfiguration">`ProxyConfiguration`</ApiLink> instances. To use Apify Proxy, create the instance using the [`Actor.create_proxy_configuration()`](https://docs.apify.com/sdk/python/reference/class/Actor#create_proxy_configuration) function instead.

### Apify Proxy Configuration

With Apify Proxy, you can select specific proxy groups to use, or countries to connect from.
This allows you to get better proxy performance after some initial research.

```python
from apify import Actor


async def main() -> None:
    async with Actor:
        proxy_configuration = await Actor.create_proxy_configuration(
            groups=['RESIDENTIAL'],
            country_code='US',
        )
        proxy_url = await proxy_configuration.new_url()
```

Now your crawlers will use only Residential proxies from the US. Note that you must first get access
to a proxy group before you are able to use it. You can check proxy groups available to you
in the [proxy dashboard](https://console.apify.com/proxy).

### Apify Proxy vs. Own proxies

The [`ProxyConfiguration`](https://docs.apify.com/sdk/python/reference/class/ProxyConfiguration) class covers both Apify Proxy and custom proxy URLs so that you can easily switch between proxy providers. However, some features of the class are available only to Apify Proxy users, mainly because Apify Proxy is what one would call a super-proxy. It's not a single proxy server, but an API endpoint that allows connection through millions of different IP addresses. So the class essentially has two modes: Apify Proxy or Own (third party) proxy.

The difference is easy to remember.
- If you're using your own proxies, create a <ApiLink to="class/ProxyConfiguration">`ProxyConfiguration`</ApiLink> instance directly (see the sketch below).
- If you're planning to use Apify Proxy, create an instance using the [`Actor.create_proxy_configuration()`](https://docs.apify.com/sdk/python/reference/class/Actor#create_proxy_configuration) function instead. The `new_url_function` parameter enables the use of your custom proxy URLs, whereas all the other options are there to configure Apify Proxy.
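
As a minimal sketch of the own-proxy mode - assuming the import path `crawlee.proxy_configuration` and that the proxy URLs below are placeholders for your actual proxy servers:

```python
from crawlee.proxy_configuration import ProxyConfiguration

# Own proxies: pass the proxy URLs directly to the constructor.
proxy_configuration = ProxyConfiguration(
    proxy_urls=[
        'http://user:password@proxy-1.example.com:8000',
        'http://user:password@proxy-2.example.com:8000',
    ],
)
```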

**Related links**

- [Apify Proxy docs](https://docs.apify.com/proxy)
8 changes: 8 additions & 0 deletions docs/deployment/code/apify_platform_get_public_url.py
from apify import Actor


async def main() -> None:
    store = await Actor.open_key_value_store()
    await store.set_value('your-file', {'foo': 'bar'})
    # url = store.get_public_url('your-file')  # noqa: ERA001
    # https://api.apify.com/v2/key-value-stores/<your-store-id>/records/your-file
32 changes: 32 additions & 0 deletions docs/deployment/code/apify_platform_main.py
import asyncio

from apify import Actor

from crawlee import Glob
from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext


async def main() -> None:
    async with Actor:
        crawler = BeautifulSoupCrawler()

        @crawler.router.default_handler
        async def request_handler(context: BeautifulSoupCrawlingContext) -> None:
            url = context.request.url

            # Extract HTML title of the page.
            title_element = context.soup.find('title')
            title = title_element.text if title_element else ''
            context.log.info(f'Title of {url}: {title}')

            # Add URLs that match the provided pattern.
            await context.enqueue_links(include=[Glob('https://www.iana.org/*')])

            # Save extracted data to dataset.
            await context.push_data({'url': url, 'title': title})

        # Enqueue the initial request and run the crawler.
        await crawler.run(['https://www.iana.org/'])


asyncio.run(main())