feat: Implement ParselCrawler that adds support for Parsel (#348)
### Description

<!-- The purpose of the PR, list of the changes, ... -->

- Implemented `ParselCrawler`, which adds support for
[Parsel](https://github.com/scrapy/parsel) (a short usage sketch follows this list)
- Added unit tests for ParselCrawler
- Added example in the docs for ParselCrawler usage
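
A minimal sketch of what the new crawler looks like in use, based on the `ParselCrawler.__init__` signature and the docs example added in this PR (the target URL and the status codes below are placeholders, not part of the change):

```python
import asyncio

from crawlee.parsel_crawler import ParselCrawler, ParselCrawlingContext


async def main() -> None:
    crawler = ParselCrawler(
        # Treat 403 as a retryable error and 404 as a successful response
        # (these kwargs are forwarded to the underlying HttpxHttpClient).
        additional_http_error_status_codes=[403],
        ignore_http_error_status_codes=[404],
        max_requests_per_crawl=10,
    )

    @crawler.router.default_handler
    async def handler(context: ParselCrawlingContext) -> None:
        # The Parsel selector is exposed on the crawling context.
        await context.push_data({
            'url': context.request.url,
            'title': context.selector.xpath('//title/text()').get(),
        })

    await crawler.run(['https://crawlee.dev'])


if __name__ == '__main__':
    asyncio.run(main())
```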

### Issues

<!-- If applicable, reference any related GitHub issues -->

- Closes: #335

### Testing

<!-- Describe the testing process for these changes -->

- Testing example included in the docs.

### Checklist

- [x] Changes are described in the `CHANGELOG.md`
- [x] CI passed

---------

Co-authored-by: Jan Buchar <[email protected]>
asymness and janbuchar authored Aug 7, 2024
1 parent b0fd5da commit a3832e5
Showing 8 changed files with 466 additions and 3 deletions.
50 changes: 50 additions & 0 deletions docs/examples/parsel-crawler.mdx
@@ -0,0 +1,50 @@
---
id: parsel-crawler
title: Parsel crawler
---

This example shows how to use `ParselCrawler` to crawl a website or a list of URLs. Each URL is loaded with a plain HTTP request and the response is parsed using the [Parsel](https://pypi.org/project/parsel/) library, which supports CSS and XPath selectors for HTML responses and JMESPath for JSON responses. XPath lets us extract data from all kinds of complex HTML structures. In this example, we use Parsel to crawl github.com and extract the page title, URL, and any email addresses found on the page. The default handler scrapes data from the current page and enqueues all links found on it for further crawling. A standalone sketch of the other selector types follows the full example below.


```python
import asyncio

from crawlee.parsel_crawler import ParselCrawler, ParselCrawlingContext


async def main() -> None:
    crawler = ParselCrawler(
        # Limit the crawl to max requests. Remove or increase it for crawling all links.
        max_requests_per_crawl=10,
    )

    # Regex for identifying email addresses on a webpage.
    EMAIL_REGEX = r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'

    # Define the default request handler, which will be called for every request.
    @crawler.router.default_handler
    async def request_handler(context: ParselCrawlingContext) -> None:
        context.log.info(f'Processing {context.request.url} ...')

        # Extract data from the page via the Parsel selector exposed on the context.
        data = {
            'url': context.request.url,
            'title': context.selector.xpath('//title/text()').get(),
            'email_address_list': context.selector.re(EMAIL_REGEX),
        }

        # Push the extracted data to the default dataset.
        await context.push_data(data)

        # Enqueue all links found on the page.
        await context.enqueue_links()

    # Run the crawler with the initial list of URLs.
    await crawler.run(['https://github.com'])

    # Export the entire dataset to a JSON file.
    await crawler.export_data('results.json')


if __name__ == '__main__':
    asyncio.run(main())
```
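
The example above uses XPath and regular expressions. As a quick reference for the other selector flavours mentioned in the introduction (CSS and JMESPath), here is a minimal standalone sketch; it assumes parsel >= 1.9 (the version this PR pins) and its JSON handling for JMESPath queries:

```python
from parsel import Selector

html = '<html><head><title>Hello</title></head><body><a href="/about">About</a></body></html>'
sel = Selector(text=html)

# CSS selector for the title text.
print(sel.css('title::text').get())  # 'Hello'

# XPath selector for all link targets.
print(sel.xpath('//a/@href').getall())  # ['/about']

# JMESPath query against a JSON document.
json_sel = Selector(text='{"user": {"name": "octocat"}}')
print(json_sel.jmespath('user.name').get())  # 'octocat'
```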
6 changes: 5 additions & 1 deletion docs/introduction/01-setting-up.mdx
@@ -28,12 +28,16 @@ pip install crawlee

Additional, optional dependencies unlocking more features are shipped as package extras.

If you plan to parse HTML and use CSS selectors, install `crawlee` with `beautifulsoup` extra:
If you plan to parse HTML and use CSS selectors, install `crawlee` with either the `beautifulsoup` or `parsel` extra:

```sh
pip install 'crawlee[beautifulsoup]'
```

```sh
pip install 'crawlee[parsel]'
```

If you plan to use a (headless) browser, install `crawlee` with the `playwright` extra:

```sh
pip install 'crawlee[playwright]'
```
5 changes: 3 additions & 2 deletions docs/introduction/02-first-crawler.mdx
@@ -7,9 +7,10 @@ Now, you will build your first crawler. But before you do, let's briefly introdu

## How Crawlee works

There are 2 main crawler classes available for use in Crawlee.
There are 3 main crawler classes available for use in Crawlee.

- `BeautifulSoupCrawler`
- `ParselCrawler`
- `PlaywrightCrawler`

We'll talk about their differences later. Now, let's talk about what they have in common.
@@ -61,7 +62,7 @@ The code above is for illustration of the request queue concept. Soon you'll lea

### Building a BeautifulSoupCrawler

Crawlee comes with two main crawler classes: `BeautifulSoupCrawler`, and `PlaywrightCrawler`. You can read their short descriptions in the [Quick start](../quick-start) lesson.
Crawlee comes with three main crawler classes: `BeautifulSoupCrawler`, `ParselCrawler`, and `PlaywrightCrawler`. You can read their short descriptions in the [Quick start](../quick-start) lesson.

Unless you have a good reason to start with a different one, you should try building a `BeautifulSoupCrawler` first. It is an HTTP crawler with HTTP2 support, anti-blocking features and integrated HTML parser - [BeautifulSoup](https://pypi.org/project/beautifulsoup4/). It's fast, simple, cheap to run and does not require complicated dependencies. The only downside is that it won't work out of the box for websites which require JavaScript rendering. But you might not need JavaScript rendering at all, because many modern websites use server-side rendering.
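
As a rough illustration of that recommendation, a minimal `BeautifulSoupCrawler` might look like the sketch below. The `soup` attribute on the context and the target URL are assumptions based on the BeautifulSoup crawling context of the same release, not part of this diff:

```python
import asyncio

from crawlee.beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext


async def main() -> None:
    crawler = BeautifulSoupCrawler(max_requests_per_crawl=10)

    @crawler.router.default_handler
    async def handler(context: BeautifulSoupCrawlingContext) -> None:
        # `context.soup` holds the parsed BeautifulSoup document.
        title = context.soup.title.string if context.soup.title else None
        await context.push_data({'url': context.request.url, 'title': title})
        await context.enqueue_links()

    await crawler.run(['https://crawlee.dev'])


if __name__ == '__main__':
    asyncio.run(main())
```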

2 changes: 2 additions & 0 deletions pyproject.toml
@@ -68,6 +68,7 @@ sortedcollections = "^2.1.0"
tldextract = "^5.1.2"
typer = { version = "^0.12.3", extras = ["all"] }
typing-extensions = "^4.1.0"
parsel = { version = "^1.9.1", optional = true }

[tool.poetry.group.dev.dependencies]
build = "~1.2.0"
@@ -97,6 +98,7 @@ all = ["beautifulsoup4", "lxml", "html5lib", "curl-cffi", "playwright"]
beautifulsoup = ["beautifulsoup4", "lxml", "html5lib"]
curl-impersonate = ["curl-cffi"]
playwright = ["playwright"]
parsel = ["parsel"]

[tool.poetry.scripts]
crawlee = "crawlee.cli:cli"
10 changes: 10 additions & 0 deletions src/crawlee/parsel_crawler/__init__.py
@@ -0,0 +1,10 @@
try:
    from .parsel_crawler import ParselCrawler
    from .types import ParselCrawlingContext
except ImportError as exc:
    raise ImportError(
        "To import anything from this subpackage, you need to install the 'parsel' extra. "
        "For example, if you use pip, run `pip install 'crawlee[parsel]'`.",
    ) from exc

__all__ = ['ParselCrawler', 'ParselCrawlingContext']
151 changes: 151 additions & 0 deletions src/crawlee/parsel_crawler/parsel_crawler.py
@@ -0,0 +1,151 @@
from __future__ import annotations

import asyncio
import logging
from typing import TYPE_CHECKING, Any, AsyncGenerator, Iterable

from parsel import Selector
from typing_extensions import Unpack

from crawlee._utils.blocked import RETRY_CSS_SELECTORS
from crawlee._utils.urls import convert_to_absolute_url, is_url_absolute
from crawlee.basic_crawler import BasicCrawler, BasicCrawlerOptions, ContextPipeline
from crawlee.enqueue_strategy import EnqueueStrategy
from crawlee.errors import SessionError
from crawlee.http_clients.httpx import HttpxHttpClient
from crawlee.http_crawler import HttpCrawlingContext
from crawlee.models import BaseRequestData
from crawlee.parsel_crawler.types import ParselCrawlingContext

if TYPE_CHECKING:
    from crawlee.types import AddRequestsKwargs, BasicCrawlingContext


class ParselCrawler(BasicCrawler[ParselCrawlingContext]):
    """A crawler that fetches the request URL using `httpx` and parses the result with `Parsel`."""

    def __init__(
        self,
        *,
        additional_http_error_status_codes: Iterable[int] = (),
        ignore_http_error_status_codes: Iterable[int] = (),
        **kwargs: Unpack[BasicCrawlerOptions[ParselCrawlingContext]],
    ) -> None:
        """Initialize the ParselCrawler.

        Args:
            additional_http_error_status_codes: HTTP status codes that should be considered errors
                (and trigger a retry).
            ignore_http_error_status_codes: HTTP status codes that are normally considered errors
                but we want to treat them as successful.
            kwargs: Arguments to be forwarded to the underlying BasicCrawler.
        """
        kwargs['_context_pipeline'] = (
            ContextPipeline()
            .compose(self._make_http_request)
            .compose(self._parse_http_response)
            .compose(self._handle_blocked_request)
        )

        kwargs.setdefault(
            'http_client',
            HttpxHttpClient(
                additional_http_error_status_codes=additional_http_error_status_codes,
                ignore_http_error_status_codes=ignore_http_error_status_codes,
            ),
        )

        kwargs.setdefault('_logger', logging.getLogger(__name__))

        super().__init__(**kwargs)

    async def _make_http_request(self, context: BasicCrawlingContext) -> AsyncGenerator[HttpCrawlingContext, None]:
        result = await self._http_client.crawl(
            request=context.request,
            session=context.session,
            proxy_info=context.proxy_info,
            statistics=self._statistics,
        )

        yield HttpCrawlingContext(
            request=context.request,
            session=context.session,
            proxy_info=context.proxy_info,
            add_requests=context.add_requests,
            send_request=context.send_request,
            push_data=context.push_data,
            log=context.log,
            http_response=result.http_response,
        )

    async def _handle_blocked_request(
        self, crawling_context: ParselCrawlingContext
    ) -> AsyncGenerator[ParselCrawlingContext, None]:
        if self._retry_on_blocked:
            status_code = crawling_context.http_response.status_code

            if crawling_context.session and crawling_context.session.is_blocked_status_code(status_code=status_code):
                raise SessionError(f'Assuming the session is blocked based on HTTP status code {status_code}')

            matched_selectors = [
                selector
                for selector in RETRY_CSS_SELECTORS
                if crawling_context.selector.css(selector).get() is not None
            ]

            if matched_selectors:
                raise SessionError(
                    'Assuming the session is blocked - '
                    f"HTTP response matched the following selectors: {'; '.join(matched_selectors)}"
                )

        yield crawling_context

    async def _parse_http_response(
        self,
        context: HttpCrawlingContext,
    ) -> AsyncGenerator[ParselCrawlingContext, None]:
        parsel_selector = await asyncio.to_thread(lambda: Selector(body=context.http_response.read()))

        async def enqueue_links(
            *,
            selector: str = 'a',
            label: str | None = None,
            user_data: dict[str, Any] | None = None,
            **kwargs: Unpack[AddRequestsKwargs],
        ) -> None:
            kwargs.setdefault('strategy', EnqueueStrategy.SAME_HOSTNAME)

            requests = list[BaseRequestData]()
            user_data = user_data or {}

            link: Selector
            for link in parsel_selector.css(selector):
                link_user_data = user_data

                if label is not None:
                    link_user_data.setdefault('label', label)

                if (url := link.xpath('@href').get()) is not None:
                    url = url.strip()

                    if not is_url_absolute(url):
                        url = str(convert_to_absolute_url(context.request.url, url))

                    requests.append(BaseRequestData.from_url(url, user_data=link_user_data))

            await context.add_requests(requests, **kwargs)

        yield ParselCrawlingContext(
            request=context.request,
            session=context.session,
            proxy_info=context.proxy_info,
            enqueue_links=enqueue_links,
            add_requests=context.add_requests,
            send_request=context.send_request,
            push_data=context.push_data,
            log=context.log,
            http_response=context.http_response,
            selector=parsel_selector,
        )
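
For context, the `enqueue_links` helper defined above accepts a CSS `selector`, an optional `label`, and extra `user_data`, and defaults to the `SAME_HOSTNAME` strategy. A hypothetical handler using those parameters might look like the sketch below; the CSS class, label, and URL are made up for illustration:

```python
import asyncio

from crawlee.parsel_crawler import ParselCrawler, ParselCrawlingContext


async def main() -> None:
    crawler = ParselCrawler(max_requests_per_crawl=20)

    @crawler.router.default_handler
    async def handler(context: ParselCrawlingContext) -> None:
        # Follow only links matching the given CSS selector, tag the enqueued
        # requests with a label, and attach extra user data to each of them.
        await context.enqueue_links(
            selector='a.category-link',
            label='CATEGORY',
            user_data={'source': context.request.url},
        )

    await crawler.run(['https://crawlee.dev'])


if __name__ == '__main__':
    asyncio.run(main())
```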
18 changes: 18 additions & 0 deletions src/crawlee/parsel_crawler/types.py
@@ -0,0 +1,18 @@
from __future__ import annotations

from dataclasses import dataclass
from typing import TYPE_CHECKING

from crawlee.http_crawler import HttpCrawlingResult
from crawlee.types import BasicCrawlingContext, EnqueueLinksFunction

if TYPE_CHECKING:
    from parsel import Selector


@dataclass(frozen=True)
class ParselCrawlingContext(HttpCrawlingResult, BasicCrawlingContext):
    """Crawling context used by ParselCrawler."""

    selector: Selector
    enqueue_links: EnqueueLinksFunction