Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Implement ParselCrawler that adds support for Parsel #348

Merged
merged 8 commits into from
Aug 7, 2024

Conversation

asymness
Copy link
Contributor

Description

  • Implemented ParselCrawler which adds support for Parsel
  • Added unit tests for ParselCrawler
  • Added example in the docs for ParselCrawler usage

Issues

Testing

  • Testing example included in the docs.

Checklist

  • Changes are described in the CHANGELOG.md
  • CI passed

@asymness asymness marked this pull request as draft July 23, 2024 14:26
Comment on lines 215 to 222
# Simulate ImportError for parsel
with mock.patch.dict('sys.modules', {'parsel': None}):
# Invalidate ParselCrawler import
sys.modules.pop('crawlee.parsel_crawler', None)
sys.modules.pop('crawlee.parsel_crawler.parsel_crawler', None)

with pytest.raises(ImportError) as import_error:
from crawlee.parsel_crawler import ParselCrawler
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice trick!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! 🙌

@asymness asymness marked this pull request as ready for review July 23, 2024 18:09
@asymness asymness requested a review from janbuchar July 23, 2024 18:09
Copy link
Collaborator

@janbuchar janbuchar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice job, thank you! Just two comments to think about.

class ParselCrawlingContext(HttpCrawlingResult, BasicCrawlingContext):
"""Crawling context used by ParselCrawler."""

parsel: Selector
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From a quick glance at the docs, I believe that the idiomatic name for this would be selector. What do you think?

)

# Regex for identifying email addresses on a webpage.
EMAIL_REGEX = r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm conflicted about parsing e-mail addresses with regexp like this. The presented use case is cool, but there is almost certainly some mistake here.

@fnesveda fnesveda added the t-tooling Issues with this label are in the ownership of the tooling team. label Jul 31, 2024
Copy link
Collaborator

@vdusek vdusek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM; just docs:

  • Could we add link to Parsel library?
  • Could we add API reference link to ParselCrawler?
  • Could we mention the installation of the "parsel" extra?

@janbuchar janbuchar merged commit a3832e5 into apify:master Aug 7, 2024
19 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
t-tooling Issues with this label are in the ownership of the tooling team.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

add support for Parsel
4 participants