-
Notifications
You must be signed in to change notification settings - Fork 264
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Implement ParselCrawler that adds support for Parsel #348
Conversation
# Simulate ImportError for parsel | ||
with mock.patch.dict('sys.modules', {'parsel': None}): | ||
# Invalidate ParselCrawler import | ||
sys.modules.pop('crawlee.parsel_crawler', None) | ||
sys.modules.pop('crawlee.parsel_crawler.parsel_crawler', None) | ||
|
||
with pytest.raises(ImportError) as import_error: | ||
from crawlee.parsel_crawler import ParselCrawler |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice trick!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! 🙌
2d92749
to
bcaf1d2
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice job, thank you! Just two comments to think about.
src/crawlee/parsel_crawler/types.py
Outdated
class ParselCrawlingContext(HttpCrawlingResult, BasicCrawlingContext): | ||
"""Crawling context used by ParselCrawler.""" | ||
|
||
parsel: Selector |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From a quick glance at the docs, I believe that the idiomatic name for this would be selector
. What do you think?
) | ||
|
||
# Regex for identifying email addresses on a webpage. | ||
EMAIL_REGEX = r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm conflicted about parsing e-mail addresses with regexp like this. The presented use case is cool, but there is almost certainly some mistake here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM; just docs:
- Could we add link to Parsel library?
- Could we add API reference link to
ParselCrawler
? - Could we mention the installation of the "parsel" extra?
Description
Issues
Testing
Checklist
CHANGELOG.md