Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

create DuplicateUrlDiscarderPipeline #17

Merged
merged 11 commits into from
Jul 19, 2024
Merged

create DuplicateUrlDiscarderPipeline #17

merged 11 commits into from
Jul 19, 2024

Conversation

BurnzZ
Copy link
Member

@BurnzZ BurnzZ commented Jul 17, 2024

No description provided.

@BurnzZ BurnzZ marked this pull request as ready for review July 17, 2024 09:25
README.rst Outdated Show resolved Hide resolved
README.rst Outdated Show resolved Hide resolved
tests/test_pipelines.py Outdated Show resolved Hide resolved
Comment on lines 29 to 45
def test_item_signature() -> None:
@dataclass
class FakeItem:
name: str

value = "fake_item"
item = FakeItem(value)
adapter = ItemAdapter(item)

assert item_signature(adapter, ["name"]) == hash(value)

exception_text = (
"Got type <class 'tests.test_utils.test_item_signature.<locals>.FakeItem'> "
"but expected ItemAdapter."
)
with pytest.raises(ValueError, match=exception_text):
item_signature(item, ["name"]) # type: ignore
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we keep the current implementation, maybe we could add an xfail test for:

item1 = {"a": "a|", "b": "b"}
item2 = {"a": "a", "b": "|b"}
fields = ["a", "b"]
signature = "a||b"

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very interesting edge case! ✨

Let's go with adding the attrib name (or dict key) in the hash calculation. As this would mean that the left-hand value is guaranteed to be unique since it's being used as a dict-key or an attribute name. afd2a0f

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then:

item1 = {"a": "a|b:b", "b": "b"}
item2 = {"a": "a", "b": "b|b:b"}
fields = ["a", "b"]
signature = "a:a|b:b|b:b"

🙂

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

😆

Tempted to go with the route of closer to json.dumps() but still faster:

>>> python -m timeit "item = {'a':1,'b':2,'c':3}; import json; hash(json.dumps({k:item.get(k) for k in ['a', 'b', 'c']}))"
200000 loops, best of 5: 1.41 usec per loop

>>> python -m timeit "item = {'a':1,'b':2,'c':3}; hash('|'.join(f'{k}:{item.get(k)}' for k in ['a', 'b', 'c']))"
500000 loops, best of 5: 599 nsec per loop

>>> python -m timeit "item = {'a':1,'b':2,'c':3}; hash(','.join(f'\"{k}\": \"{item.get(k)}\"' for k in ['a', 'b', 'c']))"
500000 loops, best of 5: 649 nsec per loop

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really think the original implementation was OK, it would be extremely unlikely for this to be an issue with real data, and if that ever becomes a concern, we could always allow customizing | into something else. But I think it makes sense to have an xfail test, as a “we know this can happen in theory” statement in code.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added xfail in c562bc6. Cheers again for raising!

]
DUD_LOAD_RULE_PATHS = [
"/home/user/project/custom_rules1.json",
]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be helpful to explain how to use locally defined rules with spiders running in Scrapy Cloud?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. And how to combine custom rules with the default ones. But it may be out of scope for this PR, I think.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this PR doesn't change this part of the doc except formatting but it's nice to have as a separate PR.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for raising @PyExplorer . Created #19 for tracking.

@BurnzZ
Copy link
Member Author

BurnzZ commented Jul 19, 2024

Merging this one. Thanks for the review folks!

@BurnzZ BurnzZ merged commit d0bdb6d into main Jul 19, 2024
8 checks passed
@wRAR wRAR deleted the pipeline branch October 15, 2024 13:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants