-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
create DuplicateUrlDiscarderPipeline #17
Conversation
tests/test_utils.py
Outdated
def test_item_signature() -> None: | ||
@dataclass | ||
class FakeItem: | ||
name: str | ||
|
||
value = "fake_item" | ||
item = FakeItem(value) | ||
adapter = ItemAdapter(item) | ||
|
||
assert item_signature(adapter, ["name"]) == hash(value) | ||
|
||
exception_text = ( | ||
"Got type <class 'tests.test_utils.test_item_signature.<locals>.FakeItem'> " | ||
"but expected ItemAdapter." | ||
) | ||
with pytest.raises(ValueError, match=exception_text): | ||
item_signature(item, ["name"]) # type: ignore |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we keep the current implementation, maybe we could add an xfail test for:
item1 = {"a": "a|", "b": "b"}
item2 = {"a": "a", "b": "|b"}
fields = ["a", "b"]
signature = "a||b"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very interesting edge case! ✨
Let's go with adding the attrib name (or dict key) in the hash calculation. As this would mean that the left-hand value is guaranteed to be unique since it's being used as a dict-key or an attribute name. afd2a0f
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then:
item1 = {"a": "a|b:b", "b": "b"}
item2 = {"a": "a", "b": "b|b:b"}
fields = ["a", "b"]
signature = "a:a|b:b|b:b"
🙂
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
😆
Tempted to go with the route of closer to json.dumps()
but still faster:
>>> python -m timeit "item = {'a':1,'b':2,'c':3}; import json; hash(json.dumps({k:item.get(k) for k in ['a', 'b', 'c']}))"
200000 loops, best of 5: 1.41 usec per loop
>>> python -m timeit "item = {'a':1,'b':2,'c':3}; hash('|'.join(f'{k}:{item.get(k)}' for k in ['a', 'b', 'c']))"
500000 loops, best of 5: 599 nsec per loop
>>> python -m timeit "item = {'a':1,'b':2,'c':3}; hash(','.join(f'\"{k}\": \"{item.get(k)}\"' for k in ['a', 'b', 'c']))"
500000 loops, best of 5: 649 nsec per loop
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I really think the original implementation was OK, it would be extremely unlikely for this to be an issue with real data, and if that ever becomes a concern, we could always allow customizing |
into something else. But I think it makes sense to have an xfail test, as a “we know this can happen in theory” statement in code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added xfail in c562bc6. Cheers again for raising!
] | ||
DUD_LOAD_RULE_PATHS = [ | ||
"/home/user/project/custom_rules1.json", | ||
] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be helpful to explain how to use locally defined rules with spiders running in Scrapy Cloud?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. And how to combine custom rules with the default ones. But it may be out of scope for this PR, I think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, this PR doesn't change this part of the doc except formatting but it's nice to have as a separate PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for raising @PyExplorer . Created #19 for tracking.
Merging this one. Thanks for the review folks! |
No description provided.