
Lost Links causing failing data integrity! #51

Open
atb00ker opened this issue Dec 12, 2017 · 5 comments

Comments

@atb00ker
Contributor

The total number of requests sent does not equal received + dropped/failed for some spiders!
This bug needs to be addressed to ensure the integrity of the database!
The following spiders have this problem:
IndiaTv
time (Tech)
firstpost (sports)
firstpost (hindi)
More spiders might have the issue, but these are the ones that have been caught misbehaving so far!
@thisisayush
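The invariant being violated can be sketched in plain Python (a rough sketch; the counter names are illustrative, not the project's actual stat keys):

```python
# Hypothetical per-spider counters; names are illustrative only.
def integrity_ok(sent, received, failed):
    """The invariant this issue describes: every request sent must end
    up either received (callback ran) or dropped/failed (errback ran)."""
    return sent == received + failed

# A spider that silently "loses" links violates the invariant:
print(integrity_ok(sent=100, received=95, failed=3))  # False: 2 links lost
print(integrity_ok(sent=100, received=97, failed=3))  # True
```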

@atb00ker atb00ker changed the title Lost Links! Lost Links causing failing data integrity! Dec 12, 2017
@thisisayush
Collaborator

In many cases it may happen that parsed != (dropped + stored).
Before continuing the discussion, let's define the terms:

parsed: URLs that were sent to the parse_article function to finally yield.
scraped: URLs that were received by parse_article and were yielded without errors.
dropped: URLs that were dropped due to errors or duplicates. P.S. It may happen that dropped URLs > parsed, because applying the duplicate check in the spider itself prevents the parse_article function from being called in the first place.
stored: URLs stored successfully in the database.
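As a rough illustration of how these four counters relate (assumed names, not the project's actual stats), consider a simulated run where only the pipeline drops duplicates:

```python
from collections import Counter

# Simulate 5 URLs reaching parse_article, 2 of which are duplicates
# dropped by the pipeline; counter names are illustrative only.
stats = Counter()
urls = ["/a", "/b", "/a", "/c", "/b"]
seen = set()

for url in urls:
    stats["parsed"] += 1           # sent to parse_article
    if url in seen:                # duplicate -> dropped in the pipeline
        stats["dropped"] += 1
        continue
    seen.add(url)
    stats["scraped"] += 1          # yielded without errors
    stats["stored"] += 1           # written to the database

print(stats["parsed"] == stats["dropped"] + stats["stored"])  # True
```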

@thisisayush
Collaborator

Now, regarding the issue, what you need to check for error verification is:
parsed = stored (if the duplicate check is applied in the spider too)
parsed = dropped + stored (if only the pipelines handle duplicates)
If either check above fails, there are errors.
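Those two checks could be written as a small helper (a sketch under the definitions above; the flag name is hypothetical):

```python
def no_errors(parsed, dropped, stored, spider_checks_duplicates):
    """Return True when the counters satisfy the applicable check."""
    if spider_checks_duplicates:
        # Duplicates never reach parse_article, so nothing is dropped
        # after parsing: every parsed URL should end up stored.
        return parsed == stored
    # Only the pipeline drops duplicates, after parse_article has run.
    return parsed == dropped + stored

print(no_errors(10, 0, 10, spider_checks_duplicates=True))   # True
print(no_errors(10, 3, 7, spider_checks_duplicates=False))   # True
print(no_errors(10, 3, 6, spider_checks_duplicates=False))   # False: errors
```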

@atb00ker
Contributor Author

atb00ker commented Dec 15, 2017

Before continuing the discussion, let's define the terms

You have misunderstood the terms; refer to: https://github.com/vipulgupta2048/scrape/projects/1#card-6130099

Now, regarding the issue,
What you need to check for error verification is,
parsed = stored (if the duplicate check is applied in the spider too)
parsed = dropped + stored (if only the pipelines handle duplicates)
If above fails, means errors.

Maybe I did not explain the issue properly:
requests sent by scrapy.Request(url=url, callback=self.parse, errback=self.errorRequestHandler) = times the callback runs + times the errback runs
This equation needs to hold true, but it fails, and that is the issue! :)
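A minimal stand-in for this accounting (illustrative names; the real spider would wrap its actual callback and errback):

```python
counts = {"sent": 0, "callback": 0, "errback": 0}

def counted(kind, fn):
    """Wrap a callback/errback so every invocation is counted."""
    def wrapper(*args, **kwargs):
        counts[kind] += 1
        return fn(*args, **kwargs)
    return wrapper

parse = counted("callback", lambda response: response)
on_error = counted("errback", lambda failure: failure)

# Simulate 6 requests sent: 4 get a callback, 1 gets an errback, and 1
# response is lost entirely (neither fires) -- the bug in this issue.
counts["sent"] = 6
for _ in range(4):
    parse("response")
on_error("failure")

print(counts["sent"] == counts["callback"] + counts["errback"])  # False: a request was lost
```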

@thisisayush
Collaborator

Maybe I did not explain the issue properly:
requests sent by scrapy.Request(url=url, callback=self.parse, errback=self.errorRequestHandler) = times the callback runs + times the errback runs
This equation needs to hold true, but it fails, and that is the issue! :)

Oh. Which key is updated when the errback runs? And which key is updated when the pipeline drops an item?

@atb00ker
Contributor Author

atb00ker commented Dec 16, 2017

Maybe I did not explain the issue properly:
requests sent by scrapy.Request(url=url, callback=self.parse, errback=self.errorRequestHandler) = times the callback runs + times the errback runs
This equation needs to hold true, but it fails, and that is the issue! :)

Oh. Which key is updated when the errback runs? And which key is updated when the pipeline drops an item?

None of the keys; please read the code before posting questions! :)
