
Lost Links causing failing data integrity! #51

Open
atb00ker opened this issue Dec 12, 2017 · 5 comments

Comments

@atb00ker
Contributor

The total number of requests sent does not equal received + dropped/failed for some spiders!
This bug needs to be addressed to ensure the integrity of the database!
The following spiders have this problem:
IndiaTv
time (Tech)
firstpost (sports)
firstpost (hindi)
More spiders might have the issue, but these are the ones that have been caught misbehaving so far!
@thisisayush
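The invariant being violated can be sketched in plain Python (a rough sketch; the counter names are illustrative, not the project's actual stat keys):

```python
# Hypothetical per-spider counters; names are illustrative only.
def integrity_ok(sent, received, failed):
    """The invariant this issue describes: every request sent must end
    up either received (callback ran) or dropped/failed (errback ran)."""
    return sent == received + failed

# A spider that silently "loses" links violates the invariant:
print(integrity_ok(sent=100, received=95, failed=3))  # False: 2 links lost
print(integrity_ok(sent=100, received=97, failed=3))  # True
```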

@atb00ker atb00ker changed the title Lost Links! Lost Links causing failing data integrity! Dec 12, 2017
@thisisayush
Collaborator

In many cases it may happen that parsed != (dropped + stored).
Before continuing the discussion, let's define the terms:

parsed: URLs that were sent to the parse_article function to finally yield.
scraped: URLs that were received by parse_article and were yielded without errors.
dropped: URLs that were dropped due to errors or duplicates. P.S. It may happen that dropped URLs > parsed, because applying the duplicate check in the spider itself prevents the parse_article function from being called in the first place.
stored: URLs stored successfully in the database.
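As a rough illustration of how these four counters relate (assumed names, not the project's actual stats), consider a simulated run where only the pipeline drops duplicates:

```python
from collections import Counter

# Simulate 5 URLs reaching parse_article, 2 of which are duplicates
# dropped by the pipeline; counter names are illustrative only.
stats = Counter()
urls = ["/a", "/b", "/a", "/c", "/b"]
seen = set()

for url in urls:
    stats["parsed"] += 1           # sent to parse_article
    if url in seen:                # duplicate -> dropped in the pipeline
        stats["dropped"] += 1
        continue
    seen.add(url)
    stats["scraped"] += 1          # yielded without errors
    stats["stored"] += 1           # written to the database

print(stats["parsed"] == stats["dropped"] + stats["stored"])  # True
```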

@thisisayush
Collaborator

Now, regarding the issue, what you need to check for error verification is:
parsed = stored (if the duplicate check is applied in the spider too)
parsed = dropped + stored (if only the pipelines handle duplicates)
If either check above fails, there are errors.
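Those two checks could be written as a small helper (a sketch under the definitions above; the flag name is hypothetical):

```python
def no_errors(parsed, dropped, stored, spider_checks_duplicates):
    """Return True when the counters satisfy the applicable check."""
    if spider_checks_duplicates:
        # Duplicates never reach parse_article, so nothing is dropped
        # after parsing: every parsed URL should end up stored.
        return parsed == stored
    # Only the pipeline drops duplicates, after parse_article has run.
    return parsed == dropped + stored

print(no_errors(10, 0, 10, spider_checks_duplicates=True))   # True
print(no_errors(10, 3, 7, spider_checks_duplicates=False))   # True
print(no_errors(10, 3, 6, spider_checks_duplicates=False))   # False: errors
```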

@atb00ker
Contributor Author

atb00ker commented Dec 15, 2017

Before continuing the discussion, let's define the terms

You have misunderstood the terms; refer to: https://github.com/vipulgupta2048/scrape/projects/1#card-6130099

Now, regarding the issue,
What you need to check for error verification is,
parsed = stored (if the duplicate check is applied in the spider too)
parsed = dropped + stored (if only the pipelines handle duplicates)
If above fails, means errors.

Maybe I did not explain the issue properly:
requests sent by scrapy.Request(url=url, callback=self.parse, errback=self.errorRequestHandler) = times the callback runs + times the errback runs
This equation needs to hold true, but it fails, and that is the issue! :)
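A minimal stand-in for this accounting (illustrative names; the real spider would wrap its actual callback and errback):

```python
counts = {"sent": 0, "callback": 0, "errback": 0}

def counted(kind, fn):
    """Wrap a callback/errback so every invocation is counted."""
    def wrapper(*args, **kwargs):
        counts[kind] += 1
        return fn(*args, **kwargs)
    return wrapper

parse = counted("callback", lambda response: response)
on_error = counted("errback", lambda failure: failure)

# Simulate 6 requests sent: 4 get a callback, 1 gets an errback, and 1
# response is lost entirely (neither fires) -- the bug in this issue.
counts["sent"] = 6
for _ in range(4):
    parse("response")
on_error("failure")

print(counts["sent"] == counts["callback"] + counts["errback"])  # False: a request was lost
```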

@thisisayush
Collaborator

Maybe I did not explain the issue properly:
requests sent by scrapy.Request(url=url, callback=self.parse, errback=self.errorRequestHandler) = times the callback runs + times the errback runs
This equation needs to hold true, but it fails, and that is the issue! :)

Oh. Which key is updated when the errback runs? And which key is updated when the pipeline drops an item?

@atb00ker
Contributor Author

atb00ker commented Dec 16, 2017

Maybe I did not explain the issue properly:
requests sent by scrapy.Request(url=url, callback=self.parse, errback=self.errorRequestHandler) = times the callback runs + times the errback runs
This equation needs to hold true, but it fails, and that is the issue! :)

Oh. Which key is updated when the errback runs? And which key is updated when the pipeline drops an item?

None of the keys; please read the code before posting questions! :)
