Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ZyteApiProvider could make an unneeded API request #91

Open
kmike opened this issue Jun 12, 2023 · 13 comments · Fixed by #156
Open

ZyteApiProvider could make an unneeded API request #91

kmike opened this issue Jun 12, 2023 · 13 comments · Fixed by #156
Assignees

Comments

@kmike
Copy link
Member

kmike commented Jun 12, 2023

In the example below ZyteApiProvide makes 2 API requests instead of 1:

@handle_urls("example.com")
@attrs.define
class MyPage(ItemPage[MyItem]):
    html: BrowserHtml
    # ...

class MySpider(scrapy.Spider):
    # ...
    def parse(self, response: DummyResponse, product: Product, my_item: MyItem):
        # ...
@Gallaecio
Copy link
Contributor

Findings so far:

  • Remove ItemProvider’s Response dependency scrapinghub/scrapy-poet#151 won’t fix this.
  • This issue seems to be caused by Zyte API provided classes being resolved at different stages. If you request both product and browser_response directly in the callback, a single request is sent. Otherwise, first Product is injected, then MyItem resolves to MyPage, then BrowserHtml is injected. I am not sure yet how to best solve that.

@Gallaecio Gallaecio self-assigned this Jun 15, 2023
@wRAR
Copy link
Contributor

wRAR commented Jun 15, 2023

Yeah, the problem AFAIK is that ItemProvider calls build_instances itself. scrapinghub/scrapy-poet#151 is actually about a third request done in this or similar use case.

@wRAR
Copy link
Contributor

wRAR commented Jun 15, 2023

We also thought the solution may involve the caching feature in ItemProvider but didn't investigate further.

@Gallaecio
Copy link
Contributor

Indeed.

@Gallaecio
Copy link
Contributor

New finding: Switching MyItem to MyPage works, even if there is still some level of indirection. Could explain why scrapinghub/scrapy-poet#153 works.

Gallaecio added a commit to scrapinghub/scrapy-poet that referenced this issue Jun 20, 2023
@BurnzZ
Copy link
Member

BurnzZ commented Oct 3, 2023

I looked into this further and it still occurs without any Page Objects involved.

The sent Zyte API requests were determined by setting ZYTE_API_LOG_REQUESTS=True.

Given the following spider:

class BooksSpider(scrapy.Spider):
    name = "books"

    def start_requests(self):
        yield scrapy.Request(
            url="https://books.toscrape.com",
            callback=self.parse_nav,
            meta={"zyte_api": {"browserHtml": True}},
        )

Case 1

✅ The following callback set up is correct since it has only 1 request:

# {"productNavigation": true, "url": "https://books.toscrape.com"}
def parse_nav(self, response: DummyResponse, navigation: ProductNavigation):
    ...

Case 2

❌ However, the following has 2 separate requests:

# {"browserHtml": true, "url": "https://books.toscrape.com"}
# {"productNavigation": true, "url": "https://books.toscrape.com"}
def parse_nav(self, response, navigation: ProductNavigation):
    ...

This case should not happen since browserHtml and productNavigation can both be present in the same Zyte API Request.

Case 3

However, if we introduce a Page Object to the same spider:

@handle_urls("")
@attrs.define
class ProductNavigationPage(ItemPage[ProductNavigation]):
    response: BrowserResponse
    nav_item: ProductNavigation

    @field
    def url(self):
        return self.nav_item.url

    @field
    def categoryName(self) -> str:
        return f"(modified) {self.nav_item.categoryName}"

❌ Then, the following callback set up would have 3 separate Zyte API Requests:

# {"browserHtml": true, "url": "https://books.toscrape.com"}
# {"productNavigation": true, "url": "https://books.toscrape.com"}
# {"browserHtml": true, "url": "https://books.toscrape.com"}
def parse_nav(self, response: DummyResponse, navigation: ProductNavigation):
    ...

Note that the same series of 3 separate requests still occurs on:

def parse_nav(self, response, navigation: ProductNavigation):
    ...

@Gallaecio
Copy link
Contributor

I wonder if some of the unexpected requests are related to #135.

@BurnzZ
Copy link
Member

BurnzZ commented Jan 9, 2024

Re-opening this since Case 2 is still occurring. Case 3 has been fixed though.

@wRAR
Copy link
Contributor

wRAR commented Jan 11, 2024

@BurnzZ so do you think after your latest analysis that case 2 still happens or not?

@BurnzZ
Copy link
Member

BurnzZ commented Jan 12, 2024

@wRAR I can still reproduce Case 2. 👍

@wRAR
Copy link
Contributor

wRAR commented Jan 12, 2024

OK, so the difference between this use case and ones that we already test is having "browserHtml": True in meta. Currently the provider doesn't check this at all. It looks like it should? cc: @kmike

@wRAR
Copy link
Contributor

wRAR commented Jan 12, 2024

OTOH I'm not sure if even we handle this in the provider the request itself won't be sent?

@kmike
Copy link
Member Author

kmike commented Jan 12, 2024

@wRAR Let's try to focus on how Case 2 (or any of these cases) affect https://github.com/zytedata/zyte-spider-templates, not on the case itself. The priority of supporting meta is not clear to me now; it may not be necessary in the end, or it could be.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants