Added prune xpath to spider #684

felipehertzer · 2024-08-29T06:19:28Z

This pull request adds the prune_xpath feature to the Spider function. As part of this update, I've removed the process_response to consolidate the pruning logic into a single function.

Please let me know if you have any feedback or suggestions!

Thanks.

codecov · 2024-08-29T06:24:27Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 98.55%. Comparing base (b3aea4a) to head (d09fda2).
Report is 2 commits behind head on master.

Additional details and impacted files

@@           Coverage Diff           @@
##           master     #684   +/-   ##
=======================================
  Coverage   98.54%   98.55%           
=======================================
  Files          21       21           
  Lines        3517     3529   +12     
=======================================
+ Hits         3466     3478   +12     
  Misses         51       51

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

adbar · 2024-08-29T09:26:16Z

Hi @felipehertzer, nice to see that spider is adaptable and suits your needs.

I would like to see prune_xpath as an additional parameter in the CrawlParameters class.

That way you could just pass it along with the other parameters and make the following changes:

I would prefer to keep process_response(). You can simply add the code you contributed to this function.
crawl_page() does not modify the prune_xpath setting, so you don't need to add it to the function's arguments.

Does that work for you?

felipehertzer · 2024-08-30T00:18:54Z

Hey @adbar,

Thanks for the feedback—I understand how the parameters are set up now and have made the requested changes.

I noticed that the code below directly calls process_links instead of using the process_response function. How would you prefer we fix this? Should we move everything into probe_alternative_homepage, or would you rather replace process_links with process_response as a quick workaround? (process_response, receives a Response instead of htmlstring)

htmlstring, homepage, new_base_url = probe_alternative_homepage(url)
if htmlstring and homepage and new_base_url:
    # register potentially new homepage
    URL_STORE.add_urls([homepage])
    # extract links on homepage
    process_links(htmlstring, params, url=url)

adbar · 2024-08-30T12:27:58Z

Hi @felipehertzer, the tests don't pass for all versions, could you check if you can fix them?

What happened to the line URL_STORE.add_urls([response.url], visited=True) (in your first commit), maybe it was a good idea?

The code you mention is not an issue because probe_alternative_homepage() returns a string, or am I getting it wrong?

felipehertzer · 2024-08-30T12:58:39Z

Hey @adbar,

The tests are currently pending a solution for the crawl_page function. Specifically, when initial is set to true, the process_response function does not execute. I have not found an optimal way to run it effectively when initial is either true or false, as the true condition returns an htmlstring, while the false condition returns a response

adbar · 2024-08-30T13:10:19Z

I'm not sure how to untangle that right now, if you have an idea feel free to test it.

adbar · 2024-08-30T13:16:42Z

@felipehertzer I think the best way would be to isolate this code in a function and/or to call it in process_links():

    if htmlstring and params.prune_xpath is not None:
        if isinstance(params.prune_xpath, str):
            params.prune_xpath = [params.prune_xpath]
        tree = load_html(htmlstring)
        tree = prune_unwanted_nodes(tree, [XPath(x) for x in params.prune_xpath])
        htmlstring = tostring(tree).decode()

felipehertzer · 2024-08-30T13:22:10Z

Yes, you are right, it does make way more sense, I will change it

felipehertzer · 2024-08-30T13:31:50Z

Alright, I just updated it.

Added prune xpath to spider

21728d0

updated prune xpath in favor to params

986447d

Moved prune code to process_links

d09fda2

adbar merged commit 7836a76 into adbar:master Aug 30, 2024
15 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Added prune xpath to spider #684

Added prune xpath to spider #684

felipehertzer commented Aug 29, 2024

codecov bot commented Aug 29, 2024 •

edited

Loading

adbar commented Aug 29, 2024

felipehertzer commented Aug 30, 2024

adbar commented Aug 30, 2024

felipehertzer commented Aug 30, 2024

adbar commented Aug 30, 2024

adbar commented Aug 30, 2024

felipehertzer commented Aug 30, 2024

felipehertzer commented Aug 30, 2024

Added prune xpath to spider #684

Added prune xpath to spider #684

Conversation

felipehertzer commented Aug 29, 2024

codecov bot commented Aug 29, 2024 • edited Loading

Codecov Report

adbar commented Aug 29, 2024

felipehertzer commented Aug 30, 2024

adbar commented Aug 30, 2024

felipehertzer commented Aug 30, 2024

adbar commented Aug 30, 2024

adbar commented Aug 30, 2024

felipehertzer commented Aug 30, 2024

felipehertzer commented Aug 30, 2024

codecov bot commented Aug 29, 2024 •

edited

Loading