Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added prune xpath to spider #684

Merged
merged 3 commits into from
Aug 30, 2024
Merged

Added prune xpath to spider #684

merged 3 commits into from
Aug 30, 2024

Conversation

felipehertzer
Copy link
Contributor

Hey @adbar,

This pull request adds the prune_xpath feature to the Spider function. As part of this update, I've removed the process_response to consolidate the pruning logic into a single function.

Please let me know if you have any feedback or suggestions!

Thanks.

Copy link

codecov bot commented Aug 29, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 98.55%. Comparing base (b3aea4a) to head (d09fda2).
Report is 2 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master     #684   +/-   ##
=======================================
  Coverage   98.54%   98.55%           
=======================================
  Files          21       21           
  Lines        3517     3529   +12     
=======================================
+ Hits         3466     3478   +12     
  Misses         51       51           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@adbar
Copy link
Owner

adbar commented Aug 29, 2024

Hi @felipehertzer, nice to see that spider is adaptable and suits your needs.

I would like to see prune_xpath as an additional parameter in the CrawlParameters class.

That way you could just pass it along with the other parameters and make the following changes:

  • I would prefer to keep process_response(). You can simply add the code you contributed to this function.
  • crawl_page() does not modify the prune_xpath setting, so you don't need to add it to the function's arguments.

Does that work for you?

@felipehertzer
Copy link
Contributor Author

Hey @adbar,

Thanks for the feedback—I understand how the parameters are set up now and have made the requested changes.

I noticed that the code below directly calls process_links instead of using the process_response function. How would you prefer we fix this? Should we move everything into probe_alternative_homepage, or would you rather replace process_links with process_response as a quick workaround? (process_response, receives a Response instead of htmlstring)

htmlstring, homepage, new_base_url = probe_alternative_homepage(url)
if htmlstring and homepage and new_base_url:
    # register potentially new homepage
    URL_STORE.add_urls([homepage])
    # extract links on homepage
    process_links(htmlstring, params, url=url)

@adbar
Copy link
Owner

adbar commented Aug 30, 2024

Hi @felipehertzer, the tests don't pass for all versions, could you check if you can fix them?

What happened to the line URL_STORE.add_urls([response.url], visited=True) (in your first commit), maybe it was a good idea?

The code you mention is not an issue because probe_alternative_homepage() returns a string, or am I getting it wrong?

@felipehertzer
Copy link
Contributor Author

Hey @adbar,

The tests are currently pending a solution for the crawl_page function. Specifically, when initial is set to true, the process_response function does not execute. I have not found an optimal way to run it effectively when initial is either true or false, as the true condition returns an htmlstring, while the false condition returns a response

@adbar
Copy link
Owner

adbar commented Aug 30, 2024

I'm not sure how to untangle that right now, if you have an idea feel free to test it.

@adbar
Copy link
Owner

adbar commented Aug 30, 2024

@felipehertzer I think the best way would be to isolate this code in a function and/or to call it in process_links():

    if htmlstring and params.prune_xpath is not None:
        if isinstance(params.prune_xpath, str):
            params.prune_xpath = [params.prune_xpath]
        tree = load_html(htmlstring)
        tree = prune_unwanted_nodes(tree, [XPath(x) for x in params.prune_xpath])
        htmlstring = tostring(tree).decode()

@felipehertzer
Copy link
Contributor Author

Yes, you are right, it does make way more sense, I will change it

@felipehertzer
Copy link
Contributor Author

Alright, I just updated it.

@adbar adbar merged commit 7836a76 into adbar:master Aug 30, 2024
15 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants