
Handle pages where article is split into multiple sibling nodes #163

Closed
wants to merge 5 commits

Conversation

naftalibeder commented Jan 20, 2022

This fixes #85 (and #159).

It involved a bit of a refactor of the `extract_content` function, but the basic idea is that it now looks through all of the children in the subtree returned from `tree.xpath(expr)`, instead of stopping at the first child as before. Beyond that, it pulls out the logic that checks whether the `BODY_XPATH` expression matched in the current loop iteration has found a useful subtree, to make the function a little more readable, and only performs the final cleanup and look-elsewhere logic at the very end.

So essentially, on finding a subtree whose first node is valid, we now proceed to consider all of the remaining nodes in that subtree as well (see the sketch below).
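
To illustrate the idea, here is a minimal sketch (not the actual `extract_content` code; `is_valid` is a hypothetical placeholder for the usefulness check described above):

```python
from lxml import html

def collect_valid_nodes(tree, expr, is_valid):
    """Consider all nodes returned for an XPath expression
    instead of examining only the first one."""
    subtree = tree.xpath(expr)  # list of matching sibling nodes
    if not subtree or not is_valid(subtree[0]):
        return []
    # once the first node passes, keep checking the remaining
    # siblings too, so a split article isn't cut off after node 0
    return [node for node in subtree if is_valid(node)]

# hypothetical usage on an article split across sibling <p> nodes:
doc = html.fromstring("<div><p>part one</p><p>part two</p></div>")
parts = collect_valid_nodes(doc, "//p", lambda n: bool(n.text_content().strip()))
```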

This seems to work great, although I haven't run it through the automated tests. (I had trouble running the URL tests.)

Let me know what you think. Happy to talk through anything, and if/when this seems good to you, I'll clean it up (print statements, code style, etc.).

Thanks!

lgtm-com bot commented Jan 20, 2022

This pull request introduces 2 alerts when merging 0a238f5 into 6fdf22a - view on LGTM.com

new alerts:

  • 2 for Unused import

adbar (Owner) commented Jan 20, 2022

Hi @naftalibeder, thanks for your effort. Here are my comments.

Issues
The tests on real-world data don't pass, and I'm afraid there could be a series of errors in the process: I got an `IndexError: list index out of range` at `core.py` line 540 when I tried it on my benchmark data.
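
For context, this is the usual shape of that failure when an XPath query comes back empty (an illustrative snippet, not the actual code at line 540):

```python
from lxml import html

tree = html.fromstring("<div><p>no match here</p></div>")
candidates = tree.xpath("//article")  # empty list: nothing matched
# candidates[0] would raise IndexError: list index out of range
first = candidates[0] if candidates else None  # a guarded lookup avoids the crash
print(first)  # None
```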

It seems to me you have optimized the code for the issues above, but in doing so you have made it more brittle and less accurate overall.

Possible goal
If (and only if) it improves recall on the benchmark data without losing too much precision, it would be a case for the `favor_recall=True` option in `extract()`. Users can then choose whether they want more precision or more text.
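
For reference, the option in question is used like this (assuming a reasonably recent trafilatura version; the URL is a placeholder):

```python
import trafilatura

downloaded = trafilatura.fetch_url("https://example.org/article")
# favor_recall asks the extractor to keep more text at the cost
# of some precision; favor_precision does the opposite
text = trafilatura.extract(downloaded, favor_recall=True)
```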

Tests
Here is how you can run the tests if you wish to correct the errors and further improve the code:

  • Run `pytest` from trafilatura's directory, or `pytest realworld_tests.py`, or simply `python3 realworld_tests.py` (since this is the part that's failing)
  • Check how it performs on the benchmark in `tests/eval/` by running `tests/comparison.py` (see `tests/README.rst`)

That being said, I'm not convinced at this point. Although your approach does have its merits, overall performance requires a subtle equilibrium.

naftalibeder (Author) commented

@adbar Thank you for the thorough response! I have no problem with the burden of proof being on me for this change. :)

I ran the tests on the master branch (pytest tests/realworld_tests.py), and unfortunately I get a failure there:

```
        result = load_mock_page('https://scilogs.spektrum.de/engelbart-galaxis/die-ablehnung-der-gendersprache/', xmloutput)
>       assert 'Zweitens wird der Genderstern' in result and 'alldem leider – nichts.' in result # and 'Beitragsbild' not in result
E       AssertionError: assert ('Zweitens wird der Genderstern' in '! D O C T Y P E h t m l >')

tests/realworld_tests.py:190: AssertionError
------------------------------ Captured log call -------------------------------
ERROR    trafilatura.core:core.py:625 not enough text https://scilogs.spektrum.de/engelbart-galaxis/die-ablehnung-der-gendersprache/
______________________________ test_extract[True] ______________________________
```

Does that not happen for you? In order to fix the regressions in my own branch, I'd like to have a known-good baseline to test against.

adbar (Owner) commented Jan 21, 2022

@naftalibeder the test fails because the code is far off the mark in terms of precision or recall. These text segments are critical parts (usually just before or after the main text).

The tests in the main branch all pass.

naftalibeder (Author) commented

My apologies if I wasn't clear: I'm saying that the master branch fails the test suite as well (for me, at least), from a clean clone and install. I just opened a new issue about this (#166).

It's not directly related to this PR, but I'd like to try to solve the regressions I introduced here, and in order to do so I need to know that a successful fix will pass the tests.

adbar (Owner) commented Mar 17, 2022

Hi @naftalibeder, sadly there has been no movement on #166, and the main branch has evolved in the meantime.

Are you still willing to try things out? I could see whether I can run efficiency tests on your PR. If it doesn't lead to a regression on my data, I could integrate some of the code.

naftalibeder (Author) commented

I’m certainly willing to do more testing! I wasn’t sure what to do without being able to test on my computer, so I was kind of in a holding pattern. Let me know how I can help :)

adbar (Owner) commented Mar 21, 2022

We need to decide what to do with the ideas behind this pull request: does your approach lead to improvements in extraction accuracy? I doubt it, but feel free to prove it by using the benchmark.

naftalibeder (Author) commented

Sorry for not replying sooner. Since I’m unable to test easily, and the main code has diverged so much, I think it’s safe to close this. I’ll reopen if things change.
