
Handle pages where article is split into multiple sibling nodes #163

Closed
wants to merge 5 commits

Conversation

naftalibeder commented Jan 20, 2022

This fixes #85 (and #159).

It involved a bit of a refactor of the `extract_content` function, but the basic idea is that it now looks through all of the children in the subtree returned from `tree.xpath(expr)`, instead of stopping at the first child as before. Beyond that, it pulls out the logic that checks whether the `BODY_XPATH` expression matched in the current loop iteration has found a useful subtree, to make the function a little more readable, and only performs the final cleanup and look-elsewhere logic at the very end.

So essentially, on finding a subtree whose first node is valid, we now proceed to consider all of the remaining nodes in that subtree as well (see the sketch below).
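
To illustrate the idea, here is a minimal sketch (not the actual `extract_content` code; `is_valid` is a hypothetical placeholder for the usefulness check described above):

```python
from lxml import html

def collect_valid_nodes(tree, expr, is_valid):
    """Consider all nodes returned for an XPath expression
    instead of examining only the first one."""
    subtree = tree.xpath(expr)  # list of matching sibling nodes
    if not subtree or not is_valid(subtree[0]):
        return []
    # once the first node passes, keep checking the remaining
    # siblings too, so a split article isn't cut off after node 0
    return [node for node in subtree if is_valid(node)]

# hypothetical usage on an article split across sibling <p> nodes:
doc = html.fromstring("<div><p>part one</p><p>part two</p></div>")
parts = collect_valid_nodes(doc, "//p", lambda n: bool(n.text_content().strip()))
```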

This seems to work great, although I haven't run it through the automated tests. (I had trouble running the URL tests.)

Let me know what you think. Happy to talk through anything, and if/when this seems good to you, I'll clean it up (print statements, code style, etc.).

Thanks!

lgtm-com bot commented Jan 20, 2022

This pull request introduces 2 alerts when merging 0a238f5 into 6fdf22a - view on LGTM.com

new alerts:

  • 2 for Unused import

adbar (Owner) commented Jan 20, 2022

Hi @naftalibeder, thanks for your effort. Here are my comments.

Issues
The tests on real-world data don't pass, and I'm afraid there could be a series of errors in the process: I got an `IndexError: list index out of range` at `core.py` line 540 when I tried it on my benchmark data.
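
For context, this is the usual shape of that failure when an XPath query comes back empty (an illustrative snippet, not the actual code at line 540):

```python
from lxml import html

tree = html.fromstring("<div><p>no match here</p></div>")
candidates = tree.xpath("//article")  # empty list: nothing matched
# candidates[0] would raise IndexError: list index out of range
first = candidates[0] if candidates else None  # a guarded lookup avoids the crash
print(first)  # None
```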

It seems to me you have optimized the code for the issues above, but in doing so you have made it more brittle and less accurate overall.

Possible goal
If (and only if) it improves recall on the benchmark data without losing too much precision, it would be a case for the `favor_recall=True` option in `extract()`. Users can then choose whether they want more precision or more text.
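
For reference, the option in question is used like this (assuming a reasonably recent trafilatura version; the URL is a placeholder):

```python
import trafilatura

downloaded = trafilatura.fetch_url("https://example.org/article")
# favor_recall asks the extractor to keep more text at the cost
# of some precision; favor_precision does the opposite
text = trafilatura.extract(downloaded, favor_recall=True)
```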

Tests
Here is how you can run the tests if you wish to correct the errors and further improve the code:

  • Run `pytest` from trafilatura's directory, or `pytest realworld_tests.py`, or simply `python3 realworld_tests.py` (since this is the part that's failing)
  • Check how it performs on the benchmark in `tests/eval/` by running `tests/comparison.py` (see `tests/README.rst`)

That being said, I'm not convinced at this point. Although your approach does have its merits, overall performance requires a subtle equilibrium.

naftalibeder (Author) commented

@adbar Thank you for the thorough response! I have no problem with the burden of proof being on me for this change. :)

I ran the tests on the master branch (pytest tests/realworld_tests.py), and unfortunately I get a failure there:

```
        result = load_mock_page('https://scilogs.spektrum.de/engelbart-galaxis/die-ablehnung-der-gendersprache/', xmloutput)
>       assert 'Zweitens wird der Genderstern' in result and 'alldem leider – nichts.' in result # and 'Beitragsbild' not in result
E       AssertionError: assert ('Zweitens wird der Genderstern' in '! D O C T Y P E h t m l >')

tests/realworld_tests.py:190: AssertionError
------------------------------ Captured log call -------------------------------
ERROR    trafilatura.core:core.py:625 not enough text https://scilogs.spektrum.de/engelbart-galaxis/die-ablehnung-der-gendersprache/
______________________________ test_extract[True] ______________________________
```

Does that not happen for you? In order to fix the regressions in my own branch, I'd like to have a known-good baseline to test against.

adbar (Owner) commented Jan 21, 2022

@naftalibeder the test fails because the code is far off the mark in terms of precision or recall. These text segments are critical parts (usually just before or after the main text).

The tests in the main branch all pass.

naftalibeder (Author) commented

My apologies if I wasn't clear: I'm saying that the master branch fails the test suite as well (for me, at least), from a clean clone and install. I just opened a new issue about this (#166).

It's not directly related to this PR, but I'd like to try to solve the regressions I introduced here, and in order to do so I need to know that a successful fix will pass the tests.

adbar (Owner) commented Mar 17, 2022

Hi @naftalibeder, sadly there has been no movement on #166, and the main branch has evolved in the meantime.

Are you still willing to try things out? I could see whether I can run efficiency tests on your PR. If it doesn't lead to a regression on my data, I could integrate some of the code.

naftalibeder (Author) commented

I’m certainly willing to do more testing! I wasn’t sure what to do without being able to test on my computer, so I was kind of in a holding pattern. Let me know how I can help :)

adbar (Owner) commented Mar 21, 2022

We need to decide what to do with the ideas behind this pull request: does your approach lead to improvements in extraction accuracy? I doubt it, but feel free to prove it by using the benchmark.

naftalibeder (Author) commented

Sorry for not replying sooner. Since I’m unable to test easily, and the main code has diverged so much, I think it’s safe to close this. I’ll reopen if things change.
