-
-
Notifications
You must be signed in to change notification settings - Fork 254
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Handle pages where article is split into multiple sibling nodes #163
Conversation
This pull request introduces 2 alerts when merging 0a238f5 into 6fdf22a - view on LGTM.com new alerts:
|
Hi @naftalibeder, thanks for your effort, here are my comments. Issues It seems to me you have optimized the code for the issues above but by doing so you have made it more brittle and less accurate overall. Possible goal Tests
That being said I'm not convinced at this point. Although your approach does have its merits, overall performance requires a subtle equilibrium. |
@adbar Thank you for the thorough response! I have no problem with the burden of proof being on me for this change. :) I ran the tests on the
Does that not happen for you? In order to attempt to fix the regressions in my own branch, I'd like to have a success baseline to test against. |
@naftalibeder the test goes wrong because the code is far off the grid in terms of precision or recall. These text segments are critical parts (usually just before or after the main text). The tests in the main branch all pass. |
My apologies if I wasn't clear - I'm saying that the master branch fails the test suite as well (for me, at least), from a clean clone and install. I just opened up a new issue about this (#166). It's not directly related to this PR, but I'd like to try to solve the regressions I introduced here, and in order to do so I need to know that a successful fix will pass the tests. |
Hi @naftalibeder, sadly nothing appears to be moving about #166, and now the main branch has evolved. Are you still ready to try things out? I could see if I can run efficiency tests in your PR. If it doesn't lead to a regression on my data I could integrate some of the code. |
I’m certainly willing to do more testing! I wasn’t sure what to do without being able to test on my computer, so I was kind of in a holding pattern. Let me know how I can help :) |
We need to decide what to do you the ideas behind this pull request: Does your approach lead to improvements in extraction accuracy? I doubt it but feel free to prove it by using the benchmark. |
Sorry for not replying sooner. Since I’m unable to test easily, and the main code has diverged so much, I think it’s safe to close this. I’ll reopen if things change. |
This fixes #85 (and #159).
It involved a bit of a refactor of the
extract_content
function, but the basic idea is that it looks through all of the children in the subtree returned fromtree.xpath(expr)
, not just stopping at the first child like before. Beyond that, it pulls out the logic that checks whether theBODY_XPATH
expression matched in the current loop iteration has found a useful subtree, to make it a little more readable, and only performs the final cleanup and look-elsewhere logic at the very end.So essentially, on finding a subtree whose first node is valid, we proceeded to consider all of the remaining nodes in that subtree.
This seems to work great, although I haven't run it through the automated tests. (I had trouble running the url tests.)
Let me know what you think. Happy to talk through anything, and if/when this seems good to you, I'll clean it up (print statements, code style, etc.).
Thanks!