Include all nodes with text #515

AndyTheFactory · 2023-10-24T19:22:01Z

Issue by jecarr
Mon May 10 05:55:59 2021
Originally opened as codelucas/newspaper#885

Closes #363

To tackle missing article paragraphs, this suggestion considers any node with text to be included in the final text attribute of an article instance
Test cases pass with a warning where extra text has been found (i.e. equal-text asserts fail) but main article text has been found within parsed article text

Edit - found more missing text when using this url. This is because there are < li >s not being gathered. Plus the < table > at the bottom of the page didn't translate to text well. As these are further fixes (that may break how this PR fixes for other urls), my fixes for this are in jecarr#1

jecarr included the following code: https://github.com/codelucas/newspaper/pull/885/commits

AndyTheFactory · 2023-10-24T19:22:02Z

Comment by shawei3000
Wed May 12 04:52:15 2021

i followed the above steps, and updated newspaper/* files in my specific anaconda env, and still experience significant missing paragraphs for this url, ( https://www.stltoday.com/news/local/crime-and-courts/belleville-man-gets-20-years-for-ponzi-scheme/article_194000a6-1a13-5841-b53a-44305142bd23.html ), maybe this is different scenario?

AndyTheFactory · 2023-10-24T19:22:04Z

Comment by jecarr
Thu May 13 03:10:45 2021

Hey @shawei3000 - thanks for the feedback. You are right, it was a new test case for me. I was used to seeing the missing text after the text newspaper chose. That article gave me the first half of the article being the missing text (not the latter). So my first fix produced text where the article's order messed up.

I've updated the PR but the html attribute needs updating. The text attribute should have most of that URL's article text in order (the first sentence appears to not be picking up, I'll look into that too).

Thanks again for the heads up! @codelucas, feel free to highlight if my approach needs refining.

AndyTheFactory · 2024-03-18T09:16:51Z

nodes_with_text is modified in v 0.9.2, additional tags are included (such as div, etc)

AndyTheFactory added PR-verify Has a PR, must be checked sites not working labels Oct 30, 2023

AndyTheFactory modified the milestone: Release 0.9.1 Oct 30, 2023

AndyTheFactory closed this as completed Mar 18, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Include all nodes with text #515

Include all nodes with text #515

AndyTheFactory commented Oct 24, 2023

AndyTheFactory commented Oct 24, 2023

AndyTheFactory commented Oct 24, 2023

AndyTheFactory commented Mar 18, 2024

Include all nodes with text #515

Include all nodes with text #515

Comments

AndyTheFactory commented Oct 24, 2023

AndyTheFactory commented Oct 24, 2023

AndyTheFactory commented Oct 24, 2023

AndyTheFactory commented Mar 18, 2024