Releases: AndyTheFactory/newspaper4k

Minor bug fix

18 Mar 21:56

Fixes for dependencies on Python >= 3.11. The NumPy version was incompatible with Colab; this is now fixed.

Also, there was a typo in the Nepali language code - it was "np" instead of "ne". This is now fixed.

Version 0.9.3 Article Parsing improvements and huge jump in multi language support (support for over 40 languages added)

18 Mar 00:10

Massive improvements in multi-language capabilities. Added over 40 new languages and completely reworked the language module. Much easier to add new languages now. Additionally, added support for Google News as a source. You can now search and parse news based on keywords, topic, location or website.
Integrated cloudscraper as an optional dependency. If installed, newspaper4k will use cloudscraper as a layer over requests. Cloudscraper tries to bypass Cloudflare protection.
We now use two evaluation datasets: the one from scrapinghub and one created by us from the top 200 most popular websites. This will help keep track of future improvements and give a clear view of the impact of the changes.

We see a steady improvement from version 0.9.0 up to 0.9.3. The evaluation results are available in the documentation. The evaluation dataset is also available in the following repository: Article Extraction Dataset

  • You can now install languages that need special packages as optional dependencies
  • Google News is now fully integrated into the scraping process.
  • You can now pickle sources and articles, which makes it easier to save and recover scraping state
  • Bumped the minimum supported Python version to 3.8
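
As a rough illustration of the save-and-recover workflow that pickling enables, here is a minimal sketch using only the standard library. The helpers `save_state` and `load_state` are hypothetical names and not part of newspaper4k, and a plain dict stands in for what would be a source or article object in real use.

```python
import pickle
import tempfile
from pathlib import Path


def save_state(obj, path):
    """Write any picklable object to disk (hypothetical helper)."""
    # The with-block guarantees the file is closed even if pickling fails.
    with open(path, "wb") as fh:
        pickle.dump(obj, fh)


def load_state(path):
    """Read back an object written by save_state (hypothetical helper)."""
    with open(path, "rb") as fh:
        return pickle.load(fh)


# Placeholder data: in real use this would be a source or article object.
state = {"url": "https://www.test.com", "title": "Example title"}

with tempfile.TemporaryDirectory() as tmp:
    checkpoint = Path(tmp) / "scrape.pkl"
    save_state(state, checkpoint)
    restored = load_state(checkpoint)
```

Only unpickle files you created yourself, since loading a pickle can execute arbitrary code.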

Version 0.9.2 some major changes in document parsing

14 Jan 11:36
97fdcb0
  • You can now use the module as a command line interface (CLI). Usage: python -m newspaper --url https://www.test.com. More information in the documentation.
  • I have added an evaluation script against a dataset from scrapinghub. This will help keep track of future improvements.
  • Better handling of multithreaded requests. The previous version had a bug that could lead to a deadlock. I implemented ThreadPoolExecutor from the concurrent.futures module, which is more stable. The previous news_pool was replaced with a fetch_news() function.
  • Caching is now much more flexible. You can disable it completely or for one request.
  • You can now use the newspaper.article() function for convenience. It will create, download and parse an article in one step. It takes all the parameters of the Article class.
  • Sites protected by Cloudflare are better detected and raise an exception. The reason will be in the exception message.
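
For readers unfamiliar with the pattern, the following is a minimal sketch of fetching several URLs through ThreadPoolExecutor from concurrent.futures. It is not the code behind fetch_news(); `fetch_one` and `fetch_all` are hypothetical names, and `fetch_one` is a stand-in that does no network access.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def fetch_one(url):
    """Hypothetical stand-in for downloading and parsing one article."""
    # A real worker would perform an HTTP request here.
    return {"url": url, "length": len(url)}


def fetch_all(urls, max_workers=4):
    """Run fetch_one over urls in a thread pool, keeping results and errors apart."""
    results, errors = {}, {}
    # Leaving the with-block waits for all workers and releases the threads.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(fetch_one, url): url for url in urls}
        for future in as_completed(futures):
            url = futures[future]
            try:
                results[url] = future.result()
            except Exception as exc:
                # One failing URL does not stop the remaining downloads.
                errors[url] = exc
    return results, errors


urls = ["https://example.com/a", "https://example.com/b"]
results, errors = fetch_all(urls)
```

Because the executor owns the worker threads and joins them on exit, there is no hand-written queue or join logic to get wrong.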

Version 0.9.1 code refactoring and bugfixes

08 Nov 13:40

New features:

  • version bump(f7107be)
  • tests: Add test case for(592f6f6)
  • parse: added possibility to follow "read more" links in articles(0720de1)
  • Allow to pass any requests parameter to the Article constructor. You can now pass verify=False in order to ignore certificate errors (issue #462)(5ff5d27)
  • parse: extended data parsing of json-ld metadata (issue #518)(fc413af)
  • tests: added script to create test cases(9df8c16)
  • parse: added tag for date detection issue #835(41152eb)
  • parse: added og:regDate to known date tags(dc35e29)
  • tests: convert unittest to pytest(45c4e8d)
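
To give an idea of what reading JSON-LD metadata involves, here is a minimal standard-library sketch that pulls `datePublished` out of `application/ld+json` script blocks, including nodes nested under an `@graph` wrapper. This is an illustration only, not the parser used in newspaper4k; `iter_json_ld_nodes` and `find_published_date` are hypothetical names and the sample HTML is made up.

```python
import json
import re
from datetime import datetime

# Matches the body of <script type="application/ld+json"> blocks.
JSON_LD_RE = re.compile(
    r'<script[^>]+type=["\']application/ld\+json["\'][^>]*>(.*?)</script>',
    re.DOTALL | re.IGNORECASE,
)


def iter_json_ld_nodes(html):
    """Yield every JSON-LD object in the page, descending into lists and @graph."""
    for raw in JSON_LD_RE.findall(html):
        try:
            data = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip malformed blocks instead of failing the whole page
        stack = [data]
        while stack:
            node = stack.pop()
            if isinstance(node, list):
                stack.extend(node)
            elif isinstance(node, dict):
                yield node
                graph = node.get("@graph")
                if graph is not None:
                    stack.append(graph)


def find_published_date(html):
    """Return the first parseable datePublished value, or None."""
    for node in iter_json_ld_nodes(html):
        value = node.get("datePublished")
        if isinstance(value, str):
            try:
                return datetime.fromisoformat(value)
            except ValueError:
                continue
    return None


sample = """<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org",
 "@graph": [{"@type": "WebPage"},
            {"@type": "Article", "datePublished": "2020-01-02T03:04:05+00:00"}]}
</script></head></html>"""

published = find_published_date(sample)
```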

Bugs fixed:

  • typing annotation for set python 3.8(895343f)
  • parse: improve meta tag content for articles and pubdate(37bb0b7)
  • parse: 📝 improved author detection. improved video links detection(23c547f)
  • parse: ensured that clean_doc/doc to clean_top_node are on the same DOM. And doc/top_node on the same DOM.(6874d05)
  • small changes, replace os.path with pathlib(5598d95)
  • parse: use one file of stopwords for english, the one in the standard folder #503(6bdf813)
  • parse: better author parsing based on issue #493(f93a9c2)
  • parse: make the url date parsing stricter. Issue #514(0cc1e83)
  • parse: replace \n with space in sentence split (Issue #506)(3ccb87c)
  • parsing: catch url errors resulting from parsed image links(9140a04)
  • correct python versions in pipeline(7e671df)
  • gitignore update(8855f00)
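
As one possible reading of "stricter" URL date parsing, the sketch below only accepts a date when the URL path contains a full year/month/day sequence that is also a valid calendar date. The regular expression and the `date_from_url` helper are assumptions made for illustration; the actual rules in newspaper4k may differ.

```python
import re
from datetime import date

# Assumed pattern: /YYYY/MM/DD or /YYYY-MM-DD, ending at a slash or end of URL.
STRICT_DATE_RE = re.compile(r"/(\d{4})[/-](\d{1,2})[/-](\d{1,2})(?:/|$)")


def date_from_url(url):
    """Return a date found in the URL path, or None if nothing valid is present."""
    match = STRICT_DATE_RE.search(url)
    if match is None:
        return None
    year, month, day = (int(part) for part in match.groups())
    try:
        # date() rejects impossible values such as month 56 or day 78.
        return date(year, month, day)
    except ValueError:
        return None


good = date_from_url("https://www.test.com/2023/11/08/some-story")
bad = date_from_url("https://www.test.com/products/1234/56/78/")
```

Validating through `date()` rather than trusting the digits is what keeps numeric path segments such as product IDs from being mistaken for publication dates.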

First release after the fork

29 Oct 23:27

First release after the fork. This release is based on the 0.1.7 release of the original newspaper3k project. I jumped version numbers to make it clear that this is a fork and not the original project.

New features:

  • tests: started moving tests to pytest(f294a01) (by Andrei)
  • parser: add yoast schema parse for date extraction(39a5cff) (by Andrei)

Bugs fixed:

  • docs: update README.md(d5f9209) (by Andrei)
  • feed_url parsing, issue #915(ec2d474) (by Andrei)
  • better content detection. added and
    tag as candidate for content parent_node(447a429) (by Andrei)
  • close pickle files - PR #938(d7608da) (by Andrei)
  • parsing: improved publication date extraction(4d137eb) (by Andrei)
  • some linter errors, whitespaces and spelling(79553f6) (by Andrei)