Merge pull request #602 from AndyTheFactory/work-0.9.2
0.9.2 release
AndyTheFactory authored Jan 14, 2024
2 parents 32ba20f + 384b92b commit 97fdcb0
Showing 198 changed files with 20,360 additions and 5,143 deletions.
1 change: 1 addition & 0 deletions .codespell-dictionary.txt
100644 → 100755
@@ -1,2 +1,3 @@
doubleclick
te
shotcut
27 changes: 26 additions & 1 deletion .gitattributes
100644 → 100755
@@ -1,2 +1,27 @@
docs/* linguist-documentation
tests/* linguist-vendored
tests/* -linguist-detectable
newspaper/resources/* -linguist-detectable


# Source files
# ============
*.pxd text diff=python eol=lf
*.py text diff=python eol=lf
*.py3 text diff=python eol=lf
*.pyw text diff=python eol=lf
*.pyx text diff=python eol=lf
*.pyz text diff=python eol=lf
*.pyi text diff=python eol=lf

# Binary files
# ============
*.db binary
*.p binary
*.pkl binary
*.pickle binary
*.pyc binary export-ignore
*.pyo binary export-ignore
*.pyd binary

# Jupyter notebook
*.ipynb text eol=lf
Empty file modified .github/ISSUE_TEMPLATE/bug_report.md
100644 → 100755
Empty file.
Empty file modified .github/ISSUE_TEMPLATE/feature_request.md
100644 → 100755
Empty file.
Empty file modified .github/ISSUE_TEMPLATE/questions---help---documentation.md
100644 → 100755
Empty file.
Empty file modified .github/workflows/pipeline.yml
100644 → 100755
Empty file.
Empty file modified .github/workflows/pylint.yml
100644 → 100755
Empty file.
Empty file modified .github/workflows/python-publish.yml
100644 → 100755
Empty file.
4 changes: 3 additions & 1 deletion .gitignore
100644 → 100755
@@ -3,6 +3,7 @@
.DS_Store
.idea
.pypirc
.conda

# C extensions
*.so
@@ -48,6 +49,7 @@ venv


 # Local debug
-_docs/
+_docs
 tests/localdebug
 requirements_poetry.txt
+tmp
Empty file modified .pre-commit-config.yaml
100644 → 100755
Empty file.
Empty file modified .readthedocs.yaml
100644 → 100755
Empty file.
Empty file modified .travis.yml
100644 → 100755
Empty file.
80 changes: 70 additions & 10 deletions CHANGELOG.md
100644 → 100755

Large diffs are not rendered by default.

Empty file modified GOOSE-LICENSE.txt
100644 → 100755
Empty file.
Empty file modified LICENSE
100644 → 100755
Empty file.
Empty file modified MANIFEST.in
100644 → 100755
Empty file.
8 changes: 8 additions & 0 deletions README.md
100644 → 100755
@@ -18,6 +18,14 @@ I have duplicated all issues on the original project and will try to fix them. I
pip install newspaper4k
```

You can start directly from the command line, using the included CLI:
``` bash
python -m newspaper --url="https://edition.cnn.com/2023/11/17/success/job-seekers-use-ai/index.html" --language=en --output-format=json --output-file=article.json

```

Or use the Python API:

``` python
import newspaper

Empty file modified docs/Makefile
100644 → 100755
Empty file.
Empty file modified docs/_static/newspaper.jpg
100644 → 100755
1 change: 1 addition & 0 deletions docs/conf.py
100644 → 100755
@@ -42,6 +42,7 @@
"sphinx.ext.napoleon",
"sphinx.ext.autosummary",
"sphinx.ext.intersphinx",
"sphinxarg.ext",
]

intersphinx_mapping = {
1 change: 1 addition & 0 deletions docs/index.rst
100644 → 100755
@@ -127,6 +127,7 @@ User Guide

user_guide/quickstart
user_guide/installation
user_guide/cli_reference
user_guide/examples
user_guide/advanced
user_guide/api_reference
Empty file modified docs/make.bat
100644 → 100755
Empty file.
Empty file modified docs/requirements.in
100644 → 100755
Empty file.
Empty file modified docs/requirements.txt
100644 → 100755
Empty file.
3 changes: 1 addition & 2 deletions docs/user_guide/advanced.rst
100644 → 100755
@@ -52,8 +52,7 @@ For example, you could:
 # we are calling the shortcut function ``article()`` which will do the
 # downloading and parsing for us and return an ``Article`` object.
-a = article('http://www.cnn.com/2014/01/12/world/asia/north-korea-charles-smith/index.html'
-, keep_article_html=True)
+a = article('http://www.cnn.com/2014/01/12/world/asia/north-korea-charles-smith/index.html')
 print(a.article_html)
 # '<div> \n<p><strong>(CNN)</strong> -- Charles Smith insisted Sunda...'
Empty file modified docs/user_guide/api_reference.rst
100644 → 100755
Empty file.
48 changes: 48 additions & 0 deletions docs/user_guide/cli_reference.rst
@@ -0,0 +1,48 @@
.. _cli:

Command Line Interface (CLI)
============================

.. argparse::
    :module: newspaper.cli
    :func: get_arparse
    :prog: python -m newspaper


Examples
--------

For instance, you can download an article from cnn and save it as a json file:

.. code-block:: bash

    python -m newspaper --url=https://edition.cnn.com/2023/11/16/politics/ethics-committee-releases-santos-report/index.html --output-format=json --output-file=cli_cnn_article.json

Or use a list of urls from a text file (one url on each line), and store all results as a csv:

.. code-block:: bash

    python -m newspaper --urls-from-file=url_list.txt --output-format=csv --output-file=articles.csv

You can also use pipe redirection to read urls from stdin:

.. code-block:: bash

    grep "cnn" huge_url_list.txt | python -m newspaper --urls-from-stdin --output-format=csv --output-file=articles.csv

To read the content of a local html file, use the `--html-from-file` option:

.. code-block:: bash

    python -m newspaper --url=https://edition.cnn.com/2023/11/16/politics/ethics-committee-releases-santos-report/index.html --html-from-file=/home/user/myfile.html --output-format=json

Files can also be read as file:// urls. If you want to preserve the original webpage url, use
the previous example with `--html-from-file`:

.. code-block:: bash

    python -m newspaper --url=file:///home/user/myfile.html --output-format=json

This prints the json representation of the article for the html file stored in `/home/user/myfile.html`.
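
The flags used in the examples above can be modelled with a small `argparse` parser. The following is only a sketch under stated assumptions: `build_parser` and `parse` are hypothetical names, the mutually exclusive URL sources, the `json` default and the two listed output formats are guesses based on the examples, and none of this is the actual `newspaper.cli` implementation.

```python
import argparse
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flags used in the CLI examples."""
    parser = argparse.ArgumentParser(prog="python -m newspaper")
    # Assumption of this sketch: exactly one URL source must be given.
    source = parser.add_mutually_exclusive_group(required=True)
    source.add_argument("--url", help="single article URL")
    source.add_argument("--urls-from-file", help="text file with one URL per line")
    source.add_argument(
        "--urls-from-stdin", action="store_true", help="read URLs from stdin"
    )
    parser.add_argument(
        "--html-from-file", help="local HTML file to parse instead of downloading"
    )
    # Only the two formats shown in the examples are listed here.
    parser.add_argument("--output-format", choices=["json", "csv"], default="json")
    parser.add_argument("--output-file", help="write results to this path")
    return parser


def parse(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse an explicit argument list (or sys.argv when argv is None)."""
    return build_parser().parse_args(argv)
```

Calling `parse(["--urls-from-stdin", "--output-format=csv"])` returns a namespace whose `urls_from_stdin` attribute is `True`; the real CLI may accept more options than this sketch.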
Empty file modified docs/user_guide/examples.rst
100644 → 100755
Empty file.
1 change: 0 additions & 1 deletion docs/user_guide/installation.rst
100644 → 100755
@@ -43,7 +43,6 @@ The newspaper4k package has the following dependencies:
* tldextract
* Pillow
* PyYAML
* cssselect
* feedfinder2
* tinysegmenter
* pythainlp
Empty file modified docs/user_guide/known_issues.rst
100644 → 100755
Empty file.
Empty file modified docs/user_guide/known_newssites.rst
100644 → 100755
Empty file.
1 change: 1 addition & 0 deletions docs/user_guide/known_sites_not_working.csv
100644 → 100755
@@ -3,3 +3,4 @@ https://ec.europa.eu/commission/presscorner/home/en, Site generated with javascr
https://www.newspapers.com/, Scanned content, 2023-10-28
https://www.alarabiya.net/, Protected by cloudflair, 2023-11-03
https://www.investors.com, Protected by perimeterx, 2023-11-05
https://www.chicagobusiness.com/, Protected by some framework, 2023-11-18
Empty file modified docs/user_guide/languages.rst
100644 → 100755
Empty file.
Empty file modified docs/user_guide/quickstart.rst
100644 → 100755
Empty file.
26 changes: 0 additions & 26 deletions download_corpora.py

This file was deleted.

19 changes: 12 additions & 7 deletions newspaper/__init__.py
100644 → 100755
@@ -23,15 +23,12 @@
     popular_urls,
     Configuration as Config,
 )
-from .article import Article, ArticleException
-from .mthreading import NewsPool
+from .article import Article
 from .source import Source
 from .version import __version__
 import logging
 from logging import NullHandler

-news_pool = NewsPool()

+from .exceptions import ArticleBinaryDataException, ArticleException

 # Set default logging handler to avoid "No handler found" warnings.
 logging.getLogger(__name__).addHandler(NullHandler())
@@ -43,6 +40,9 @@ def article(url: str, language: Optional[str] = "en", **kwargs) -> Article:
     Args:
         url (str): The URL of the article to download and parse.
         language (str): The language of the article to download and parse.
+        input_html (str): The HTML of the article to parse. This
+            is used for pre-downloaded articles. If this is set,
+            then there will be no download requests made.
         kwargs: Any other keyword arguments to pass to the Article constructor.
     Returns:
@@ -51,8 +51,13 @@ def article(url: str, language: Optional[str] = "en", **kwargs) -> Article:
     Raises:
         ArticleException: If the article could not be downloaded or parsed.
     """
+    if "input_html" in kwargs:
+        input_html = kwargs["input_html"]
+        del kwargs["input_html"]
+    else:
+        input_html = None
     a = Article(url, language=language, **kwargs)
-    a.download()
+    a.download(input_html=input_html)
     a.parse()
     return a
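
The `input_html` handling in this hunk pulls one optional keyword out of `kwargs` before the rest is forwarded to the constructor. A minimal sketch of the same pattern using `dict.pop` with a default follows; `FakeArticle` and `make_article` are hypothetical stand-ins written for illustration and are not part of newspaper4k.

```python
from typing import Any, Optional


class FakeArticle:
    """Hypothetical stand-in for an article class; not the newspaper4k Article."""

    def __init__(self, url: str, **kwargs: Any) -> None:
        self.url = url
        self.options = kwargs  # everything that was forwarded to the constructor
        self.html: Optional[str] = None

    def download(self, input_html: Optional[str] = None) -> None:
        # With pre-supplied HTML, skip the (simulated) network request.
        if input_html is not None:
            self.html = input_html
        else:
            self.html = f"<fetched from {self.url}>"


def make_article(url: str, **kwargs: Any) -> FakeArticle:
    # pop() removes the key and returns None when it is absent, so the
    # constructor never receives an unexpected `input_html` argument.
    input_html = kwargs.pop("input_html", None)
    a = FakeArticle(url, **kwargs)
    a.download(input_html=input_html)
    return a
```

In this sketch, `make_article("https://example.com/a", input_html="<p>hi</p>")` stores the given markup without simulating a fetch, which mirrors the docstring's statement that no download request is made when `input_html` is set.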

@@ -68,7 +73,7 @@ def article(url: str, language: Optional[str] = "en", **kwargs) -> Article:
     "Config",
     "Article",
     "ArticleException",
+    "ArticleBinaryDataException",
     "Source",
     "__version__",
-    "news_pool",
 ]
4 changes: 4 additions & 0 deletions newspaper/__main__.py
@@ -0,0 +1,4 @@
from newspaper.cli import main

if __name__ == "__main__":
main()
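
Adding a `__main__.py` is what makes `python -m newspaper` work: Python executes that module when a package is run with `-m`. The sketch below demonstrates the mechanism with a throwaway package; `demo_pkg` and `run_demo_package` are hypothetical names, and the generated `cli.py` only prints a fixed string rather than doing anything the real CLI does.

```python
import os
import subprocess
import sys
import tempfile
from pathlib import Path


def run_demo_package() -> str:
    """Build a throwaway package with a __main__.py and run it via `python -m`."""
    with tempfile.TemporaryDirectory() as tmp:
        pkg = Path(tmp) / "demo_pkg"  # hypothetical package name
        pkg.mkdir()
        (pkg / "__init__.py").write_text("")
        (pkg / "cli.py").write_text("def main():\n    print('hello from main')\n")
        # Same shape as the __main__.py added in this commit.
        (pkg / "__main__.py").write_text(
            "from demo_pkg.cli import main\n\n"
            "if __name__ == '__main__':\n"
            "    main()\n"
        )
        # Put the temporary directory on the import path of the child process.
        env = dict(os.environ, PYTHONPATH=tmp)
        result = subprocess.run(
            [sys.executable, "-m", "demo_pkg"],
            capture_output=True,
            text=True,
            check=True,
            env=env,
            cwd=tmp,
        )
        return result.stdout.strip()
```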
9 changes: 5 additions & 4 deletions newspaper/api.py
100644 → 100755
@@ -9,15 +9,16 @@
 from .configuration import Configuration
 from .settings import POPULAR_URLS, TRENDING_URL
 from .source import Source
-from .utils import extend_config, print_available_languages
+from .utils import print_available_languages
+import newspaper.parsers as parsers


 def build(url="", dry=False, config=None, **kwargs) -> Source:
     """Returns a constructed source object without
     downloading or parsing the articles
     """
     config = config or Configuration()
-    config = extend_config(config, kwargs)
+    config.update(**kwargs)
     url = url or ""
     s = Source(url, config=config)
     if not dry:
@@ -30,7 +31,7 @@ def build_article(url="", config=None, **kwargs) -> Article:
     or parsing
     """
     config = config or Configuration()
-    config = extend_config(config, kwargs)
+    config.update(**kwargs)
     url = url or ""
     a = Article(url, config=config)
     return a
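
Both hunks above replace the free function `extend_config(config, kwargs)` with an in-place `config.update(**kwargs)` call. The diff does not show how `Configuration.update` is implemented, so the following is only a sketch of one plausible shape; `DemoConfig` and `build_config` are hypothetical, and ignoring unknown keys is an assumption rather than documented behaviour.

```python
from typing import Any, Optional


class DemoConfig:
    """Hypothetical configuration object; not the newspaper4k Configuration."""

    def __init__(self) -> None:
        self.language = "en"
        self.fetch_images = True

    def update(self, **kwargs: Any) -> None:
        # Only overwrite options that already exist. Silently ignoring
        # unknown keys is an assumption of this sketch.
        for key, value in kwargs.items():
            if hasattr(self, key):
                setattr(self, key, value)


def build_config(config: Optional[DemoConfig] = None, **kwargs: Any) -> DemoConfig:
    config = config or DemoConfig()
    config.update(**kwargs)  # in-place update, so no reassignment is needed
    return config
```

One consequence of the method form is that the caller's configuration object is mutated directly, whereas the old call reassigned `config` from the helper's return value.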
@@ -75,7 +76,7 @@ def fulltext(html, language="en"):
     document_cleaner = DocumentCleaner(config)
     output_formatter = OutputFormatter(config)

-    doc = config.get_parser().fromstring(html)
+    doc = parsers.fromstring(html)
     doc = document_cleaner.clean(doc)

     extractor.calculate_best_node(doc)