prepare version 1.10.0 (#608)
* prepare version 1.10.0

* fixes
adbar authored May 30, 2024
1 parent bbf7bec commit b36b6fa
Showing 7 changed files with 52 additions and 6 deletions.
27 changes: 27 additions & 0 deletions HISTORY.md
@@ -1,6 +1,33 @@
## History / Changelog


### 1.10.0

Breaking changes:
- raise errors on deprecated CLI and function arguments (#581)
- regroup classes and functions linked to deduplication (#582)
  ``trafilatura.hashing`` → ``trafilatura.deduplication``

Extraction:
- port of is_probably_readerable from readability.js by @zirkelc in #587
- Markdown table fixes by @naktinis in #601
- fix list spacing in TXT output (#598)
- CLI fixes: file processing options, mtime, and tests (#605)
- CLI fix: read standard input as binary (#607)

Downloads:
- fix deflate and add optional zstd to accepted encodings (#594)
- spider fix: use internal download utilities for robots.txt (#590)

Maintenance:
- add author XPaths (#567)
- update justext and lxml dependencies (#593)
- simplify code: unique function for length tests (#591)

Docs:
- fix typos by @RainRat in #603


### 1.9.0

Extraction:
3 changes: 2 additions & 1 deletion README.md
@@ -60,7 +60,8 @@ search engine optimization, and information security).
- Optional elements: comments, links, images, tables

- Multiple output formats:
-  - Text (minimal formatting or Markdown)
+  - Text
+  - Markdown (with formatting)
- CSV (with metadata)
- JSON (with metadata)
- XML or [XML-TEI](https://tei-c.org/) (with metadata, text formatting and page structure)
3 changes: 2 additions & 1 deletion docs/index.rst
@@ -60,7 +60,8 @@ Features
- Formatting and structure: paragraphs, titles, lists, quotes, code, line breaks, in-line text formatting
- Optional elements: comments, links, images, tables
- Multiple output formats:
-  - Text (minimal formatting or Markdown)
+  - Text
+  - Markdown (with formatting)
- CSV (with metadata)
- JSON (with metadata)
- XML or `XML-TEI <https://tei-c.org/>`_ (with metadata, text formatting and page structure)
2 changes: 2 additions & 0 deletions docs/installation.rst
@@ -121,6 +121,8 @@ py3langid
Language detection on extracted main text
pycurl
Faster downloads, possibly less robust though
zstandard
Additional compression algorithm for downloads
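
To illustrate what an optional dependency such as ``zstandard`` can mean for downloads, the sketch below builds an ``Accept-Encoding`` value that advertises ``zstd`` only when the package is importable. The helper name ``accepted_encodings`` and its baseline list are assumptions made for this example; this is not Trafilatura's actual download code.

```python
import importlib.util


def accepted_encodings(baseline=("gzip", "deflate")):
    """Build an Accept-Encoding value (hypothetical helper, not Trafilatura's code)."""
    encodings = list(baseline)
    # Advertise zstd only if the optional zstandard package is installed.
    if importlib.util.find_spec("zstandard") is not None:
        encodings.append("zstd")
    return ", ".join(encodings)


print(accepted_encodings())
```

Checking for the package at runtime keeps the dependency optional: without ``zstandard`` the header simply omits the extra encoding.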



19 changes: 17 additions & 2 deletions docs/usage-python.rst
@@ -157,6 +157,17 @@ This function emulates the behavior of similar functions in other packages, it i
>>> html2txt(downloaded)
Guessing if text can be found
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The function ``is_probably_readerable()`` has been ported from Mozilla's Readability.js. It is available from version 1.10.0 onwards and provides a way to guess whether a page probably has a main text to extract.

.. code-block:: python

    >>> from trafilatura.readability_lxml import is_probably_readerable
    >>> is_probably_readerable(html)  # HTML string or already parsed tree

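
One possible use of such a check is as a cheap gate before a full extraction. The following is a minimal sketch under that assumption: ``extract_if_readable``, ``fake_check`` and ``fake_extract`` are hypothetical names introduced here so that the example runs without Trafilatura installed. In practice one would pass ``is_probably_readerable`` and an extraction function in their place.

```python
def extract_if_readable(html, is_readable, extractor):
    """Run the extractor only if the readability check passes, else return None."""
    if not is_readable(html):
        return None
    return extractor(html)


# Stand-ins for illustration only (not part of Trafilatura).
def fake_check(html):
    return "<p>" in html


def fake_extract(html):
    return html.replace("<p>", "").replace("</p>", "")


result = extract_if_readable("<p>Some text.</p>", fake_check, fake_extract)
skipped = extract_if_readable("<div></div>", fake_check, fake_extract)
print(result, skipped)
```
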
Language identification
^^^^^^^^^^^^^^^^^^^^^^^

@@ -199,7 +210,7 @@ The `SimHash <https://en.wikipedia.org/wiki/SimHash>`_ method (also called Chari
.. code-block:: python

    # create a Simhash for near-duplicate detection
-   >>> from trafilatura.hashing import Simhash
+   >>> from trafilatura.deduplication import Simhash
    >>> first = Simhash("This is a text.")
    >>> second = Simhash("This is a test.")
    >>> second.similarity(first)
@@ -217,11 +228,15 @@ Other convenience functions include generation of file names based on their cont
.. code-block:: python

    # create a filename-safe string by hashing the given content
-   >>> from trafilatura.hashing import generate_hash_filename
+   >>> from trafilatura.deduplication import generate_hash_filename
    >>> generate_hash_filename("This is a text.")
    'qAgzZnskrcRgeftk'

.. note::
    The ``trafilatura.hashing`` submodule has been renamed ``trafilatura.deduplication`` in version 1.10.0.
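
Code that has to run against versions on both sides of this rename could try the new module name first and fall back to the old one. The sketch below shows one way to do this; ``import_first`` is a hypothetical helper written for this example, not something Trafilatura provides.

```python
import importlib


def import_first(*module_names):
    """Return the first module among the given names that can be imported."""
    for name in module_names:
        try:
            return importlib.import_module(name)
        except ImportError:
            continue
    raise ImportError(f"none of these modules could be imported: {module_names}")


# Intended usage (requires Trafilatura to be installed):
# dedup = import_first("trafilatura.deduplication", "trafilatura.hashing")
# Simhash = dedup.Simhash
```

A plain ``try``/``except ImportError`` around the two ``from ... import`` statements achieves the same result when only one or two names are needed.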


Extraction settings
-------------------

2 changes: 1 addition & 1 deletion tests/cli_tests.py
@@ -111,7 +111,7 @@ def test_parser():
assert e.type == SystemExit
assert e.value.code == 0
assert re.match(
- r"Trafilatura [0-9]\.[0-9]\.[0-9] - Python [0-9]\.[0-9]+\.[0-9]", f.getvalue()
+ r"Trafilatura [0-9]\.[0-9]+\.[0-9] - Python [0-9]\.[0-9]+\.[0-9]", f.getvalue()
)

# test deprecations
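
The pattern change in this test matters because the minor version now has two digits. The short check below compares the old and new patterns from the diff against a sample banner string. The banner text, including the Python version in it, is made up for illustration.

```python
import re

OLD = r"Trafilatura [0-9]\.[0-9]\.[0-9] - Python [0-9]\.[0-9]+\.[0-9]"
NEW = r"Trafilatura [0-9]\.[0-9]+\.[0-9] - Python [0-9]\.[0-9]+\.[0-9]"

# Sample version banner (illustrative string, not captured CLI output).
banner = "Trafilatura 1.10.0 - Python 3.11.4"

# The old pattern allows a single digit for the minor version, so "10" fails.
old_matches = re.match(OLD, banner) is not None
# The new pattern accepts one or more digits there.
new_matches = re.match(NEW, banner) is not None

print(old_matches, new_matches)  # False True
```
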
2 changes: 1 addition & 1 deletion trafilatura/__init__.py
@@ -9,7 +9,7 @@
__author__ = 'Adrien Barbaresi and contributors'
__license__ = "Apache-2.0"
__copyright__ = 'Copyright 2019-2024, Adrien Barbaresi'
- __version__ = '1.9.0'
+ __version__ = '1.10.0'


import logging