prepare version 1.6.1 (#371)
* prepare release 1.6.1

* fix docs
adbar authored Jun 15, 2023
1 parent b210084 commit d85d584
Showing 3 changed files with 22 additions and 7 deletions.
21 changes: 18 additions & 3 deletions HISTORY.md
@@ -1,6 +1,22 @@
## History / Changelog


### 1.6.1

Extraction:
- minor fixes: tables in figures (#301), headings (#354) and lists (#318)

Metadata:
- simplify and fully test JSON parsing code, with @felipehertzer (#352, #368)
- authors, JSON and unicode fixes by @felipehertzer in #365
- fix for authors without `additionalName` by @awwitecki in #363

Navigation:
- reviewed link processing in feeds and sitemaps (#340, #350)
- more robust spider (#359)
- updated underlying courlan package (#360)


### 1.6.0

Extraction:
@@ -15,19 +31,18 @@ Command-line interface:
- more efficient downloads (#338)
- fix for single URL processing (#324) and URL blacklisting (#339)

Navigation
Navigation:
- additional safety check on domain similarity for feeds and sitemaps
- new function ``is_live_page()`` using HTTP HEAD request (#327)
- code parts supported by new courlan version

Maintenance
Maintenance:
- allow ``urllib3`` version 2.0+
- minor code simplification and fixes


### 1.5.0


Extraction:
- fixes for metadata extraction with @felipehertzer (#295, #296), @andremacola (#282, #310), and @edkrueger (#303)
- pagetype and image urls added to metadata by @andremacola (#282, #310)
6 changes: 3 additions & 3 deletions docs/sources.rst
@@ -39,7 +39,7 @@ Corpora
URL lists from corpus linguistics projects can be a useful starting point, either to recreate existing corpora or to re-crawl the websites and find new content. Even if the websites no longer exist, the links can still be useful, as the corresponding web pages can be retrieved from web archives.

- `Sources for the Internet Corpora <http://corpus.leeds.ac.uk/internet.html>`_ of the Leeds Centre for Translation Studies
- `Link data sets <https://corporafromtheweb.org/link-data-sets-cc-by/>`_ of the COW project
- `Link data sets <https://www.webcorpora.org/opendata/links/>`_ of the COW project


URL directories
@@ -95,7 +95,7 @@ Here is how to make this method work in a modular way:
# use the list gathered in (1)
>>> wordlist = ['word1', 'word2', 'word3', 'word4'] # and so on
# draw 3 random words from the list
>>> selection = random.choices(wordlist, k=3)
>>> selection = random.sample(wordlist, k=3)
3. Get URL results from search engines for the random tuples. Here are examples of Python modules to query search engines: `search-engine-parser <https://github.com/bisohns/search-engine-parser>`_ and `GoogleScraper <https://github.com/NikolaiT/GoogleScraper>`_.
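
A minimal, self-contained version of steps 2 and 3 in the same doctest style as the snippet above; the wordlist values and the query handling are placeholders, and the actual search-engine lookup is left to one of the modules mentioned in step 3. ``random.sample()`` draws distinct words, so the query contains no duplicates, which ``random.choices()`` could not guarantee:

>>> import random
# 1) wordlist gathered beforehand (placeholder values)
>>> wordlist = ['word1', 'word2', 'word3', 'word4', 'word5']
# 2) draw 3 distinct words from the list
>>> selection = random.sample(wordlist, k=3)
# 3) join them into a query string to pass to a search engine module
>>> query = ' '.join(selection)
>>> print(query)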

@@ -135,7 +135,7 @@ Previously collected tweet IDs can be “hydrated”, i.e. retrieved from Twitte
- `Twitter datasets for research and archiving <https://tweetsets.library.gwu.edu/>`_
- `Search GitHub for Tweet IDs <https://github.com/search?q=tweet+ids>`_

Links can be extracted from tweets with a regular expression such as ``re.findall(r'https://[^ ]+')``. They probably need to be resolved first to get actual link targets and not just shortened URLs (like t.co/…).
Links can be extracted from tweets with a regular expression such as ``re.findall(r'https?://[^ ]+')``. They probably need to be resolved first to get actual link targets and not just shortened URLs (like t.co/…).
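
A short sketch of both steps using only the standard library; the tweet text here is a placeholder, and ``urlopen()`` follows redirects by default, so the final address of a shortened link can be read back with ``geturl()``:

import re
from urllib.request import urlopen

tweet = 'Interesting read https://t.co/abc123 via @someone'  # placeholder text
# extract candidate links (the pattern above, applied to a string)
links = re.findall(r'https?://[^ ]+', tweet)
# resolve shortened URLs by following redirects to the final target
resolved = []
for url in links:
    try:
        with urlopen(url, timeout=10) as response:
            resolved.append(response.geturl())
    except OSError:
        pass  # skip unreachable or invalid links
print(resolved)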


For further ideas from previous projects see references below.
2 changes: 1 addition & 1 deletion trafilatura/__init__.py
@@ -9,7 +9,7 @@
__author__ = 'Adrien Barbaresi and contributors'
__license__ = 'GNU GPL v3+'
__copyright__ = 'Copyright 2019-2023, Adrien Barbaresi'
__version__ = '1.6.0'
__version__ = '1.6.1'


import logging
