diff --git a/HISTORY.md b/HISTORY.md
index 0f77a9fb..de5512c4 100644
--- a/HISTORY.md
+++ b/HISTORY.md
@@ -1,6 +1,22 @@
 ## History / Changelog
 
 
+### 1.6.1
+
+Extraction:
+- minor fixes: tables in figures (#301), headings (#354) and lists (#318)
+
+Metadata:
+- simplify and fully test JSON parsing code, with @felipehertzer (#352, #368)
+- authors, JSON and unicode fixes by @felipehertzer in #365
+- fix for authors without `additionalName` by @awwitecki in #363
+
+Navigation:
+- reviewed link processing in feeds and sitemaps (#340, #350)
+- more robust spider (#359)
+- updated underlying courlan package (#360)
+
+
 ### 1.6.0
 
 Extraction:
@@ -15,19 +31,18 @@ Command-line interface:
 - more efficient downloads (#338)
 - fix for single URL processing (#324) and URL blacklisting (#339)
 
-Navigation
+Navigation:
 - additional safety check on domain similarity for feeds and sitemaps
 - new function ``is_live test()`` using HTTP HEAD request (#327)
 - code parts supported by new courlan version
 
-Maintenance
+Maintenance:
 - allow ``urllib3`` version 2.0+
 - minor code simplification and fixes
 
 
 ### 1.5.0
-
 
 Extraction:
 - fixes for metadata extraction with @felipehertzer (#295, #296), @andremacola (#282, #310), and @edkrueger (#303)
 - pagetype and image urls added to metadata by @andremacola (#282, #310)
diff --git a/docs/sources.rst b/docs/sources.rst
index d4960069..d0f1844c 100644
--- a/docs/sources.rst
+++ b/docs/sources.rst
@@ -39,7 +39,7 @@ Corpora
 URL lists from corpus linguistic projects can be a starting ground to derive information from, either to recreate existing corpora or to re-crawl the websites and find new content. If the websites do not exist anymore, the links can still be useful as the corresponding web pages can be retrieved from web archives.
 
 - `Sources for the Internet Corpora `_ of the Leeds Centre for Translation Studies
-- `Link data sets `_ of the COW project
+- `Link data sets `_ of the COW project
 
 
 URL directories
@@ -95,7 +95,7 @@ Here is how to make this method work in a modular way:
     # use the list gathered in (1)
     >>> wordlist = ['word1', 'word2', 'word3', 'word4'] # and so on
     # draw 3 random words from the list
-    >>> selection = random.choices(wordlist, k=3)
+    >>> selection = random.sample(wordlist, k=3)
 
 3. Get URL results from search engines for the random tuples. Here are examples of Python modules to query search engines: `search-engine-parser `_ and `GoogleScraper `_.
 
@@ -135,7 +135,7 @@ Previously collected tweet IDs can be “hydrated”, i.e. retrieved from Twitte
 
 - `Twitter datasets for research and archiving `_
 - `Search GitHub for Tweet IDs `_
 
-Links can be extracted from tweets with a regular expression such as ``re.findall(r'https://[^ ]+')``. They probably need to be resolved first to get actual link targets and not just shortened URLs (like t.co/…).
+Links can be extracted from tweets with a regular expression such as ``re.findall(r'https?://[^ ]+')``. They probably need to be resolved first to get actual link targets and not just shortened URLs (like t.co/…).
 
 For further ideas from previous projects see references below.
diff --git a/trafilatura/__init__.py b/trafilatura/__init__.py
index d7b27974..69ed4f97 100644
--- a/trafilatura/__init__.py
+++ b/trafilatura/__init__.py
@@ -9,7 +9,7 @@
 __author__ = 'Adrien Barbaresi and contributors'
 __license__ = 'GNU GPL v3+'
 __copyright__ = 'Copyright 2019-2023, Adrien Barbaresi'
-__version__ = '1.6.0'
+__version__ = '1.6.1'
 
 import logging
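
Not part of the diff itself, but the ``random.sample`` change in ``docs/sources.rst`` is easier to judge with the behavioural difference spelled out: ``random.sample`` draws k distinct words, while ``random.choices`` samples with replacement and can repeat a word within one search tuple. A minimal sketch, with a placeholder word list:

    import random

    # placeholder list standing in for the words gathered in step (1) of the docs
    wordlist = ['word1', 'word2', 'word3', 'word4', 'word5']

    # random.sample: k distinct items, so a search tuple never repeats a word;
    # random.choices would sample with replacement and could return duplicates
    selection = random.sample(wordlist, k=3)
    print(selection)  # e.g. ['word4', 'word1', 'word5']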
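
Likewise outside the diff, a sketch of the workflow the amended sentence in ``docs/sources.rst`` describes: extracting links with the broadened ``https?://`` pattern (applied to a string, which the inline example omits) and then resolving shortened URLs such as t.co links. The example tweet text and the use of ``requests`` are assumptions, not something the docs prescribe:

    import re

    import requests  # assumed HTTP client; any library that follows redirects works

    tweet = 'Worth a read https://t.co/abc123 and http://example.org/page'

    # the broadened pattern matches both http:// and https:// links
    links = re.findall(r'https?://[^ ]+', tweet)

    resolved = []
    for url in links:
        try:
            # follow redirects to reach the actual target behind URL shorteners
            response = requests.head(url, allow_redirects=True, timeout=10)
            resolved.append(response.url)
        except requests.RequestException:
            resolved.append(url)  # keep the original link if resolution fails

    print(resolved)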