prepare version 1.6.1 (#371)
* prepare release 1.6.1

* fix docs
adbar authored Jun 15, 2023
1 parent b210084 commit d85d584
Showing 3 changed files with 22 additions and 7 deletions.
21 changes: 18 additions & 3 deletions HISTORY.md
@@ -1,6 +1,22 @@
## History / Changelog


### 1.6.1

Extraction:
- minor fixes: tables in figures (#301), headings (#354) and lists (#318)

Metadata:
- simplify and fully test JSON parsing code, with @felipehertzer (#352, #368)
- authors, JSON and unicode fixes by @felipehertzer in #365
- fix for authors without `additionalName` by @awwitecki in #363

Navigation:
- reviewed link processing in feeds and sitemaps (#340, #350)
- more robust spider (#359)
- updated underlying courlan package (#360)


### 1.6.0

Extraction:
@@ -15,19 +31,18 @@ Command-line interface:
- more efficient downloads (#338)
- fix for single URL processing (#324) and URL blacklisting (#339)

Navigation
Navigation:
- additional safety check on domain similarity for feeds and sitemaps
- new function ``is_live_page()`` using HTTP HEAD request (#327)
- code parts supported by new courlan version

Maintenance
Maintenance:
- allow ``urllib3`` version 2.0+
- minor code simplification and fixes


### 1.5.0


Extraction:
- fixes for metadata extraction with @felipehertzer (#295, #296), @andremacola (#282, #310), and @edkrueger (#303)
- pagetype and image urls added to metadata by @andremacola (#282, #310)
6 changes: 3 additions & 3 deletions docs/sources.rst
@@ -39,7 +39,7 @@ Corpora
URL lists from corpus linguistics projects can be a useful starting point, either to recreate existing corpora or to re-crawl the websites and find new content. Even if the websites no longer exist, the links can still be useful, as the corresponding web pages can be retrieved from web archives.

- `Sources for the Internet Corpora <http://corpus.leeds.ac.uk/internet.html>`_ of the Leeds Centre for Translation Studies
- `Link data sets <https://corporafromtheweb.org/link-data-sets-cc-by/>`_ of the COW project
- `Link data sets <https://www.webcorpora.org/opendata/links/>`_ of the COW project


URL directories
@@ -95,7 +95,7 @@ Here is how to make this method work in a modular way:
# use the list gathered in (1)
>>> wordlist = ['word1', 'word2', 'word3', 'word4'] # and so on
# draw 3 random words from the list
>>> selection = random.choices(wordlist, k=3)
>>> selection = random.sample(wordlist, k=3)
3. Get URL results from search engines for the random tuples. Here are examples of Python modules to query search engines: `search-engine-parser <https://github.com/bisohns/search-engine-parser>`_ and `GoogleScraper <https://github.com/NikolaiT/GoogleScraper>`_.
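
A minimal, self-contained version of steps 2 and 3 in the same doctest style as the snippet above; the wordlist values and the query handling are placeholders, and the actual search-engine lookup is left to one of the modules mentioned in step 3. ``random.sample()`` draws distinct words, so the query contains no duplicates, which ``random.choices()`` could not guarantee:

>>> import random
# 1) wordlist gathered beforehand (placeholder values)
>>> wordlist = ['word1', 'word2', 'word3', 'word4', 'word5']
# 2) draw 3 distinct words from the list
>>> selection = random.sample(wordlist, k=3)
# 3) join them into a query string to pass to a search engine module
>>> query = ' '.join(selection)
>>> print(query)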

@@ -135,7 +135,7 @@ Previously collected tweet IDs can be “hydrated”, i.e. retrieved from Twitte
- `Twitter datasets for research and archiving <https://tweetsets.library.gwu.edu/>`_
- `Search GitHub for Tweet IDs <https://github.com/search?q=tweet+ids>`_

Links can be extracted from tweets with a regular expression such as ``re.findall(r'https://[^ ]+')``. They probably need to be resolved first to get actual link targets and not just shortened URLs (like t.co/…).
Links can be extracted from tweets with a regular expression such as ``re.findall(r'https?://[^ ]+')``. They probably need to be resolved first to get actual link targets and not just shortened URLs (like t.co/…).
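
A short sketch of both steps using only the standard library; the tweet text here is a placeholder, and ``urlopen()`` follows redirects by default, so the final address of a shortened link can be read back with ``geturl()``:

import re
from urllib.request import urlopen

tweet = 'Interesting read https://t.co/abc123 via @someone'  # placeholder text
# extract candidate links (the pattern above, applied to a string)
links = re.findall(r'https?://[^ ]+', tweet)
# resolve shortened URLs by following redirects to the final target
resolved = []
for url in links:
    try:
        with urlopen(url, timeout=10) as response:
            resolved.append(response.geturl())
    except OSError:
        pass  # skip unreachable or invalid links
print(resolved)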


For further ideas from previous projects see references below.
2 changes: 1 addition & 1 deletion trafilatura/__init__.py
@@ -9,7 +9,7 @@
__author__ = 'Adrien Barbaresi and contributors'
__license__ = 'GNU GPL v3+'
__copyright__ = 'Copyright 2019-2023, Adrien Barbaresi'
__version__ = '1.6.0'
__version__ = '1.6.1'


import logging
