From 0ebbbcbf0ae2506cf81bb9580869b8bc3b66ecd9 Mon Sep 17 00:00:00 2001
From: Adrien Barbaresi
Date: Wed, 8 Nov 2023 16:59:23 +0100
Subject: [PATCH] update docs (#437)

* update docs

* add change in tutorial (#439)

* update requirements
---
 docs/crawls.rst           | 2 +-
 docs/installation.rst     | 3 +++
 docs/requirements.txt     | 2 +-
 docs/tutorial-dwds.rst    | 2 +-
 docs/tutorial-epsilla.rst | 2 +-
 docs/tutorial0.rst        | 6 +++---
 docs/tutorial1.rst        | 4 ++--
 7 files changed, 12 insertions(+), 9 deletions(-)

diff --git a/docs/crawls.rst b/docs/crawls.rst
index 01cd949e..165fab52 100644
--- a/docs/crawls.rst
+++ b/docs/crawls.rst
@@ -114,7 +114,7 @@ On the CLI the crawler automatically works its way through a website, stopping a

    $ trafilatura --crawl "https://www.example.org" > links.txt

-It can also crawl websites in parallel by reading a list of target sites from a list (``-i``/``--inputfile`` option).
+It can also crawl websites in parallel by reading target sites from a file (``-i``/``--input-file`` option).

 .. note::
     The ``--list`` option does not apply here. Unlike with the ``--sitemap`` or ``--feed`` options, the URLs are simply returned as a list instead of being retrieved and processed. This happens in order to give a chance to examine the collected URLs prior to further downloads.

diff --git a/docs/installation.rst b/docs/installation.rst
index ae4b46f7..a83acc8f 100644
--- a/docs/installation.rst
+++ b/docs/installation.rst
@@ -61,6 +61,9 @@ This project is under active development, please make sure you keep it up-to-dat

 On **Mac OS** it can be necessary to install certificates by hand if you get errors like ``[SSL: CERTIFICATE_VERIFY_FAILED]`` while downloading webpages: execute ``pip install certifi`` and perform the post-installation step by clicking on ``/Applications/Python 3.X/Install Certificates.command``. For more information see this `help page on SSL errors `_.

+.. hint::
+    Installation on macOS is generally easier with `brew <https://brew.sh>`_.
+

 Older Python versions
 ~~~~~~~~~~~~~~~~~~~~~

diff --git a/docs/requirements.txt b/docs/requirements.txt
index 24f72038..43d13278 100644
--- a/docs/requirements.txt
+++ b/docs/requirements.txt
@@ -1,6 +1,6 @@
 # with version specifier
 sphinx>=7.2.6
-pydata-sphinx-theme>=0.14.1
+pydata-sphinx-theme>=0.14.3
 docutils>=0.20.1
 # without version specifier
 trafilatura

diff --git a/docs/tutorial-dwds.rst b/docs/tutorial-dwds.rst
index c9e871c8..a45aeb44 100644
--- a/docs/tutorial-dwds.rst
+++ b/docs/tutorial-dwds.rst
@@ -114,7 +114,7 @@ This link list can first be filtered in order to keep German-language, content-rich

 *Trafilatura* produces output in two ways: the extracted texts (TXT format) in the ``ausgabe`` directory and a copy of the downloaded web pages under ``html-quellen`` (for archiving and, if needed, reprocessing):

-``trafilatura --inputfile linkliste.txt --outputdir ausgabe/ --backup-dir html-quellen/``
+``trafilatura --input-file linkliste.txt --outputdir ausgabe/ --backup-dir html-quellen/``

 This outputs TXT files without metadata. If you add ``--csv``, ``--json``, ``--xml`` or ``--xmltei``, metadata is included and the corresponding output format is selected. Additional options are available, see the relevant documentation pages.
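To see the renamed flag in context: below is the DWDS tutorial command with the updated spelling, plus a metadata variant. This is a sketch assembled from the options shown in this patch, not an additional change; the file and directory names are the ones used in the tutorial above::

    # plain TXT output, no metadata; keep raw HTML for archiving
    $ trafilatura --input-file linkliste.txt --outputdir ausgabe/ --backup-dir html-quellen/
    # add --csv, --json, --xml or --xmltei to include metadata
    $ trafilatura --xml --input-file linkliste.txt --outputdir ausgabe/ --backup-dir html-quellen/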
diff --git a/docs/tutorial-epsilla.rst b/docs/tutorial-epsilla.rst
index e723dbd2..1c491f79 100644
--- a/docs/tutorial-epsilla.rst
+++ b/docs/tutorial-epsilla.rst
@@ -19,7 +19,7 @@ Text embedding involves converting text into numerical vectors, and is commonly
 - Anomaly detection (identify outliers)

 In this tutorial, we will show you how to perform text embedding on results from Trafilatura. We will use
-`Epsilla `_, an open source vector database for storing and searching vector embeddings. It is 10x faster than regular relational databases for vector operations.
+`Epsilla `_, an open source vector database for storing and searching vector embeddings. It is 10x faster than regular vector databases for vector operations.

 .. note::
     For a hands-on version of this tutorial, try out the `Colab Notebook `_.

diff --git a/docs/tutorial0.rst b/docs/tutorial0.rst
index 504509dc..19e42e1c 100644
--- a/docs/tutorial0.rst
+++ b/docs/tutorial0.rst
@@ -171,8 +171,8 @@ Seamless download and processing

 Two major command line arguments are necessary here:

-- ``-i`` or ``--inputfile`` to select an input list to read links from
-- ``-o`` or ``--outputdir`` to define a directory to eventually store the results
+- ``-i`` or ``--input-file`` to select an input list to read links from
+- ``-o`` or ``--output-dir`` to define a directory to eventually store the results

 An additional argument can be useful in this context:

@@ -213,6 +213,6 @@ Alternatively, you can download a series of web documents with generic command-l

     # download if necessary
     $ wget --directory-prefix=download/ --wait 5 --input-file=mylist.txt
     # process a directory with archived HTML files
-    $ trafilatura --inputdir download/ --outputdir corpus/ --xmltei --nocomments
+    $ trafilatura --input-dir download/ --output-dir corpus/ --xmltei --no-comments

diff --git a/docs/tutorial1.rst b/docs/tutorial1.rst
index f28041da..372cca86 100644
--- a/docs/tutorial1.rst
+++ b/docs/tutorial1.rst
@@ -26,8 +26,8 @@ For the collection and filtering of links see `this tutorial `_

 Two major options are necessary here:

-- ``-i`` or ``--inputfile`` to select an input list to read links from
-- ``-o`` or ``--outputdir`` to define a directory to eventually store the results
+- ``-i`` or ``--input-file`` to select an input list to read links from
+- ``-o`` or ``--output-dir`` to define a directory to eventually store the results

 The input list will be read sequentially, and only lines beginning with a valid URL will be read; any other information contained in the file will be discarded.
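Taken together, the renamed long options compose into the workflow described in ``docs/tutorial0.rst`` and ``docs/tutorial1.rst``. A minimal sketch using only the file and directory names that appear above (``mylist.txt``, ``download/``, ``corpus/``); adjust paths as needed::

    # two-step: download with wget, then process the archived HTML files
    $ wget --directory-prefix=download/ --wait 5 --input-file=mylist.txt
    $ trafilatura --input-dir download/ --output-dir corpus/ --xmltei --no-comments
    # one-step: let trafilatura download and process the URL list itself
    $ trafilatura --input-file mylist.txt --output-dir corpus/ --xmltei --no-comments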