update docs (#437)
* update docs

* add change in tutorial (#439)

* update requirements
adbar authored Nov 8, 2023
1 parent 817da5d commit 0ebbbcb
Showing 7 changed files with 12 additions and 9 deletions.
2 changes: 1 addition & 1 deletion docs/crawls.rst
@@ -114,7 +114,7 @@ On the CLI the crawler automatically works its way through a website, stopping a
$ trafilatura --crawl "https://www.example.org" > links.txt
- It can also crawl websites in parallel by reading a list of target sites from a list (``-i``/``--inputfile`` option).
+ It can also crawl websites in parallel by reading a list of target sites from a list (``-i``/``--input-file`` option).

.. note::
The ``--list`` option does not apply here. Unlike with the ``--sitemap`` or ``--feed`` options, the URLs are simply returned as a list instead of being retrieved and processed. This happens in order to give a chance to examine the collected URLs prior to further downloads.
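
A minimal sketch of the parallel case (the hypothetical ``sites.txt`` holds one homepage per line; combining ``--crawl`` with ``--input-file`` is inferred from the sentence above rather than a tested invocation):

$ trafilatura --crawl --input-file sites.txt > links.txt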
3 changes: 3 additions & 0 deletions docs/installation.rst
@@ -61,6 +61,9 @@ This project is under active development, please make sure you keep it up-to-date
On **Mac OS** it can be necessary to install certificates by hand if you get errors like ``[SSL: CERTIFICATE_VERIFY_FAILED]`` while downloading webpages: execute ``pip install certifi`` and perform the post-installation step by clicking on ``/Applications/Python 3.X/Install Certificates.command``. For more information see this `help page on SSL errors <https://stackoverflow.com/questions/27835619/urllib-and-ssl-certificate-verify-failed-error/42334357>`_.

+ .. hint::
+     Installation on MacOS is generally easier with `brew <https://formulae.brew.sh/formula/trafilatura>`_.
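
In practice this comes down to one command (a minimal sketch; it assumes Homebrew itself is already installed and uses the formula linked above):

$ brew install trafilatura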


Older Python versions
~~~~~~~~~~~~~~~~~~~~~
2 changes: 1 addition & 1 deletion docs/requirements.txt
@@ -1,6 +1,6 @@
# with version specifier
sphinx>=7.2.6
- pydata-sphinx-theme>=0.14.1
+ pydata-sphinx-theme>=0.14.3
docutils>=0.20.1
# without version specifier
trafilatura
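
These pins are consumed when building the documentation; a typical invocation (standard pip usage, assuming it is run from the repository root) would be:

$ pip install -r docs/requirements.txt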
2 changes: 1 addition & 1 deletion docs/tutorial-dwds.rst
@@ -114,7 +114,7 @@ This link list can first be filtered in order to select German-language, content-rich

*Trafilatura* produces output in two ways: the extracted texts (TXT format) in the ``ausgabe`` directory, and a copy of the downloaded web pages under ``html-quellen`` (for archiving and, if needed, reprocessing):

- ``trafilatura --inputfile linkliste.txt --outputdir ausgabe/ --backup-dir html-quellen/``
+ ``trafilatura --input-file linkliste.txt --outputdir ausgabe/ --backup-dir html-quellen/``

This outputs TXT files without metadata. Adding ``--csv``, ``--json``, ``--xml`` or ``--xmltei`` includes metadata and selects the corresponding output format. Further options are available; see the relevant documentation pages.
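
For instance, to include metadata and obtain XML output, a single added flag suffices (a sketch reusing the file and directory names from the command above):

$ trafilatura --input-file linkliste.txt --outputdir ausgabe/ --backup-dir html-quellen/ --xml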

2 changes: 1 addition & 1 deletion docs/tutorial-epsilla.rst
@@ -19,7 +19,7 @@ Text embedding involves converting text into numerical vectors, and is commonly
- Anomaly detection (identify outliers)

In this tutorial, we will show you how to perform text embedding on results from Trafilatura. We will use
- `Epsilla <https://www.epsilla.com/?ref=trafilatura>`_, an open source vector database for storing and searching vector embeddings. It is 10x faster than regular relational databases for vector operations.
+ `Epsilla <https://www.epsilla.com/?ref=trafilatura>`_, an open source vector database for storing and searching vector embeddings. It is 10x faster than regular vector databases for vector operations.

.. note::
For a hands-on version of this tutorial, try out the `Colab Notebook <https://colab.research.google.com/drive/1eFHO0dHyPhEF9Sm_HXcMFmJZnvP9a-aX?usp=sharing>`_.
6 changes: 3 additions & 3 deletions docs/tutorial0.rst
@@ -171,8 +171,8 @@ Seamless download and processing

Two major command line arguments are necessary here:

- - ``-i`` or ``--inputfile`` to select an input list to read links from
- - ``-o`` or ``--outputdir`` to define a directory to eventually store the results
+ - ``-i`` or ``--input-file`` to select an input list to read links from
+ - ``-o`` or ``--output-dir`` to define a directory to eventually store the results

An additional argument can be useful in this context:

@@ -213,6 +213,6 @@ Alternatively, you can download a series of web documents with generic command-line
# download if necessary
$ wget --directory-prefix=download/ --wait 5 --input-file=mylist.txt
# process a directory with archived HTML files
- $ trafilatura --inputdir download/ --outputdir corpus/ --xmltei --nocomments
+ $ trafilatura --input-dir download/ --output-dir corpus/ --xmltei --no-comments
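
The two renamed arguments from the list above combine the same way for direct downloads (a minimal sketch; ``mylist.txt`` and ``corpus/`` are the example names used in this tutorial, and ``--xml`` stands in for any of the format flags mentioned in these docs):

$ trafilatura --input-file mylist.txt --output-dir corpus/ --xml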
4 changes: 2 additions & 2 deletions docs/tutorial1.rst
@@ -26,8 +26,8 @@ For the collection and filtering of links see `this tutorial <tutorial0.html>`_

Two major options are necessary here:

- - ``-i`` or ``--inputfile`` to select an input list to read links from
- - ``-o`` or ``--outputdir`` to define a directory to eventually store the results
+ - ``-i`` or ``--input-file`` to select an input list to read links from
+ - ``-o`` or ``--output-dir`` to define a directory to eventually store the results

The input list will be read sequentially, and only lines beginning with a valid URL will be read; any other information contained in the file will be discarded.
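
As an illustration, a hypothetical ``mylist.txt`` (only the two URL lines would be processed; the middle line is discarded because it does not begin with a valid URL):

https://www.example.org/page-one
note to self: check this site later
https://www.example.org/page-two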

