update docs (#437)
* update docs

* add change in tutorial (#439)

* update requirements
adbar authored Nov 8, 2023
1 parent 817da5d commit 0ebbbcb
Showing 7 changed files with 12 additions and 9 deletions.
2 changes: 1 addition & 1 deletion docs/crawls.rst
@@ -114,7 +114,7 @@ On the CLI the crawler automatically works its way through a website, stopping a
$ trafilatura --crawl "https://www.example.org" > links.txt
- It can also crawl websites in parallel by reading a list of target sites from a list (``-i``/``--inputfile`` option).
+ It can also crawl websites in parallel by reading a list of target sites from a list (``-i``/``--input-file`` option).

.. note::
The ``--list`` option does not apply here. Unlike with the ``--sitemap`` or ``--feed`` options, the URLs are simply returned as a list instead of being retrieved and processed. This happens in order to give a chance to examine the collected URLs prior to further downloads.
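
A minimal sketch of the parallel case (the hypothetical ``sites.txt`` holds one homepage per line; combining ``--crawl`` with ``--input-file`` is inferred from the sentence above rather than a tested invocation):

$ trafilatura --crawl --input-file sites.txt > links.txt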
3 changes: 3 additions & 0 deletions docs/installation.rst
@@ -61,6 +61,9 @@ This project is under active development, please make sure you keep it up-to-date
On **Mac OS** it can be necessary to install certificates by hand if you get errors like ``[SSL: CERTIFICATE_VERIFY_FAILED]`` while downloading webpages: execute ``pip install certifi`` and perform the post-installation step by clicking on ``/Applications/Python 3.X/Install Certificates.command``. For more information see this `help page on SSL errors <https://stackoverflow.com/questions/27835619/urllib-and-ssl-certificate-verify-failed-error/42334357>`_.

+ .. hint::
+     Installation on MacOS is generally easier with `brew <https://formulae.brew.sh/formula/trafilatura>`_.
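
In practice this comes down to one command (a minimal sketch; it assumes Homebrew itself is already installed and uses the formula linked above):

$ brew install trafilatura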


Older Python versions
~~~~~~~~~~~~~~~~~~~~~
2 changes: 1 addition & 1 deletion docs/requirements.txt
@@ -1,6 +1,6 @@
# with version specifier
sphinx>=7.2.6
- pydata-sphinx-theme>=0.14.1
+ pydata-sphinx-theme>=0.14.3
docutils>=0.20.1
# without version specifier
trafilatura
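
These pins are consumed when building the documentation; a typical invocation (standard pip usage, assuming it is run from the repository root) would be:

$ pip install -r docs/requirements.txt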
2 changes: 1 addition & 1 deletion docs/tutorial-dwds.rst
@@ -114,7 +114,7 @@ This link list can first be filtered in order to select German-language, content-rich

*Trafilatura* produces output in two ways: the extracted texts (TXT format) in the ``ausgabe`` directory, and a copy of the downloaded web pages under ``html-quellen`` (for archiving and, if needed, reprocessing):

- ``trafilatura --inputfile linkliste.txt --outputdir ausgabe/ --backup-dir html-quellen/``
+ ``trafilatura --input-file linkliste.txt --outputdir ausgabe/ --backup-dir html-quellen/``

This outputs TXT files without metadata. Adding ``--csv``, ``--json``, ``--xml`` or ``--xmltei`` includes metadata and selects the corresponding output format. Further options are available; see the relevant documentation pages.
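
For instance, to include metadata and obtain XML output, a single added flag suffices (a sketch reusing the file and directory names from the command above):

$ trafilatura --input-file linkliste.txt --outputdir ausgabe/ --backup-dir html-quellen/ --xml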

2 changes: 1 addition & 1 deletion docs/tutorial-epsilla.rst
@@ -19,7 +19,7 @@ Text embedding involves converting text into numerical vectors, and is commonly
- Anomaly detection (identify outliers)

In this tutorial, we will show you how to perform text embedding on results from Trafilatura. We will use
- `Epsilla <https://www.epsilla.com/?ref=trafilatura>`_, an open source vector database for storing and searching vector embeddings. It is 10x faster than regular relational databases for vector operations.
+ `Epsilla <https://www.epsilla.com/?ref=trafilatura>`_, an open source vector database for storing and searching vector embeddings. It is 10x faster than regular vector databases for vector operations.

.. note::
For a hands-on version of this tutorial, try out the `Colab Notebook <https://colab.research.google.com/drive/1eFHO0dHyPhEF9Sm_HXcMFmJZnvP9a-aX?usp=sharing>`_.
6 changes: 3 additions & 3 deletions docs/tutorial0.rst
@@ -171,8 +171,8 @@ Seamless download and processing

Two major command line arguments are necessary here:

- - ``-i`` or ``--inputfile`` to select an input list to read links from
- - ``-o`` or ``--outputdir`` to define a directory to eventually store the results
+ - ``-i`` or ``--input-file`` to select an input list to read links from
+ - ``-o`` or ``--output-dir`` to define a directory to eventually store the results

An additional argument can be useful in this context:

@@ -213,6 +213,6 @@ Alternatively, you can download a series of web documents with generic command-line
# download if necessary
$ wget --directory-prefix=download/ --wait 5 --input-file=mylist.txt
# process a directory with archived HTML files
- $ trafilatura --inputdir download/ --outputdir corpus/ --xmltei --nocomments
+ $ trafilatura --input-dir download/ --output-dir corpus/ --xmltei --no-comments
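
The two renamed arguments from the list above combine the same way for direct downloads (a minimal sketch; ``mylist.txt`` and ``corpus/`` are the example names used in this tutorial, and ``--xml`` stands in for any of the format flags mentioned in these docs):

$ trafilatura --input-file mylist.txt --output-dir corpus/ --xml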
4 changes: 2 additions & 2 deletions docs/tutorial1.rst
@@ -26,8 +26,8 @@ For the collection and filtering of links see `this tutorial <tutorial0.html>`_

Two major options are necessary here:

- - ``-i`` or ``--inputfile`` to select an input list to read links from
- - ``-o`` or ``--outputdir`` to define a directory to eventually store the results
+ - ``-i`` or ``--input-file`` to select an input list to read links from
+ - ``-o`` or ``--output-dir`` to define a directory to eventually store the results

The input list will be read sequentially, and only lines beginning with a valid URL will be read; any other information contained in the file will be discarded.
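
As an illustration, a hypothetical ``mylist.txt`` (only the two URL lines would be processed; the middle line is discarded because it does not begin with a valid URL):

https://www.example.org/page-one
note to self: check this site later
https://www.example.org/page-two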

