Juho Inkinen edited this page Jan 23, 2023 · 10 revisions

The tfidf backend implements a baseline algorithm for automated subject indexing. The idea is to count the frequencies of terms (words) used in documents about each subject, weight those term frequencies with the TF-IDF algorithm so that rare words count for more than frequently occurring ones, and build an index for matching the term frequencies of new documents against those of specific subjects. The implementation is based on the topic modelling library Gensim.
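
The idea above can be illustrated with a minimal, standard-library-only sketch. This is not Annif's actual implementation: the function names (tokenize, build_index, suggest), the whitespace tokenizer, the log(N/df) + 1 weighting and the use of cosine similarity are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (a real analyzer would also stem)."""
    return text.lower().split()


def build_index(training_docs):
    """Build one TF-IDF vector per subject from (text, subjects) pairs.

    Hypothetical helper for illustration only, not part of Annif.
    """
    # Count term frequencies over all documents labelled with each subject.
    term_counts = defaultdict(Counter)
    for text, subjects in training_docs:
        for subject in subjects:
            term_counts[subject].update(tokenize(text))

    # Count in how many subjects each term occurs.
    subject_freq = Counter()
    for counts in term_counts.values():
        subject_freq.update(counts.keys())

    # Assumed weighting: terms occurring under few subjects get a higher weight.
    n_subjects = len(term_counts)
    idf = {t: math.log(n_subjects / df) + 1.0 for t, df in subject_freq.items()}

    index = {
        subject: {t: tf * idf[t] for t, tf in counts.items()}
        for subject, counts in term_counts.items()
    }
    return index, idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def suggest(text, index, idf, limit=10):
    """Return up to `limit` (subject, score) pairs, best match first."""
    counts = Counter(tokenize(text))
    # Terms never seen during training carry no weight and are dropped.
    vector = {t: tf * idf[t] for t, tf in counts.items() if t in idf}
    scored = sorted(
        ((cosine(vector, subject_vec), subject) for subject, subject_vec in index.items()),
        reverse=True,
    )
    return [(subject, score) for score, subject in scored[:limit] if score > 0.0]
```

Calling build_index on a list of (text, subjects) pairs and then suggest on the text of a new document returns the subjects whose weighted term profiles are closest to that document. The limit parameter here plays a role similar to the limit setting in the project configuration.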

The TF-IDF backend is easy to get started with because it doesn't require any algorithm-specific configuration.

See also the Annif-tutorial exercise about the TFIDF project.

Example configuration

[tfidf-en]
name=TF-IDF English
language=en
backend=tfidf
analyzer=snowball(english)
limit=100
vocab=yso

Usage

Load a vocabulary:

annif load-vocab yso /path/to/Annif-corpora/vocab/yso-skos.ttl

Train the model:

annif train tfidf-en /path/to/Annif-corpora/training/yso-finna-en.tsv.gz

Test the model with a single document:

cat document.txt | annif suggest tfidf-en

Evaluate a directory full of files in fulltext document corpus format:

annif eval tfidf-en /path/to/documents/