Use as source for "annif index" one file (.csv, .tsv) #639

hekl · 2022-11-01T15:49:28Z

In my use case, indexing short bits of text with "annif index", it is rather clumsy to produce thousands of short text files, process them, and then read the output text files back into some database. By short bits of text I mean anything from one word to sixty words at most; mine are actually under ten words. In this case the required number of keywords would be very small, I would say four at most. Annif index already lets you specify the number of keywords.
The input .csv file would have one line per text, with an identifier (either a code or a URI) and the text. The output .csv would have this data plus the URIs of the vocabulary, the similarity score and optionally the labels of the vocabulary. The output file name could be a versioned form of the project_id or a name you define yourself. There is one output structure that is reasonably easy to reuse:

Write one row per suggested vocabulary URI (with its score and label) and copy the original input onto each of these rows. Here id is the original identifier, text is the original text to index, vocab_uri is the URI of the vocabulary concept, vocab_label is the label of that URI and score is the similarity or confidence score. The input for the example below would have been a single row containing only the first two columns.

id;text;vocab_uri;vocab_label;score
D003018;prices of consumer products rising high;https://vocab/id/1010;prices;1.0
D003018;prices of consumer products rising high;https://vocab/id/1013;consumer products;1.0
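
The row-per-suggestion layout above can be sketched in a few lines of Python. This is only an illustration of the proposed format, not an existing Annif feature: `index_csv`, `expand_rows` and `fake_suggest` are hypothetical names, the suggestion backend is a stand-in that returns the two concepts from the example, and the input is assumed to be semicolon-separated with no header row.

```python
import csv
import io


def expand_rows(records, suggest, limit=4):
    """Yield one output row per (input record, suggestion) pair."""
    for row in records:
        if len(row) < 2:
            continue  # skip blank or malformed lines
        rec_id, text = row[0], row[1]
        for hit in suggest(text)[:limit]:
            yield [rec_id, text, hit["uri"], hit["label"], hit["score"]]


def index_csv(infile, outfile, suggest, limit=4):
    """Read id;text rows and write the long-format result described above."""
    reader = csv.reader(infile, delimiter=";")
    writer = csv.writer(outfile, delimiter=";", lineterminator="\n")
    writer.writerow(["id", "text", "vocab_uri", "vocab_label", "score"])
    writer.writerows(expand_rows(reader, suggest, limit))


def fake_suggest(text):
    """Hypothetical stand-in for a real suggestion backend."""
    return [
        {"uri": "https://vocab/id/1010", "label": "prices", "score": 1.0},
        {"uri": "https://vocab/id/1013", "label": "consumer products", "score": 1.0},
    ]


source = io.StringIO("D003018;prices of consumer products rising high\n")
result = io.StringIO()
index_csv(source, result, fake_suggest)
print(result.getvalue())
```

Because every row is self-describing, the result can be loaded into a database table without knowing in advance how many suggestions each text received.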

Other output structures, such as adding one column per suggested URI and label, introduce problems with interpretation (which label belongs to which URI?) and with processing (you have to find out how many result columns there are and what they mean). In my view this structure is only adequate if you just want the URIs, without labels and scores.

You could also put all the URIs in one column, comma separated, but this makes the result file harder to process. It might be an acceptable alternative when you are satisfied with just the URIs. A second column holding all the scores, in the same order as the URIs, would also be possible. Again, this requires extra scripting to process.
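
To make the extra processing effort concrete, here is a minimal sketch of what a consumer of the packed format would have to do. The function name `unpack_suggestions` and the two-column layout (one cell of comma-separated URIs, one cell of comma-separated scores) are assumptions for illustration only.

```python
def unpack_suggestions(uri_cell, score_cell, sep=","):
    """Split packed URI and score cells and pair them up by position."""
    if not uri_cell:
        return []
    uris = uri_cell.split(sep)
    scores = [float(s) for s in score_cell.split(sep)]
    if len(uris) != len(scores):
        # The pairing is purely positional, so a mismatch is unrecoverable.
        raise ValueError("URI and score columns have different lengths")
    return list(zip(uris, scores))


pairs = unpack_suggestions(
    "https://vocab/id/1010,https://vocab/id/1013",
    "1.0,1.0",
)
print(pairs)
```

Note that a comma is a legal character inside a URI, so a plain split like this could break on some vocabularies; the row-per-suggestion layout avoids that problem.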

osma · 2022-11-02T08:18:25Z

Thanks for the suggestion and especially your thoughts on the input and output file formats, which seem very reasonable. Right now I cannot promise anything about implementation, but I don't see this as very complicated; it's just a question of priorities. Also, it would be helpful to know if others are in the same situation and would find this useful. We don't actually use the annif index command in its current form at all, as it's more straightforward to use the REST API for bulk indexing.
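
For readers who want to try the REST route in the meantime, a rough sketch using only the Python standard library follows. It assumes the `suggest` method of the Annif REST API (a form-encoded POST with `text` and `limit`, answered with a JSON object holding a `results` list of `uri`, `label` and `score`); the base URL and project id are placeholders, and the helper names are made up for this example.

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "http://localhost:5000/v1"  # assumption: a locally running Annif server
PROJECT_ID = "my-project"              # hypothetical project id


def build_suggest_request(text, limit=4, base_url=BASE_URL, project_id=PROJECT_ID):
    """Build a form-encoded POST request for one short text."""
    data = urllib.parse.urlencode({"text": text, "limit": limit}).encode("utf-8")
    url = f"{base_url}/projects/{project_id}/suggest"
    return urllib.request.Request(url, data=data, method="POST")


def parse_suggestions(body):
    """Extract (uri, label, score) tuples from a JSON response body."""
    return [(r["uri"], r["label"], r["score"]) for r in json.loads(body)["results"]]


def suggest(text, limit=4):
    """Send one text to the server and return its suggestions."""
    with urllib.request.urlopen(build_suggest_request(text, limit)) as resp:
        return parse_suggestions(resp.read())
```

Calling `suggest(text)` once per input row and writing the returned tuples next to the row's identifier gives a result similar to the CSV layout proposed above, without creating intermediate text files.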

osma added the enhancement label Nov 2, 2022
