In my use case, indexing short bits of text with "annif index" is rather clumsy: I have to produce thousands of short text files, process them, and read the output text files back into some database. By short bits of text I mean anything from one word to sixty words at most; my own texts are under ten words. In this case the required number of keywords is very small, I would say four at most, and "annif index" already lets you specify the number of keywords.
The input .csv file would have one line per text, with an identifier (either a code or a URI) and the text. The output .csv would have this data plus the vocabulary URIs, the similarity score and optionally the vocabulary labels. The output file name could be a versioned form of the project_id or a name you define yourself. There is one output structure that is reasonably easy to reuse:
Add one row to the data file per suggested vocabulary URI (with its score and label) and copy the original input onto each of these rows. The columns would be: ID, the original identifier; text, the original text to index; vocab_uri, the URI from the vocabulary; label, the label of this URI; and score, the similarity or confidence score. The input would then have consisted of the first two columns only, in a single row.
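To make the one-row-per-suggestion layout concrete, here is a minimal Python sketch. It is not Annif code: `fake_suggest` is a hypothetical stand-in for whatever backend produces the suggestions, the example.org URIs are invented, and the header names simply follow the columns described above.

```python
import csv
import io


def fake_suggest(text, limit=4):
    """Hypothetical stand-in for an indexing backend; returns canned suggestions."""
    canned = [
        {"uri": "http://example.org/vocab/c1", "label": "concept one", "score": 0.9},
        {"uri": "http://example.org/vocab/c2", "label": "concept two", "score": 0.5},
    ]
    return canned[:limit]


def index_csv(infile, outfile, suggest, limit=4):
    """Read (ID, text) rows and write one output row per suggestion."""
    reader = csv.reader(infile)
    writer = csv.writer(outfile)
    writer.writerow(["ID", "text", "vocab_uri", "label", "score"])
    for row in reader:
        if not row:
            continue  # skip blank lines
        doc_id, text = row[0], row[1]
        # The original ID and text are repeated on every suggestion row.
        for s in suggest(text, limit=limit):
            writer.writerow([doc_id, text, s["uri"], s["label"], s["score"]])


# Two input texts, in-memory files instead of real .csv files.
source = io.StringIO('doc1,"short text, with a comma"\ndoc2,another short text\n')
result = io.StringIO()
index_csv(source, result, fake_suggest, limit=4)
rows = list(csv.reader(io.StringIO(result.getvalue())))
```

Because every row carries its own ID, URI, label and score, the result can be loaded into a database table without any further unpacking.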
Other output structures, such as adding as many columns as there are URIs and labels, introduce problems with interpretation (which labels belong to which URIs?) and processing (you have to find out how many result columns there are and what they mean). In my view this structure is only adequate if you just want URIs, without labels and scores.
You could also put the URIs in one column, comma-separated, but this adds an extra hurdle when processing the result file. It might be a more acceptable alternative if you are satisfied with just the URIs. Another column holding all the scores would be possible, following the order of the URIs, but again this requires extra scripting effort.
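For comparison, this is roughly the extra step a consumer of the packed layout would need. It is only a sketch: the column names `vocab_uris` and `scores` are assumptions made for illustration, and it assumes no URI itself contains a comma.

```python
def unpack(row):
    """Pair up comma-separated URIs and scores from one packed result row."""
    if not row["vocab_uris"]:
        return []  # no suggestions for this text
    uris = row["vocab_uris"].split(",")
    scores = [float(s) for s in row["scores"].split(",")]
    if len(uris) != len(scores):
        # The two columns are only linked by position, so a mismatch is fatal.
        raise ValueError("URI and score columns have different lengths")
    return list(zip(uris, scores))


pairs = unpack({
    "vocab_uris": "http://example.org/vocab/c1,http://example.org/vocab/c2",
    "scores": "0.9,0.5",
})
```

The pairing depends entirely on the order of the two columns, which is the interpretation risk mentioned above.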
Thanks for the suggestion and especially your thoughts on the input and output file formats, which seem very reasonable. Right now I cannot promise anything about implementation, but I don't see this as very complicated; it's just a question of priorities. Also, it would be helpful to know if others are in the same situation and would find this useful. We don't actually use the annif index command in its current form at all, as it's more straightforward to use the REST API for bulk indexing.