BERT Backend #625
Hello @lunactic, thank you for the suggestion! There is already some work being done to integrate Annif with language models, mainly by integrating the XTransformer model from PECOS in PR #540 by @mo-fu. But what you propose seems somewhat different. The idea of semantic search is not new; in fact, it is already implemented in the simplest Annif backend.

Do you have any idea how accurate this kind of model could be, for example for Dewey classification? Did you compare it with other approaches? I've had quite good results on DDC classification with SVC and Omikuji Bonsai, which both achieve pretty similar accuracies. If your approach (which would undoubtedly be far more resource-intensive) were more accurate than this "baseline", that would be interesting and would support the idea of integrating it with Annif.

As I understand it, XTransformer is specifically tailored for extreme multi-label classification problems, which are typically very challenging because of large vocabularies (many classes/labels), big training corpora with skewed distributions, etc. You may want to look at that as well: the PR is already usable, and the documentation for how to use it can be found in the comments on GitHub.
Just wanted to add some reading material for semantic search on dense word vectors:
As mentioned by @osma, this does not yet handle the label distribution issue of XML problems, but it can probably be combined with the clustering techniques in Parabel/Bonsai. The Omikuji library even has the option to learn only the label tree.
Hello
I am currently working at the Swiss National Library, experimenting with Annif for the automatic generation of Dewey numbers.
In that process I started experimenting with BERT approaches as explained here: https://www.sbert.net/examples/applications/semantic-search/README.html#semantic-search
First tests indicate that this approach could work very well. Would this be interesting to the wider Annif community? If so, I could check whether I can find the time to create a PR that implements this as a backend for Annif.
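For reference, here is a minimal sketch of what such a backend could do at suggestion time, following the semantic search recipe from the linked SBERT documentation. The model name, the toy corpus and `top_k` are illustrative assumptions, not a finished design:

```python
from sentence_transformers import SentenceTransformer, util

# Illustrative assumption: any SBERT model could be used here.
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

# Toy training documents and their Dewey numbers.
corpus_texts = ["Text of training document 1", "Text of training document 2"]
corpus_labels = ["004.6", "025.4"]

# Embed the training corpus once; a real backend would persist these (see below).
corpus_embeddings = model.encode(corpus_texts, convert_to_tensor=True)

# At suggestion time, embed the incoming document and run semantic search.
query_embedding = model.encode("Text of a new, unseen document", convert_to_tensor=True)
hits = util.semantic_search(query_embedding, corpus_embeddings, top_k=2)[0]

# Map the nearest training documents back to their Dewey numbers.
for hit in hits:
    print(corpus_labels[hit["corpus_id"]], hit["score"])
```

The nearest-neighbour scores would of course still need to be aggregated per subject before they could be returned as Annif suggestions.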
The approach I would follow is to create the embeddings for the training corpus when `annif train` is used and store them as a pickle file for later use. The methodology would also allow for "retraining", meaning embeddings of new documents could simply be appended to the existing training corpus.
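A rough sketch of how that pickle-based storage and appending could look; the file layout, function name and model choice are my assumptions, not an actual implementation:

```python
import os
import pickle

import numpy as np
from sentence_transformers import SentenceTransformer


def add_documents(texts, labels, path="embeddings.pkl",
                  model_name="paraphrase-multilingual-MiniLM-L12-v2"):
    """Encode new training documents and append them to the stored corpus.

    Called once for the initial training corpus; calling it again with
    further documents implements the "retraining" by appending.
    """
    model = SentenceTransformer(model_name)
    new_embeddings = model.encode(texts, convert_to_numpy=True)

    if os.path.exists(path):
        with open(path, "rb") as f:
            store = pickle.load(f)
        store["embeddings"] = np.vstack([store["embeddings"], new_embeddings])
        store["labels"].extend(labels)
    else:
        store = {"embeddings": new_embeddings, "labels": list(labels)}

    with open(path, "wb") as f:
        pickle.dump(store, f)
```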