Releases: NatLibFi/Annif
Annif 1.2
This release introduces language detection capabilities in the REST API and CLI, improves 🤗 Hugging Face Hub integration, and also includes the usual maintenance work and minor bug fixes.
The new REST API endpoint /v1/detect-language expects POST requests that contain a JSON object with the text whose language is to be analyzed and a list of candidate languages. Similarly, the CLI has a new command annif detect-language. Annif projects are typically language specific, so a text in a given language needs to be processed with a project intended for that language; the language detection feature can help with this. For details see this Wiki page. The language detection is performed with the Simplemma library by @adbar et al.
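As a rough illustration, a client might build such a request as follows. This is only a sketch: the host http://localhost:5000 and the JSON field names "text" and "languages" are assumptions not stated in these notes, so check the OpenAPI description of your own Annif instance before relying on them.

```python
import json
import urllib.request

# Assumption: a local Annif instance; adjust to your deployment.
ANNIF_URL = "http://localhost:5000/v1/detect-language"


def build_detect_language_request(text, candidates, url=ANNIF_URL):
    """Build (but do not send) a POST request for language detection."""
    # Assumption: the field names "text" and "languages" are illustrative.
    payload = {"text": text, "languages": list(candidates)}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_detect_language_request("Kissa istuu matolla.", ["fi", "sv", "en"])

# To actually send it (requires a running server):
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```

The request is built separately from sending it so that the payload can be inspected without a running server.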
The annif download command has a new --trust-repo option, which needs to be used if the repository to download from has not been used previously (that is, if the repository does not appear in the local Hugging Face Hub cache). This option is introduced to raise awareness of the risks of downloading projects from the internet; projects should only be downloaded from trusted sources. For more information see the Hugging Face Hub documentation.
This release also automates downloading the NLTK datapackage used for tokenization, which simplifies Annif installation. Maintenance tasks include upgrading dependencies, including a new version of Simplemma that allows better control over memory usage. The bug fixes include restoring the --host option of the annif run command.
Python 3.12 is now fully supported (previously NN-ensemble and STWFSA backends were not supported on Python 3.12).
Supported Python versions:
- 3.9, 3.10, 3.11 and 3.12
Backward compatibility:
- NN ensemble projects trained with Annif v1.1 or older need to be retrained.
- For other projects, the warnings by SciKit-learn are harmless.
Enhancements
#659/#799/#800/#801/#802 Language detection in REST API and CLI
#779 Python 3.12 support
#790/#793 Automatically add metadata to Hugging Face Hub repos when uploading projects
#809 Make field widths variable in the projects list of the Hugging Face Hub Model Card
#803 Automate NLTK datapackage punkt_tab download
#807 Add --trust-repo option to download CLI command
Maintenance
#724 Upgrade Simplemma & limit its memory usage
#796 Update dependencies for 1.2 release
#797/#811 Bump the github-actions versions
#805 Upgrade Docker baseimage to Python 3.12
Bug fixes
#788 Add --host option to annif run (credit: @dwinston)
#792 Fix limit parameter not passed to requests by HTTP backend
#808 Fix missing Hugging Face Hub token from preupload_lfs_files() parameters
Annif 1.1
This release introduces CLI commands to share projects via Hugging Face Hub, takes care of various maintenance tasks and fixes minor bugs.
The 🤗 Hugging Face Hub intends to facilitate the sharing of AI models and datasets, and now Annif CLI includes upload and download commands, which can be used to push and pull a set of selected projects and vocabularies to and from a Hugging Face Hub repository. In this release these commands are regarded as experimental; they may change in the future. See this Wiki page for more information about the commands. See also this Hugging Face Hub collection which contains the projects served at Finto AI.
Connexion dependency is upgraded to Connexion 3. From now on, when running Annif with Gunicorn, it is required to use Uvicorn workers; the workers can be set using the option --worker-class uvicorn.workers.UvicornWorker, see Connexion 3 documentation for more details. However, Docker image users do not have to add this option because an environment variable in the Docker image sets the worker-class. Two changes due to the upgrade to Connexion 3 relate to the REST API:
- the header Access-Control-Allow-Origin: * is now included in the response only if there's an Origin header in the request, whereas previously that header was sent even if the Origin header was not present in the request,
- the URL /v1/projects/ used to give a 404 response, but now it redirects to the correct URL /v1/projects.
Support for Python 3.8 is removed. Python 3.12 is supported except for NN-ensemble and STWFSA backends.
It is now possible to select the projects that Annif loads on startup using the environment variable ANNIF_PROJECTS_INIT, which can be useful in container environments as this allows distributing resource demand across multiple Annif processes.
Supported Python versions
- 3.9, 3.10 and 3.11 are fully supported
- 3.12 is supported except NN-ensemble and STWFSA backends
Backward compatibility
- NN ensemble projects trained with Annif v1.0 or older need to be retrained; for other projects the warnings by SciKit-learn are harmless
- When using Annif with Gunicorn HTTP server the worker class needs to be set to Uvicorn with the option --worker-class uvicorn.workers.UvicornWorker
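The worker class can also be set in a Gunicorn configuration file rather than on the command line. A minimal sketch, assuming a file named gunicorn.conf.py in the working directory (Gunicorn configuration files are plain Python); the bind address and worker count are illustrative values, not Annif recommendations.

```python
# gunicorn.conf.py
# Equivalent to passing --worker-class uvicorn.workers.UvicornWorker
worker_class = "uvicorn.workers.UvicornWorker"

# Illustrative values only (assumptions, not taken from the release notes)
bind = "0.0.0.0:5000"
workers = 2
```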
Enhancements
#762/#760 Implement annif upload and annif download commands for Hugging Face Hub integration
#774/#733 Allow loading selected projects using environment variable
#736 Optimization: load a vocabulary only once even if used in different languages
#745 Show Annif version in WebUI
#751 Create SECURITY.md
Maintenance
#702/#689/#698 Upgrade to Connexion3
#780 Add partial Python 3.12 support
#770 Drop Python 3.8 support
#771/#786 Update dependencies for v1.1 release
#739 Harden GitHub Actions
#781 Make Dependabot group GitHub Actions updates into one PR
#740-#744/#750/#757/#758/#763-#766/#783 Upgrade GitHub Actions
Bug fixes
#784/#785 Add informational error message for failed loading of nn-ensemble model
#732 Fix: Add missing completion command to commands list in RTD
#773 Fix blocked http-request for version number on https site
#778 Fix project data files detection
#752 Fix tests error due to pinned Schemathesis version 3.19.* / Docker rebuild
#759 Fix installation on Python 3.8 due to missing Tensorflow-io wheel
#767 Fix tests and Docker rebuild due to defunct Schemathesis and pytest dependencies resolution
#768 Fix ReadTheDocs builds by upgrading docs build dependencies
Annif 1.0.2
Annif 1.0.1
This is a patch release that fixes a bug that arose after the Annif 1.0 release.
The bug affected only running unit tests, but the side-effect was that it also prevented rebuilding the Docker image of version 1.0.
Bugs fixed:
#747/#752 Tests error due to pinned schemathesis version 3.19.* / Docker rebuild fails
Annif 1.0
We are excited to introduce Annif version 1.0!
Advancing the version number to the 1.x series means that Annif is considered ready for more general, production use. The upcoming releases in the series (patches 1.0.x and minor feature releases 1.x.x) will be backward compatible, following the semantic versioning principle. See a Wiki page describing the aspects of the compatibility.
The changes in this release include enhancements to the command-line interface as well as many bug fixes and maintenance updates. The CLI commands, options and most parameters can now be tab-completed when the support is enabled: see instructions in README.md. Also the CLI startup time has been optimized, and the output of many commands has been refined.
Python 3.11 is now mostly supported; the Omikuji backend cannot yet be used on Python 3.11 because the Omikuji library does not support it at the moment.
From now on the Docker image of the latest release in the quay.io repository is going to be rebuilt from time to time in order to apply security updates to the image. The rebuilds will not change Annif itself. Version tags (<major>.<minor>[.<patch>]) can be used to reference the latest build of the version. To allow more strict pinning to a particular build, the images will also be tagged with the build date as a suffix: <major>.<minor>.<patch>-<YYYYMMDD>.
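The date-suffixed tag format described above can be sketched in a few lines. The helper name and the build date used here are hypothetical and serve only to show how the two parts of the tag combine.

```python
from datetime import date


def build_pinned_tag(version, build_date):
    """Return a tag of the form <major>.<minor>.<patch>-<YYYYMMDD>."""
    return f"{version}-{build_date:%Y%m%d}"


# Hypothetical build date, for illustration only
tag = build_pinned_tag("1.0.0", date(2023, 9, 1))
print(tag)  # 1.0.0-20230901
```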
Supported Python versions:
- 3.8, 3.9 and 3.10 are fully supported
- 3.11 is supported except Omikuji backend
Backward compatibility:
- MLLM, STWFSA and NN ensemble projects trained with Annif v0.61 or older need to be retrained; for other projects the warnings by SciKit-learn are harmless
- Using STWFSA backend now requires installing an optional dependency
New features:
#684/#693 Support for CLI command completions
#703/#727 Python 3.11 support
Improvements:
#696 Optimize CLI startup time
#686/#694 Improve outputs of project inspection CLI commands
#704 Show scores in outputs of suggest, eval and index with only 4 decimals
Maintenance:
#690/#708 Use Python type hints
#699/#700 Make stwfsapy an optional dependency (credit: @cbartz)
#315/#712/#714 Add CI/CD job for testing Docker image
#707/#711 Ensure system packages are up-to-date in Docker image
#715 Add CI/CD workflow for rebuilding Docker image
#706/#725 Test CLI startup time with CI/CD job
#723 Update ReadTheDocs documentation
#726/#697/#532 Update and pin dependencies v1.0
#730 Switch to Keras v3 save format for nn_ensemble
#731 Upgrade Docker baseimage to Debian Bookworm
Bug fixes:
#705 Fix crashing index command when targeted directory contains subject files
#717 Fix Python version in GitHub Actions CI/CD pipeline
#718 Fix missing limit parameter in STWFSA backend
#722 Fix train state and modification time for unfinished project training
#720/#721 Suppress TensorFlow info messages to debug level
#695 Fix displaying of modification time for null value in Web UI project information
#701 Remove duplicated fasttext entry in optional dependencies list in Dockerfile
#728 Avoid PytestUnknownMarkWarning due to "slow" marker
#729 Avoid scikit-learn UserWarning for vectorizer parameter token_pattern
Other:
#616 Discussion on semantic versioning for Annif releases beyond 1.0
Annif 0.61
The main improvements in this release are internal changes to allow batch processing of documents for better suggestion performance and the streamlining of suggestion result representation by using sparse arrays. Currently batched processing of documents is implemented in the Omikuji, SVC, and all ensemble backends. Also a new REST API method for suggesting subjects for multiple documents has been added.
The new REST API method /v1/projects/{project_id}/suggest-batch accepts at most 32 documents in one POST request; the documents in the batch are processed in parallel when the backend in use supports this. The request body is given in JSON format and, as with the regular single-document suggest method, the limit, threshold and language parameters are optional and can be given as URL query parameters. For details see the interactive OpenAPI documentation of the REST API at annif.org.
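Because a single request accepts at most 32 documents, a client with a larger collection has to split it into several requests. The following sketch shows one way to do that; the body layout {"documents": [{"text": ...}]} is an assumption made for illustration, so consult the OpenAPI documentation for the actual schema.

```python
import json

MAX_BATCH_SIZE = 32  # limit stated in the release notes


def make_batches(texts, batch_size=MAX_BATCH_SIZE):
    """Yield JSON request bodies, each holding at most batch_size documents."""
    for start in range(0, len(texts), batch_size):
        chunk = texts[start:start + batch_size]
        # Assumption: this body layout is illustrative, not the confirmed schema.
        yield json.dumps({"documents": [{"text": text} for text in chunk]})


texts = [f"Document number {i}" for i in range(70)]
bodies = list(make_batches(texts))
print([len(json.loads(body)["documents"]) for body in bodies])  # [32, 32, 6]
```

Each body would then be sent in its own POST request to the suggest-batch method of the chosen project.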
The annif suggest CLI command is augmented to accept path(s) to file(s) to be processed, in addition to stdin, to enable it to operate on multiple documents. The annif optimize command is now much faster than before and supports using a --jobs parameter for parallel processing.
The Annif Docker image has been updated to use Python 3.10.
Various maintenance tasks have also been performed: for example, the default branch of the git repository has been renamed from master to main, the Schemathesis tool has been introduced for testing the REST API and many dependencies have been updated. A bug causing a memory leak in the neural network ensemble backend has been fixed.
The next release of Annif will be version 1.0. For this purpose we have opened the issue #616 for discussing the expectations of backward compatibility and Semantic Versioning in releases beyond 1.0.
Backward compatibility:
- Models trained with Annif v0.60 should remain working; the warnings by SciKit-learn are harmless
- LRAP metric has been removed from evaluation results
New features:
#664 Add REST API method /v1/projects/{project_id}/suggest-batch
#663 Support for batch suggest operations for CLI commands
#423/#681 Parallelize optimize command
Improvements:
#678/#681 Represent suggestion results as sparse arrays
#665/#669 Batch suggest in Omikuji backend
#667/#670 Batch suggest in SVC backend
#677 Batch suggest in ensemble backends
#671 Add log message indicating finishing projects initialization
#673 Suppress duplicate log messages from subject module
Maintenance:
#668 Migrate codestyle to Black v23
#679/#680 Switch default git branch to main
#672 Fix slow CI/CD runs for Python 3.10
#675 Refactor and cleanup CLI module
#682/#685 Schemathesis tests for REST API and OpenAPI schema fixes
#683 Update dependencies v0.61
#691 Upgrade Docker image to Python 3.10
Annif 0.60
This release includes improvements and maintenance updates in particular to the Web UI and REST API as well as some new functionality, especially related to multilingual support. The Web UI no longer relies on jQuery, as the last parts that were used were replaced with Axios. The REST API and Web UI updates are by @UnniKohonen, who has joined @NatLibFi as a trainee in the Annif & Finto development teams.
It is now possible to override the language for subject suggestion labels instead of always using the project language: when using the annif suggest command by giving the new --language/-L option, and when using the REST API suggest method by the new optional language parameter.
A new resource is added to the root of the REST API (i.e. http://<annif_host>/v1/) that gives basic information on the API (a title for the API and the version of Annif being used). Also, the REST API spec has been updated to OpenAPI 3.0. In the Web UI it is now possible to see detailed information about a project (language, backend type, modification timestamp etc.).
Multiprocessing support for Mac OS and Windows environments has been improved by supporting the 'spawn' multiprocessing mode.
The language detection is now performed with Simplemma instead of pycld3. This functionality is now installed by default instead of being an optional extra.
New code style tools Black and isort are now used to help maintain good code quality; see CONTRIBUTING.md for how to use them and for instructions on how best to participate in Annif development.
Many dependencies have been updated to their most recent versions.
Note also that we are preparing for Annif 1.0 release. For this purpose we have opened the issue #616 for discussing the expectations of backward compatibility and Semantic Versioning in releases beyond 1.0.
Backward compatibility:
- Models trained with Annif v0.59 should remain working; the warnings by SciKit-learn are harmless
- The annif loadvoc command has been removed, as in the previous release it was deprecated and replaced by the annif load-vocab command.
New features:
#628/#630 Allow overriding subject label language in CLI and REST suggest operations
#637/#638 Add support for spawn multiprocessing mode
#654 Add project info to web UI
#655/#658 Add REST API root resource
Improvements:
#593/#626 Use Simplemma instead of pycld3 for language detection
#643 Add CONTRIBUTING.md file
#645 Use tailored user-agent in requests by HTTP-backend
#644/#649 Upgrade REST API spec to OpenAPI 3
#627 Upgrade joblib to 1.2.x
#642 Upgrade jQuery to version 3.6.1
#646 Use axios in web UI; remove jQuery
#648 Upgrade Bootstrap to version 5.2.2
#651 Upgrade TensorFlow to version 2.11.*
#660 Update dependencies v0.60
#647/#661 Order of projects when using project configuration directory
Maintenance:
#609/#640 Use black code style
#641 Use isort to order import statements
#656 Install linting tools with Poetry in CI/CD pipeline
#624 Increase timeout of test and publish GH Actions jobs
#653 Add CodeQL workflow for GitHub code scanning
#599/#650 Avoid using pytest-flake8 plugin
#657/#662 Upgrade GitHub Actions
#636 Better set up for docker-compose
Annif 0.59
This release makes many changes to how Annif handles vocabularies.
First, the vocabularies are now multilingual: projects with different languages can share the same vocabulary by using a common vocabulary id in the project configurations. The vocabulary id should no longer include a language specifier, which has been the practice until now. The language of the labels of subject suggestions is now defined by the project's language setting, or it can be overridden in a project by giving the language code in parentheses after the vocabulary id (e.g. vocab=lcsh(en) in a Finnish language project). These changes break the backward compatibility of existing projects and vocabularies.
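To make the vocab=lcsh(en) syntax concrete, the sketch below splits such a value into a vocabulary id and a label language, falling back to the project language when no override is given. It is an illustration only and not Annif's own parsing code; the function name and the regular expression are assumptions.

```python
import re

# "vocab_id" optionally followed by "(lang)"; hyphens are allowed in the id
VOCAB_SPEC = re.compile(r"^(?P<vocab_id>[\w-]+)(?:\((?P<lang>[^)]+)\))?$")


def parse_vocab_spec(spec, project_language):
    """Return (vocab_id, label_language) for a vocab setting."""
    match = VOCAB_SPEC.match(spec.strip())
    if match is None:
        raise ValueError(f"Invalid vocabulary specification: {spec!r}")
    # Without an explicit override, the project language is used for labels.
    return match["vocab_id"], match["lang"] or project_language


print(parse_vocab_spec("lcsh(en)", "fi"))  # ('lcsh', 'en')
print(parse_vocab_spec("lcsh", "fi"))      # ('lcsh', 'fi')
```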
The CLI command for loading a vocabulary has changed: the command is now annif load-vocab to align with the other annif commands and its first argument is a vocabulary id instead of a project id. When loading a vocabulary from a TSV file the --language option needs to be given to set the language. A command annif list-vocabs is introduced for listing vocabularies. The old annif loadvoc command still works in this release, but it has been deprecated and will be removed in the next Annif release.
The CLI commands are now documented in a page on the ReadTheDocs instead of the Annif wiki. The development installations of Annif now use Poetry for managing Python virtual environments and dependencies. There are also a few other minor changes, including an upgrade to Simplemma v0.8 series that introduced support for new languages.
Note also that we are starting to prepare for Annif 1.0 release. For this purpose we have opened the issue #616 for discussing the expectations of backward compatibility and Semantic Versioning in releases beyond 1.0.
Backward compatibility
The changes in the vocabulary functionality require reloading of previously loaded vocabularies and retraining of existing models.
New features
#559/#600 Make vocabularies multilingual
#602/#614 Implement load-vocab and list-vocabs commands
#603/#610 Store vocabs in AnnifRegistry so they are shared between projects
#597 Include labels without language tag and concepts without labels in vocabulary
Improvements
#617/#618 Upgrade to simplemma 0.8 and disable unnecessary cache
#595/#611 Autogenerated CLI commands documentation on ReadTheDocs
#612 Add Annif logo to ReadTheDocs sidebar
#608 Multilingual SubjectIndex backed by CSV file
#604 Refactor SubjectSuggestion to store subject_id - not uri, label, notation
Maintenance
#607 Remove language suffixes from vocabulary ids in example config
#606 Refactor SubjectSet and Document to store subject IDs instead of URIs and labels
#601/#605 Switch to Poetry for dependency management
#621 Remove curl from Docker image
#622 Remove Poetry cache from Docker image
Fixes
#613 Restore ability to use vocab language different from project language
#619 Allow use of hyphens in vocabulary IDs
#620 Make NN ensemble suggest operations silent
Annif 0.58
This release introduces a new Simplemma analyzer, support for multiple configuration files in a directory, and support for Python 3.10; support for Python 3.7 is removed.
Simplemma is a lightweight multilingual lemmatizer, which currently supports 38 languages; an analyzer based on Simplemma is now implemented as a core feature of Annif. Using multiple project configuration files is made possible by implementing support for a project configuration directory: Annif reads all files matching the patterns *.cfg and *.toml in the directory and merges their contents. The default name of the configuration directory is projects.d, but any directory can be selected with the -p/--projects command option or the ANNIF_PROJECTS environment variable.
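The idea of merging several configuration files can be sketched as follows. This is not Annif's implementation: the function name, the sorted reading order, the "later file wins" rule and the example project ids are all assumptions, and the TOML branch relies on tomllib, which is in the standard library only from Python 3.11.

```python
import configparser
import tempfile
from pathlib import Path

try:
    import tomllib  # standard library in Python 3.11+
except ModuleNotFoundError:
    tomllib = None


def load_project_configs(directory):
    """Merge project sections from all *.cfg and *.toml files in a directory."""
    projects = {}
    # Assumption: files are read in sorted name order; later files win on clashes.
    paths = sorted(
        p for p in Path(directory).iterdir() if p.suffix in (".cfg", ".toml")
    )
    for path in paths:
        if path.suffix == ".toml":
            if tomllib is None:
                continue  # this sketch skips TOML files on older Pythons
            with path.open("rb") as handle:
                projects.update(tomllib.load(handle))
        else:
            parser = configparser.ConfigParser()
            parser.read(path, encoding="utf-8")
            projects.update({name: dict(parser[name]) for name in parser.sections()})
    return projects


# Demonstration with two hypothetical project files
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.cfg").write_text("[mllm-fi]\nlanguage=fi\nbackend=mllm\n")
    Path(tmp, "b.toml").write_text('[svc-en]\nlanguage = "en"\nbackend = "svc"\n')
    merged = load_project_configs(tmp)

print(sorted(merged))
```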
Python 3.10 support is reached by updating multiple dependencies; retraining of existing projects should not be necessary. The language filtering optional feature is not yet available on Python 3.10, because of the lack of support of pycld3 for Python 3.10.
New features:
Improvements:
Maintenance:
- #594 Upgrade Simplemma to version 0.7
- #587 Update GitHub Actions
- #588 Delete .coveragerc configuration file
- #598 Pin flake8 to version 4.x to avoid pytest-flake8 breakage
Bug fixes:
- #586 Fix readthedocs documentation builds
Annif 0.57
Training of NN ensemble models can now be performed in parallel (running suggest operations simultaneously for all source projects) on multiple CPUs; this is controlled by using the --jobs parameter of the train command. The compatibility of Annif with DVC is improved by supporting TOML file format for configuring Annif projects. The --force option is added to the loadvoc command that can be used to replace an existing vocabulary instead of updating it. This release includes many small maintenance tasks for the CI/CD pipeline, e.g. migrating Docker image builds to GitHub Actions from the Drone platform.
Omikuji, TensorFlow and Connexion dependencies are upgraded to the latest available versions; retraining of projects should not be necessary.
New features:
#526/#567 Add --force option to loadvoc CLI command
Improvements:
#429/#568 Perform suggest operations in parallel using multiprocessing in nn_ensemble
#547/#560 Support TOML as a configuration file format alongside CFG/INI for DVC compatibility
Maintenance:
#570 Use fulltext corpus in MLLM tests which is much faster
#571 Docker builds on GitHub Actions CI/CD
#572 Update Dockerfile v0.57
#573 Ensure setuptools and wheel are installed & up-to-date for tests in GitHub Actions CI
#574 Avoid running duplicated tests on PRs in GitHub Actions CI
#575 Resolve some Warnings by tests
#576 Enable pip cache in GitHub Actions CI
#577 Improved Project links in PyPI page
#578 Update dependencies v0.57
#581/#582 Add tags trigger to GH Actions CI/CD workflow