Error by Voikko analyzer: ValueError: character U+fb50067 is not in range [U+0000; U+10ffff]
#737
What other character codes besides … ?
The other character code was …
Both …
A different but possibly related error, which caused Annif to restart by itself, occurred at Finto AI (not the test instance):
Now the ValueError was triggered at the test instance ai.dev.finto.fi, after about one day of continuous suggest requests using fuzzed text contents of Jyväskylä theses. At 10:57 the logs have these lines, and then Annif restarts (oldest line at the bottom):
I remember seeing … Then, just after Annif restarted:
So "eroa" was the word for which the error occurred. After this the error occurred for many more words ("niitä" and "tukea" followed); it could be for any word that is not in the LRU cache. Edit: more logs in Annif-logs-2023-11-07 12 13 40.txt
After manually restarting Annif, the words that previously triggered the error no longer do so.
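That pattern is consistent with per-word caching of normalization results: words cached before the corruption keep working, uncached words keep failing, and a restart clears everything. A minimal Python sketch of that mechanism — `FakeVoikko`, the `corrupted` flag, and `normalize_word` are hypothetical stand-ins, not actual Annif code:

```python
from functools import lru_cache

class FakeVoikko:
    """Hypothetical stand-in for libvoikko that can enter a broken state."""
    def __init__(self):
        self.corrupted = False

    def normalize(self, word):
        if self.corrupted:
            # mimic the error seen in the Annif logs
            raise ValueError(
                "character U+fb50067 is not in range [U+0000; U+10ffff]")
        return word.lower()

voikko = FakeVoikko()

@lru_cache(maxsize=100000)
def normalize_word(word):
    return voikko.normalize(word)

normalize_word("eroa")          # cached while the library is healthy
voikko.corrupted = True         # simulate in-process corruption
print(normalize_word("eroa"))   # still fine: served from the cache
try:
    normalize_word("niitä")     # cache miss reaches the broken library
except ValueError as e:
    print("error:", e)
# Restarting the process resets both the cache and the library state,
# which would explain why previously failing words work after a restart.
```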
From the above error log, apparently this was the text to tokenize:
What's the Unicode character near the end?
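One way to hunt for such characters is to scan the text for code points in suspicious Unicode general categories (surrogates, private use, unassigned). A small sketch — the function name and category choices are mine, not from Annif:

```python
import unicodedata

def suspicious_chars(text):
    """Yield (index, code point, category) for suspicious characters:
    Cs = surrogate, Co = private use, Cn = unassigned."""
    for i, ch in enumerate(text):
        cat = unicodedata.category(ch)
        if cat in ("Cs", "Co", "Cn"):
            yield i, f"U+{ord(ch):06X}", cat

sample = "normal text \U0010A6BF end"
print(list(suspicious_chars(sample)))  # → [(12, 'U+10A6BF', 'Co')]
```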
That seems to be U+10A6BF, which is a private-use code point (so technically within the valid range), but not the one the error message mentions (U+3e3ffe4e). Inputting it, or the full text above, via the web UI does not re-trigger the error. I'm starting to suspect the error has something to do with how Annif is started. Anyway, I switched to sending suggest requests for Jyväskylä theses, but without fuzzing them.
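For reference, the kind of text fuzzing used in the earlier test runs can be approximated like this; the function and the mutation strategy are a hypothetical reconstruction, not the actual test script:

```python
import random

def fuzz_text(text, rate=0.01, seed=None):
    """Randomly replace characters with arbitrary valid, non-surrogate
    code points to stress-test the analyzer."""
    rng = random.Random(seed)
    chars = list(text)
    for i in range(len(chars)):
        if rng.random() < rate:
            cp = rng.randint(0x20, 0x10FFFF)
            if 0xD800 <= cp <= 0xDFFF:   # surrogates are not encodable
                cp = 0xFFFD              # use the replacement character
            chars[i] = chr(cp)
    return "".join(chars)

# The fuzzed text would then be POSTed in a loop to Annif's REST
# suggest endpoint for the project under test.
print(fuzz_text("Tutkimuksessa tarkastellaan eroa", rate=0.1, seed=42))
```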
My hunch is that by the time this error happens, something has already gone wrong somewhere, likely inside Voikko. Maybe some internal cache or other data structure has been corrupted? The word that triggers the error (here …) … Here is a somewhat similar situation involving locale-aware collation: in that case it's the BSD / OS X standard library that produces a string with illegal Unicode characters, and Python cannot handle them.
The restart of the Annif container that just occurred (the pod remained running) is possibly related to this issue. The logs show
komodor.com says this about the exit code:
Once again there was a sudden restart of the Annif container at Finto AI, this time with a slightly different error message:
Edit: the restart occurred about 4 hours after the previous boot (for the update to v1.0.1).
Many months passed without problems, but now (about 5 hours after a cluster reboot) the Annif container crashed with a new message:
Again, about 12 hours after a cluster reboot:
Word normalization by Voikko has failed on at least two occasions when using Annif v1.0 at Finto AI, resulting in a 500 error response to the suggest request. On both occasions the error was reproducible by sending a suggest request containing just the triggering word. Once noticed, the situation was fixed by restarting Annif, after which the word no longer produced the error.
The logs show the following error (the character code has been different each time):
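For context: CPython enforces the U+0000..U+10FFFF range whenever it constructs a string at the C level, which is presumably where this ValueError originates (e.g. when converting Voikko's wide-character output to a Python string). The same limit is visible from pure Python via `chr()` — a minimal illustration of the range check, not the actual failure path:

```python
MAX_CODE_POINT = 0x10FFFF

print(repr(chr(MAX_CODE_POINT)))  # the last valid code point: '\U0010ffff'
try:
    chr(0xFB50067)                # the value from the log, far out of range
except ValueError as e:
    print(e)                      # chr() arg not in range(0x110000)
```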
I'll make a PR to help analyze the bug in more detail.