Bug: Tokenizer not working on partial UTF-8 bytes #8691
Labels
bug-unconfirmed
medium severity
stale
What happened?
Note: Discovered by one of the users of Guidance.
If a multibyte UTF-8 character is encoded to two tokens, llama.cpp is unable to tokenise the byte representation of one of those tokens on its own.
To see this, tokenise the single character '歪':
So the single character '歪' has encoded to two tokens, 15722 and 103. A little further investigation (via the Python interfaces) revealed that 15722 maps to the bytes `\xe6\xad`. However, if we try running those bytes through `llama-tokenize`, a C++ exception is thrown, which is not particularly helpful (and breaks Guidance). Rather than the exception, we were expecting the output

```
15722 -> '�'
```

for the reduced tokenisation request. This has been tested on Linux and Windows.
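For context, a minimal Python sketch (independent of llama.cpp, using only the standard library) of why the two bytes in question cannot decode on their own, and why '�' (U+FFFD) is the natural fallback output:

```python
# '歪' (U+6B6A) occupies three bytes in UTF-8.
full = '歪'.encode('utf-8')
assert full == b'\xe6\xad\xaa'

# Token 15722 reportedly covers only the first two bytes,
# which is a truncated (hence invalid) UTF-8 sequence.
partial = full[:2]  # b'\xe6\xad'

try:
    partial.decode('utf-8')  # strict decoding raises
except UnicodeDecodeError as exc:
    print('strict decode fails:', exc.reason)

# Decoding with errors='replace' substitutes U+FFFD instead of
# failing hard, matching the expected "15722 -> '�'" output.
print(partial.decode('utf-8', errors='replace'))  # '�'
```

The suggestion in this issue is essentially that the tokenizer's byte-to-string path should use this kind of replacement strategy rather than letting a C++ exception escape.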
For reference, the original issue filed on Guidance:
guidance-ai/guidance#934
Name and Version
./llama-cli --version
version: 3460 (ed67bcb)
built with cc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0 for x86_64-linux-gnu
What operating system are you seeing the problem on?
Linux, Windows
Relevant log output
No response