
Bug: Tokenizer not working on partial UTF-8 bytes #8691

Closed
riedgar-ms opened this issue Jul 25, 2024 · 2 comments

Labels
bug-unconfirmed · medium severity (used to report medium severity bugs in llama.cpp, e.g. malfunctioning features that are still usable) · stale

Comments

@riedgar-ms

What happened?

Note: Discovered by one of the users of Guidance.

If a multibyte UTF-8 character is encoded to two tokens, LlamaCpp is unable to tokenise the byte representation of one of the tokens.

To see this:

printf '\xe6\xad\xaa' | ./llama-tokenize -m ~/.cache/huggingface/hub/models--bartowski--Meta-Llama-3-8B-Instruct-GGUF/snapshots/4ebc4aa83d60a5d6f9e1e1e9272a4d6306d770c1/Meta-Llama-3-8B-Instruct-IQ3_S.gguf --stdin --no-bos
<snip>
 15722 -> ''
   103 -> ''
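
For reference, '歪' is U+6B6A, which UTF-8 encodes as the three bytes \xe6\xad\xaa. A minimal, self-contained C++ sketch (not llama.cpp code, just the standard UTF-8 encoding rules) confirming the byte decomposition:

// Encode a Unicode codepoint to UTF-8 and print the bytes, to confirm
// that U+6B6A ('歪') is the three-byte sequence E6 AD AA.
#include <cstdint>
#include <cstdio>
#include <string>

static std::string cpt_to_utf8(uint32_t cpt) {
    std::string out;
    if (cpt < 0x80) {                      // 1 byte:  0xxxxxxx
        out += (char)cpt;
    } else if (cpt < 0x800) {              // 2 bytes: 110xxxxx 10xxxxxx
        out += (char)(0xC0 | (cpt >> 6));
        out += (char)(0x80 | (cpt & 0x3F));
    } else if (cpt < 0x10000) {            // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += (char)(0xE0 | (cpt >> 12));
        out += (char)(0x80 | ((cpt >> 6) & 0x3F));
        out += (char)(0x80 | (cpt & 0x3F));
    } else {                               // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += (char)(0xF0 | (cpt >> 18));
        out += (char)(0x80 | ((cpt >> 12) & 0x3F));
        out += (char)(0x80 | ((cpt >> 6) & 0x3F));
        out += (char)(0x80 | (cpt & 0x3F));
    }
    return out;
}

int main() {
    for (unsigned char c : cpt_to_utf8(0x6B6A)) {
        printf("%02X ", c);                // prints: E6 AD AA
    }
    printf("\n");
    return 0;
}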

So the single character '歪' has encoded to two tokens, 15722 and 103. A little further investigation (via the Python interfaces) revealed that 15722 maps to the bytes \xe6\xad. However, if we try running this through llama-tokenize:

 printf '\xe6\xad' | ./llama-tokenize -m ~/.cache/huggingface/hub/models--bartowski--Meta-Llama-3-8B-Instruct-GGUF/snapshots/4ebc4aa83d60a5d6f9e1e1e9272a4d6306d770c1/Meta-Llama-3-8B-Instruct-IQ3_S.gguf --stdin --no-bos
<snip>
terminate called after throwing an instance of 'std::invalid_argument'
  what():  invalid character
Aborted (core dumped)

which is not particularly helpful (and breaks Guidance). Rather than the C++ exception, we were expecting the output 15722 -> '�' for the reduced tokenisation request.
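
To make that expectation concrete, here is a minimal sketch of the lenient decoding we were hoping for (an illustration of the expected semantics, not a proposed patch to llama.cpp's actual decoder): when the input ends partway through a multibyte sequence, emit U+FFFD rather than throwing.

// Decode UTF-8 leniently: a sequence truncated at the end of the input
// becomes a single U+FFFD (the '�' replacement character) instead of
// raising std::invalid_argument.
#include <cstdint>
#include <cstdio>
#include <string>
#include <vector>

static std::vector<uint32_t> decode_utf8_lenient(const std::string & s) {
    std::vector<uint32_t> cpts;
    size_t i = 0;
    while (i < s.size()) {
        const unsigned char b = s[i];
        // Determine the sequence length from the lead byte.
        const size_t len = b < 0x80          ? 1
                         : (b >> 5) == 0x06  ? 2
                         : (b >> 4) == 0x0E  ? 3
                         : (b >> 3) == 0x1E  ? 4
                         : 0;
        if (len == 0) {                      // stray continuation / invalid lead byte
            cpts.push_back(0xFFFD);
            ++i;
            continue;
        }
        if (i + len > s.size()) {            // input ends mid-sequence:
            cpts.push_back(0xFFFD);          // one replacement char for the tail
            break;
        }
        uint32_t cpt = (len == 1) ? b : (b & (0x7F >> len));
        bool ok = true;
        for (size_t j = 1; j < len; ++j) {
            const unsigned char cb = s[i + j];
            if ((cb & 0xC0) != 0x80) { ok = false; break; }
            cpt = (cpt << 6) | (cb & 0x3F);
        }
        if (!ok) {                           // malformed continuation byte
            cpts.push_back(0xFFFD);
            ++i;
            continue;
        }
        cpts.push_back(cpt);
        i += len;
    }
    return cpts;
}

int main() {
    // "\xe6\xad" is the first two bytes of U+6B6A; a strict decoder throws,
    // a lenient one yields a single U+FFFD.
    for (uint32_t cpt : decode_utf8_lenient("\xe6\xad")) {
        printf("U+%04X\n", cpt);             // prints: U+FFFD
    }
    return 0;
}

Substituting U+FFFD for a maximal invalid subpart mirrors common practice in lenient decoders (e.g. the WHATWG UTF-8 decode algorithm).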

This has been tested on Linux and Windows.

For reference, the original issue filed on Guidance:
guidance-ai/guidance#934

Name and Version

./llama-cli --version
version: 3460 (ed67bcb)
built with cc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0 for x86_64-linux-gnu

What operating system are you seeing the problem on?

Linux, Windows

Relevant log output

No response

@riedgar-ms added the bug-unconfirmed and medium severity labels on Jul 25, 2024
@riedgar-ms reopened this on Jul 25, 2024
@riedgar-ms (Author)

Any update on this?

@github-actions bot added the stale label on Sep 15, 2024
@github-actions bot (Contributor)

This issue was closed because it has been inactive for 14 days since being marked as stale.
