Bug: Tokenizer not working on partial UTF-8 bytes #8691
Labels
bug-unconfirmed
medium severity
stale
What happened?
Note: Discovered by one of the users of Guidance.
If a multibyte UTF-8 character is encoded to two tokens, llama.cpp is unable to tokenise the byte representation of one of those tokens on its own.
To see this, tokenise the single character '歪':
So the single character '歪' has encoded to two tokens, 15722 and 103. A little further investigation (via the Python interfaces) revealed that 15722 maps to the bytes `\xe6\xad`. However, if we try running those bytes through `llama-tokenize`, a C++ exception is thrown, which is not particularly helpful (and breaks Guidance). Rather than the exception, we were expecting the output

```
15722 -> '�'
```

for the reduced tokenisation request. This has been tested on Linux and Windows.
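For context, a minimal Python sketch (independent of llama.cpp, using only the standard library) of why the two bytes in question cannot decode on their own, and why '�' (U+FFFD) is the natural fallback output:

```python
# '歪' (U+6B6A) occupies three bytes in UTF-8.
full = '歪'.encode('utf-8')
assert full == b'\xe6\xad\xaa'

# Token 15722 reportedly covers only the first two bytes,
# which is a truncated (hence invalid) UTF-8 sequence.
partial = full[:2]  # b'\xe6\xad'

try:
    partial.decode('utf-8')  # strict decoding raises
except UnicodeDecodeError as exc:
    print('strict decode fails:', exc.reason)

# Decoding with errors='replace' substitutes U+FFFD instead of
# failing hard, matching the expected "15722 -> '�'" output.
print(partial.decode('utf-8', errors='replace'))  # '�'
```

The suggestion in this issue is essentially that the tokenizer's byte-to-string path should use this kind of replacement strategy rather than letting a C++ exception escape.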
For reference, the original issue filed on Guidance:
guidance-ai/guidance#934
Name and Version
./llama-cli --version
version: 3460 (ed67bcb)
built with cc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0 for x86_64-linux-gnu
What operating system are you seeing the problem on?
Linux, Windows
Relevant log output
No response