T5 tokenizer decoding error with CodeT5+ #1021

Open
zcbenz opened this issue Oct 9, 2024 · 1 comment
Labels
bug Something isn't working

Comments

@zcbenz
Contributor

zcbenz commented Oct 9, 2024

$ python3 convert.py --model codet5p-220m
$ python3 t5.py --model codet5p-220m --prompt 'def print_hello_world():<extra_id_0>' --max-tokens 10
[INFO] Generating with T5...
Input:  def print_hello_world():<extra_id_0>
<extra_id_0>ĊĠĠĠĠprintĠ"HelloĠWorld"ĊĊ

The hf_t5.py script produces the correct output with the following changes:

diff --git a/t5/hf_t5.py b/t5/hf_t5.py
index 98c6da8..23d9644 100644
--- a/t5/hf_t5.py
+++ b/t5/hf_t5.py
@@ -23,11 +23,11 @@ def embed(t5_model: str):
 
 
 def generate(t5_model: str):
-    prompt = "translate English to German: As much as six inches of rain could fall in the New York City region through Monday morning, and officials warned of flooding along the coast."
+    prompt = "def print_hello_world():<extra_id_0>"
     tokenizer = AutoTokenizer.from_pretrained(t5_model)
     torch_model = AutoModelForSeq2SeqLM.from_pretrained(t5_model)
     torch_tokens = tokenizer(prompt, return_tensors="pt", padding=True).input_ids
-    outputs = torch_model.generate(torch_tokens, do_sample=False, max_length=512)
+    outputs = torch_model.generate(torch_tokens, do_sample=False, max_length=10)
     print(tokenizer.decode(outputs[0], skip_special_tokens=True))

$ python3 hf_t5.py --model codet5p-220m
    print "Hello World"

It seems the tokenizer does not handle streaming (token-by-token) decoding correctly.
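
For illustration, here is a minimal sketch of the failure mode, assuming the Hugging Face checkpoint Salesforce/codet5p-220m (whose tokenizer is byte-level BPE): joining per-token strings leaks the byte-level space/newline markers (Ġ, Ċ), while decoding the full sequence recovers the text.

# Hypothetical sketch, not the example's actual decode path.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Salesforce/codet5p-220m")
text = 'def print_hello_world():\n    print "Hello World"\n'
ids = tokenizer(text).input_ids

# Naive streaming: map each id to its token string and concatenate.
# Byte-level markers leak into the output (Ġ for space, Ċ for newline).
print("".join(tokenizer.convert_ids_to_tokens(ids)))

# Decoding the full sequence applies the byte decoder and recovers the text.
print(tokenizer.decode(ids, skip_special_tokens=True))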

@awni
Member

awni commented Oct 10, 2024

Thanks for flagging. Indeed the way we do streaming decode in the T5 example is not correct for most tokenizers (you typically can't decode each new token individually as we do here). The fix should either use a proper streaming decoder, or just eat the quadratic cost and re-decode the entire prefix on each step.
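
A minimal sketch of the second option, with generate_step standing in as a hypothetical placeholder for the model's per-token generation loop: re-decode the accumulated ids on every step and print only the not-yet-emitted suffix.

# Hypothetical sketch of the "re-decode the entire prefix" approach.
def stream_decode(tokenizer, generate_step, max_tokens):
    ids = []
    printed = ""
    for token in generate_step(max_tokens):  # yields one token id per step
        ids.append(int(token))
        # Re-decode everything generated so far (quadratic in output length)
        # and emit only the new suffix; assumes the decoded prefix is stable.
        text = tokenizer.decode(ids, skip_special_tokens=True)
        print(text[len(printed):], end="", flush=True)
        printed = text
    print()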

Will mark this as a bug, should be a fairly simple fix.

@awni awni added the bug Something isn't working label Oct 10, 2024