Add a more specific error when tokenizer returns zero tokens
jonatanklosko committed May 21, 2024
1 parent cb26c2d commit 3a59761
Showing 1 changed file with 10 additions and 2 deletions.
12 changes: 10 additions & 2 deletions lib/bumblebee/text/pre_trained_tokenizer.ex
@@ -392,8 +392,16 @@ defmodule Bumblebee.Text.PreTrainedTokenizer do
   end
 
   defp u32_binaries_to_tensor(list) do
-    list
-    |> IO.iodata_to_binary()
+    binary = IO.iodata_to_binary(list)
+
+    if binary == <<>> do
+      raise ArgumentError,
+            "the tokenizer returned zero tokens. Depending on the tokenizer," <>
+              " this may happen for blank input. You should check if the input is blank" <>
+              " before attempting tokenization"
+    end
+
+    binary
     |> Nx.from_binary(:u32)
     |> Nx.reshape({length(list), :auto})
   end
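
The new error message advises callers to check for blank input before tokenizing. A minimal sketch of such a check follows, assuming that "blank" means empty or whitespace-only. The module `BlankGuard` and its `check/1` function are hypothetical names used for illustration and are not part of Bumblebee.

```elixir
defmodule BlankGuard do
  # Hypothetical helper, not part of Bumblebee: rejects input that is
  # empty or whitespace-only before it reaches the tokenizer.
  @spec check(String.t()) :: {:ok, String.t()} | {:error, :blank_input}
  def check(text) when is_binary(text) do
    if String.trim(text) == "" do
      {:error, :blank_input}
    else
      {:ok, text}
    end
  end
end

# Usage: only pass the text on when the guard accepts it.
case BlankGuard.check("   \n") do
  {:ok, text} -> IO.puts("safe to tokenize: #{text}")
  {:error, :blank_input} -> IO.puts("skipping blank input")
end
```

Returning a tagged tuple rather than raising lets the caller decide how to handle blank input, for example by skipping it or by reporting a validation error. Whether a given tokenizer actually returns zero tokens for such input depends on the tokenizer, as the error message itself notes.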