How to configure a text-to-speech model forced_token_ids? #210

Closed
rhcarvalho opened this issue May 19, 2023 · 3 comments

Comments

@rhcarvalho

Thanks for Bumblebee and the provided examples! I'm trying out the example at https://github.com/elixir-nx/bumblebee/blob/main/examples/phoenix/speech_to_text.exs.

It works well for audio input in English. For audio input in other languages, it seems to be automatically translating the output to English.

I read https://huggingface.co/openai/whisper-tiny#usage and, if I understood it correctly, I'd need to use forced_token_ids to specify the input language and to set the task to transcribe rather than translate, as in:

from transformers import WhisperProcessor, WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained("openai/whisper-tiny")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")
forced_decoder_ids = processor.get_decoder_prompt_ids(language="french", task="transcribe")  ## <<<<<

# ...

predicted_ids = model.generate(input_features, forced_decoder_ids=forced_decoder_ids)
transcription = processor.batch_decode(predicted_ids)
['<|startoftranscript|><|fr|><|transcribe|><|notimestamps|> Un vrai travail intéressant va enfin être mené sur ce sujet.<|endoftext|>']
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)
[' Un vrai travail intéressant va enfin être mené sur ce sujet.']

How can I do that with Bumblebee?

@josevalim (Contributor)

Not possible yet, see #187. :)

@jonatanklosko (Member)

@rhcarvalho you can customize forced_token_ids, see #107 (comment). We want to streamline this in #187 with higher-level options :)

@rhcarvalho (Author)

@jonatanklosko 👏 thanks for the pointer! I think the argument types have changed since then, as the original example in that comment throws an error. This is what worked for me, in case someone ends up checking this issue for a solution:

diff --git examples/phoenix/speech_to_text.exs examples/phoenix/speech_to_text.exs
index 99f72cb..94e8989 100644
--- examples/phoenix/speech_to_text.exs
+++ examples/phoenix/speech_to_text.exs
@@ -314,6 +314,15 @@ end
 {:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, "openai/whisper-tiny"})
 {:ok, generation_config} = Bumblebee.load_generation_config({:hf, "openai/whisper-tiny"})

+generation_config = %{
+  generation_config
+  | forced_token_ids: [
+      {1, Bumblebee.Tokenizer.token_to_id(tokenizer, "<|pt|>")},
+      {2, Bumblebee.Tokenizer.token_to_id(tokenizer, "<|transcribe|>")},
+      {3, Bumblebee.Tokenizer.token_to_id(tokenizer, "<|notimestamps|>")}
+    ]
+}
+
 serving =
   Bumblebee.Audio.speech_to_text(model_info, featurizer, tokenizer, generation_config,
     compile: [batch_size: 10],
