Add Whisper model and speech-to-text serving #107

Merged
jonatanklosko merged 29 commits into main from sm-whisper on Jan 26, 2023

Conversation

seanmor5 (Contributor)

No description provided.

```elixir
for sample <- padded_samples do
  sample
  |> Nx.transpose()
  |> Nx.to_batched(1)
```
Member

@polvalente btw is there a reasonable way to have a batched version? Is it something we could improve with vmap?

Contributor

FFT is already batched, so it's more about as_windowed becoming batched, which would give us a batched STFT.

With a batched STFT, it would be reasonably easy to have a batched stft_to_mel, because it's basically a clever matrix product.
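
A minimal sketch of that matrix-product step, assuming batched STFT magnitudes are already available; the module name, tensor names, and shapes are illustrative assumptions, not Bumblebee's or NxSignal's actual code:

```elixir
defmodule MelSketch do
  import Nx.Defn

  # stft_mag:   {batch, frames, freq_bins}   magnitude (or power) spectrogram per sample
  # filterbank: {freq_bins, num_mel_bins}    precomputed mel filterbank matrix
  defn stft_to_mel(stft_mag, filterbank) do
    stft_mag
    # Contract the frequency axis against the filterbank; the batch and frame
    # axes pass through untouched, so this step is batched for free.
    |> Nx.dot([2], filterbank, [0])
    |> Nx.add(1.0e-10)
    |> Nx.log()
  end
end
```

The point is only that the mel projection itself poses no obstacle to batching once the STFT is batched.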

@jonatanklosko (Member)

(image attachment)

```elixir
  end)
end

defp ffmpeg_read_as_pcm(path, sampling_rate) do
```
seanmor5 (Contributor, PR author)

Does it make sense to have Bumblebee depend on a 3rd-party binary like this? I guess in one sense it makes things much easier to work with out of the box, but on the other it's a tight assumption. Though I guess we don't explicitly require it, so it's not a big deal.

Contributor

I don't think it does. I was talking with @jonatanklosko about a lib I'm planning with someone on the ML channel to wrap ffmpeg and load audio data into Nx tensors from either a binary or a file name.

It would be an optional sister lib to NxSignal.

Contributor

For the serving, I think we could either call said library if available, or just receive tensors directly otherwise.

Member

@seanmor5 I think we definitely should have an easy option to work with a file in this case, and as long as it relies on optional dependencies we should be good.

FWIW hf/transformers also use ffmpeg for files.

Contributor

Yeah, I am fine with this as long as:

  1. Avoiding it is easy (i.e. just pass a tensor)
  2. We explicitly document it
  3. We raise a nice error message if it's not available (a rough sketch of this follows below)
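
To make point 3 concrete, here is a rough, hypothetical sketch; the module name, ffmpeg flags, and error wording are assumptions, not this PR's implementation. It decodes a file to f32 PCM via ffmpeg and fails with a clear message when the binary is missing:

```elixir
defmodule AudioSketch do
  # Hypothetical helper, not Bumblebee's actual code.
  def ffmpeg_read_as_pcm(path, sampling_rate) do
    unless System.find_executable("ffmpeg") do
      raise "ffmpeg was not found in PATH; install ffmpeg or pass the audio as an Nx tensor instead"
    end

    args = [
      "-hide_banner", "-loglevel", "quiet",
      "-i", path,
      "-ac", "1",                              # downmix to mono
      "-ar", Integer.to_string(sampling_rate), # resample to the featurizer's rate
      "-f", "f32le",                           # raw 32-bit float little-endian PCM
      "pipe:1"                                 # write to stdout
    ]

    case System.cmd("ffmpeg", args) do
      {pcm, 0} -> Nx.from_binary(pcm, :f32)
      {_, status} -> raise "ffmpeg failed to decode #{path} (exit status #{status})"
    end
  end
end
```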

@jonatanklosko (Member)

@seanmor5 @polvalente I changed the featurizer to return the input as channels-last ({batch_size, input_length, num_mel_bins}); this way we avoid transposing back and forth, plus we already use channels-last everywhere.

jonatanklosko changed the title from "Add whisper" to "Add Whisper model and speech-to-text serving" on Jan 26, 2023
jonatanklosko merged commit 1ca6418 into main on Jan 26, 2023
jonatanklosko deleted the sm-whisper branch on January 26, 2023 13:50
@developertrinidad08

(image attachment)

I wanted to try this and I got the following error:

```
** (RuntimeError) could not match the class name "WhisperForConditionalGeneration" to any of the supported models, please specify the :module and :architecture options
    (bumblebee 0.1.2) lib/bumblebee.ex:262: Bumblebee.load_spec/2
    (bumblebee 0.1.2) lib/bumblebee.ex:372: Bumblebee.load_model/2
    (stdlib 3.17.2) erl_eval.erl:685: :erl_eval.do_apply/6
    (stdlib 3.17.2) erl_eval.erl:446: :erl_eval.expr/5
    (stdlib 3.17.2) erl_eval.erl:123: :erl_eval.exprs/5
    (elixir 1.14.2) lib/module/parallel_checker.ex:107: Module.ParallelChecker.verify/1
```

Could you help me? I am new to Elixir.

@jonatanklosko (Member) commented Feb 8, 2023

Hey @developertrinidad08, the feature is only available on main currently. Here's a notebook you can import to try it out:

# Whisper

```elixir
Mix.install([
  {:bumblebee, github: "elixir-nx/bumblebee"},
  {:nx, github: "elixir-nx/nx", sparse: "nx", override: true},
  {:exla, github: "elixir-nx/nx", sparse: "exla", override: true},
  {:kino, "~> 0.8.1"}
])

Nx.global_default_backend(EXLA.Backend)
```

## Example

```elixir
{:ok, whisper} = Bumblebee.load_model({:hf, "openai/whisper-tiny"})
{:ok, featurizer} = Bumblebee.load_featurizer({:hf, "openai/whisper-tiny"})
{:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, "openai/whisper-tiny"})

serving =
  Bumblebee.Audio.speech_to_text(whisper, featurizer, tokenizer,
    max_new_tokens: 100,
    defn_options: [compiler: EXLA]
  )

audio_input = Kino.Input.audio("Audio", sampling_rate: featurizer.sampling_rate)
```

```elixir
audio = Kino.Input.read(audio_input)

audio =
  audio.data
  |> Nx.from_binary(:f32)
  |> Nx.reshape({:auto, audio.num_channels})
  |> Nx.mean(axes: [1])

Nx.Serving.run(serving, audio)
```

@developertrinidad08

It works perfectly, thank you very much @jonatanklosko!

@developertrinidad08

I have another query: is there a way to add a label or something to change the language to Spanish or a different language?

@jonatanklosko (Member)

@developertrinidad08 you can try this:

```elixir
serving =
  Bumblebee.Audio.speech_to_text(whisper, featurizer, tokenizer,
    max_new_tokens: 100,
    defn_options: [compiler: EXLA],
    forced_token_ids: [
      {1, Bumblebee.Tokenizer.token_to_id(tokenizer, "<|es|>")},
      {2, Bumblebee.Tokenizer.token_to_id(tokenizer, "<|transcribe|>")},
      {3, Bumblebee.Tokenizer.token_to_id(tokenizer, "<|notimestamps|>")}
    ]
  )
```

We will likely have a higher level API to set the language, but it's model/tokenizer specific, so I deferred that for now.

@dcaixinha

@jonatanklosko do you happen to know if providing a language to Whisper has any effect on the speech detection results? Or is it just used to perform translation on the output? Thanks in advance 🙏

@jonatanklosko (Member)

@dcaixinha from what I saw it does. As far as I understand, the language token always indicates what language the speech uses. Then <|transcribe|> transcribes in that very language, while <|translate|> translates it into English.

There's also a "glitch": when the speech is English and we set a different language token + <|transcribe|>, it transcribes the English speech translated into that language (ref).
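
For illustration, here is an untested variant of the earlier snippet (reusing whisper, featurizer, and tokenizer from the notebook above, with only the task token swapped to <|translate|>), which should make the serving output English text for Spanish speech rather than a Spanish transcription:

```elixir
serving =
  Bumblebee.Audio.speech_to_text(whisper, featurizer, tokenizer,
    max_new_tokens: 100,
    defn_options: [compiler: EXLA],
    forced_token_ids: [
      {1, Bumblebee.Tokenizer.token_to_id(tokenizer, "<|es|>")},
      {2, Bumblebee.Tokenizer.token_to_id(tokenizer, "<|translate|>")},
      {3, Bumblebee.Tokenizer.token_to_id(tokenizer, "<|notimestamps|>")}
    ]
  )
```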

@wadestuart commented Mar 11, 2023

Hello, taking a look at this, I am trying to wrap my head around how to add initial_prompt-style functionality (from the whisper transcribe.py linked below): basically, it lets you inject a string that gets tokenized to extend the initial window tokens in the model, giving it hints about things that may exist in the input audio. A primary use, for instance, is injecting proper names that may appear in the audio, so that the model is more likely to output the right proper name rather than a sound-alike ("Jonatan Kłosko" vs "Jonathan Costco"). I am just not seeing a good way to duplicate this functionality by extending this.

https://github.com/openai/whisper/blob/main/whisper/transcribe.py#L194

@jonatanklosko (Member)

@wadestuart the serving currently works on a single window; we will extend it to chunk longer inputs in the future. I'm not sure if there's an easy way to inject the prompt (other than :forced_token_ids, though it has a different purpose). I don't think hf/transformers has this option either, but we can think about it once we support multiple windows. That said, there's always the option to write the serving yourself for full control (just like the app is implemented), though we may still need changes to the generation API to handle prompt injection.

@wadestuart

@jonatanklosko Thank you! I will probably hold out to see how the multiple-windows implementation nets out, use a port to the Python implementation for the time being, and revisit then.

@dcaixinha

Hi @jonatanklosko, sorry for necro-bumping this thread, but I was wondering if there's any option of passing the forced_token_ids you suggested above at run-time. Your suggestion works fine if the serving will always serve the same language (which is set when calling Bumblebee.Audio.speech_to_text), but for dynamic languages it would be great to be able to pass the language when calling Nx.Serving.run. I was reading the docs for Nx.Serving but didn't find anything useful. Do you know if it's possible? 🙏 Thank you very much!

@jonatanklosko (Member)

Hey @dcaixinha! Currently it's not really feasible since we use forced_token_ids to generate the computation graph that is compiled. While technically it should be possible to pass the language token as an input, it's too model-specific to handle reasonably I think.

That said, you can configure Whisper with no specific language and it may still return the expected transcription, which depending on what you're doing may do the job. I mean this:

```elixir
forced_token_ids: [
  {2, Bumblebee.Tokenizer.token_to_id(tokenizer, "<|transcribe|>")},
  {3, Bumblebee.Tokenizer.token_to_id(tokenizer, "<|notimestamps|>")}
]
```
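
As a workaround sketch rather than an officially supported pattern, one could also precompile a separate serving per expected language and pick between them at call time. The language list here is just an example, and whisper, featurizer, tokenizer, and audio are assumed to be bound as in the notebook above; each entry compiles its own graph, so startup time and memory grow with the list:

```elixir
# Hypothetical workaround: one compiled serving per language token, chosen at runtime.
languages = ["<|en|>", "<|es|>", "<|pt|>"]

servings =
  Map.new(languages, fn lang ->
    serving =
      Bumblebee.Audio.speech_to_text(whisper, featurizer, tokenizer,
        max_new_tokens: 100,
        defn_options: [compiler: EXLA],
        forced_token_ids: [
          {1, Bumblebee.Tokenizer.token_to_id(tokenizer, lang)},
          {2, Bumblebee.Tokenizer.token_to_id(tokenizer, "<|transcribe|>")},
          {3, Bumblebee.Tokenizer.token_to_id(tokenizer, "<|notimestamps|>")}
        ]
      )

    {lang, serving}
  end)

# Pick the serving for whatever language the caller requests:
Nx.Serving.run(servings["<|es|>"], audio)
```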

@dcaixinha

Gotcha, thank you very much @jonatanklosko 🙌 Since the Python library does this at run-time, I was wondering if the same would be possible in Elixir 💭 Thanks again for your help 🙇

@jonatanklosko (Member)

@dcaixinha yeah, the difference is that in PyTorch the computation is eager and everything can be dynamic, while we rely on defn to build a computation graph that is compiled as a whole. As said, having the language configured at runtime is doable; I'm just not sure how it fits into the API yet. I added a note in #187 to consider that.
