Add Whisper model and speech-to-text serving #107
Conversation
```elixir
for sample <- padded_samples do
  sample
  |> Nx.transpose()
  |> Nx.to_batched(1)
```
@polvalente btw is there a reasonable way to have a batched version? Is it something we could improve with vmap?
FFT is already batched, so it's more about `as_windowed` becoming batched so we can have a batched STFT.
With a batched STFT, it would be reasonably easy to have a batched `stft_to_mel`, because it's basically a clever matrix product.
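A minimal sketch of that idea (illustrative only, not NxSignal or Bumblebee code): assuming a magnitude spectrogram shaped `{batch, frames, freq_bins}` and a precomputed mel filter bank shaped `{freq_bins, mel_bins}`, the batched conversion is a single contraction over the frequency axis.

```elixir
defmodule MelSketch do
  import Nx.Defn

  # spectrogram: {batch, frames, freq_bins} magnitudes
  # filter_bank: {freq_bins, mel_bins} precomputed mel filters
  defn stft_to_mel(spectrogram, filter_bank) do
    # Contract the frequency axis against the filter bank; the leading batch
    # and frame axes pass through unchanged, so the same code handles a
    # single window or a whole batch.
    Nx.dot(spectrogram, [2], filter_bank, [0])
  end
end
```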
Force-pushed from 37e9258 to 92fc725.
```elixir
  end)
end

defp ffmpeg_read_as_pcm(path, sampling_rate) do
```
Does it make sense to have Bumblebee depend on a 3rd-party binary like this? In one sense it makes things much easier to work with out of the box, but on the other it's a tight assumption. Though I guess we don't explicitly require it, so it's not a big deal.
I don't think it does. I was talking with @jonatanklosko about a lib I'm planning with someone on the ML channel to wrap ffmpeg and load audio data into Nx tensors from either a binary or a file name.
It would be an optional sister lib to NxSignal
For the serving, I think we could either call said library if available, or just receive tensors directly otherwise.
@seanmor5 I think we definitely should have an easy option to work with a file in this case, and as long as it relies on optional dependencies we should be good.
FWIW hf/transformers also uses ffmpeg for files.
Yeah, I am fine with this as long as:
- Avoiding it is easy (i.e. just pass a tensor)
- We explicitly document it
- We raise a nice error message if not available
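A rough sketch of how such a helper could look (hypothetical; module name and exact ffmpeg flags are illustrative, not necessarily the code in this PR), shelling out to ffmpeg only when it is on the PATH and returning a clear error otherwise:

```elixir
defmodule AudioSketch do
  # Hypothetical helper: decode an audio file to mono f32 PCM at the given
  # sampling rate via ffmpeg, with explicit errors when it can't be done.
  def ffmpeg_read_as_pcm(path, sampling_rate) do
    format =
      case System.endianness() do
        :little -> "f32le"
        :big -> "f32be"
      end

    cond do
      System.find_executable("ffmpeg") == nil ->
        {:error, "ffmpeg not found in PATH"}

      not File.exists?(path) ->
        {:error, "no file found at #{path}"}

      true ->
        args = [
          "-i", path,
          "-ac", "1",
          "-ar", Integer.to_string(sampling_rate),
          "-f", format,
          "-hide_banner",
          "-loglevel", "quiet",
          "pipe:1"
        ]

        case System.cmd("ffmpeg", args) do
          {pcm, 0} -> {:ok, Nx.from_binary(pcm, :f32)}
          {_, code} -> {:error, "ffmpeg failed with exit code #{code}"}
        end
    end
  end
end
```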
@seanmor5 @polvalente I changed the featurizer to return the input as channels-last (…).
Hey @developertrinidad08, the feature is only available on the `main` branch for now; here's an example notebook:

# Whisper
```elixir
Mix.install([
  {:bumblebee, github: "elixir-nx/bumblebee"},
  {:nx, github: "elixir-nx/nx", sparse: "nx", override: true},
  {:exla, github: "elixir-nx/nx", sparse: "exla", override: true},
  {:kino, "~> 0.8.1"}
])

Nx.global_default_backend(EXLA.Backend)
```
## Example
```elixir
{:ok, whisper} = Bumblebee.load_model({:hf, "openai/whisper-tiny"})
{:ok, featurizer} = Bumblebee.load_featurizer({:hf, "openai/whisper-tiny"})
{:ok, tokenizer} = Bumblebee.load_tokenizer({:hf, "openai/whisper-tiny"})

serving =
  Bumblebee.Audio.speech_to_text(whisper, featurizer, tokenizer,
    max_new_tokens: 100,
    defn_options: [compiler: EXLA]
  )

audio_input = Kino.Input.audio("Audio", sampling_rate: featurizer.sampling_rate)
```
```elixir
audio = Kino.Input.read(audio_input)

audio =
  audio.data
  |> Nx.from_binary(:f32)
  |> Nx.reshape({:auto, audio.num_channels})
  |> Nx.mean(axes: [1])

Nx.Serving.run(serving, audio)
```
It works perfectly, thank you very much @jonatanklosko
I have another query: is there a label or option to change the language to Spanish or a different language?
@developertrinidad08 you can try this:

```elixir
serving =
  Bumblebee.Audio.speech_to_text(whisper, featurizer, tokenizer,
    max_new_tokens: 100,
    defn_options: [compiler: EXLA],
    forced_token_ids: [
      {1, Bumblebee.Tokenizer.token_to_id(tokenizer, "<|es|>")},
      {2, Bumblebee.Tokenizer.token_to_id(tokenizer, "<|transcribe|>")},
      {3, Bumblebee.Tokenizer.token_to_id(tokenizer, "<|notimestamps|>")}
    ]
  )
```

We will likely have a higher-level API to set the language, but it's model/tokenizer specific, so I deferred that for now.
@jonatanklosko do you happen to know if providing a language to …
@dcaixinha from what I saw it does. As far as I understand, the language token always indicates what language the speech uses. […] There's also a "glitch" when the speech is English and we set a different language token + …
Hello, taking a look at this I am trying to wrap my head around how to inject `initial_prompt`-type functionality (from the `whisper` Python implementation). Basically it allows you to inject a string that gets tokenized to extend the initial window tokens in the model, giving hints about things that may exist in the input audio. A primary use, for instance, is to inject proper names that may be in the input audio, so that the model is more likely to output the right proper name rather than a sound-alike text ("Jonatan Kłosko" vs "Jonathan Costco"). I am just not seeing a good way to duplicate this functionality by extending this. https://github.com/openai/whisper/blob/main/whisper/transcribe.py#L194
@wadestuart the serving currently works on a single window; we will extend it to chunk longer inputs in the future. I'm not sure if there's an easy way to inject the prompt (other than …
@jonatanklosko Thank you! I will probably hold out to see how the multiple-windows implementation nets out, use a port of the Python implementation for the time being, and revisit at that point.
Hi @jonatanklosko, sorry for necro-bumping this thread, but I was wondering if there's any option of passing the …
Hey @dcaixinha! Currently it's not really feasible since we use […]

That said, you can configure Whisper with no specific language and it may still return the expected transcription, which, depending on what you're doing, may do the job. I mean this:
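A hedged sketch of that idea (hypothetical; it assumes `forced_token_ids` accepts a list that skips the language position): the same serving as above, but with no language token forced at position 1, so the model predicts the language itself.

```elixir
# Hypothetical: leave position 1 (the language token) unforced so the model
# detects the language itself, while still forcing the task tokens.
serving =
  Bumblebee.Audio.speech_to_text(whisper, featurizer, tokenizer,
    max_new_tokens: 100,
    defn_options: [compiler: EXLA],
    forced_token_ids: [
      {2, Bumblebee.Tokenizer.token_to_id(tokenizer, "<|transcribe|>")},
      {3, Bumblebee.Tokenizer.token_to_id(tokenizer, "<|notimestamps|>")}
    ]
  )
```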
Gotcha, thank you very much @jonatanklosko 🙌 Since in the Python library they do it at run-time, I was wondering if the same would be possible in Elixir 💭 Thank you very much for your help 🙇
@dcaixinha yeah, the difference is that in PyTorch the computation is eager and everything can be dynamic, while we rely on compiled computation.
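As a toy illustration of that difference (not Bumblebee code, just a generic `defn` example): option values are baked into the compiled computation, so each distinct value means compiling a new graph rather than changing it at run-time.

```elixir
defmodule Scale do
  import Nx.Defn

  # Options passed to a defn are compile-time values, not run-time inputs.
  defn scale(x, opts \\ []) do
    opts = keyword!(opts, factor: 2)
    x * opts[:factor]
  end
end

# Each distinct :factor value triggers a separate compilation of the graph.
Scale.scale(Nx.tensor([1, 2, 3]))
Scale.scale(Nx.tensor([1, 2, 3]), factor: 3)
```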