Add support for Whisper timestamps and task/language configuration #238
Would there be a reason to not have this always on? If it is slower, then perhaps we can allow it to be turned off, but I would have it on by default. Also, please update the examples so we know how to match on timestamps, and so that we also specify the format (ms? s?). :)
Also, it is generally bad practice to change the output based on an option, which I assume is the case here. This may be particularly annoying once we have the type system. So we should consider either different entry-point functions or, when timestamps is false, using bogus timestamps (maybe -1 to -1)?
I thought the same, but the difference is that with timestamps disabled we enforce the `<notimestamps>` token, so the model does not generate timestamps at all and we do not "waste" model iterations. In practice it doesn't seem to make much difference though. Note that we can also add `timestamps: :word` for per-word timestamps, so making the user opt in as needed may make more sense.
`start_timestamp_seconds`, `end_timestamp_seconds`?
It's not! I need to update the example :D It was one of the reasons for a separate serving; now it's fine to have a more Whisper-specific output spec. We just allow timestamps to be `nil`. The only weird thing is that without timestamps we return `:chunks`, which is a single element with nil start and end, but that should be fine.
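To make the shape discussed above concrete, here is a minimal sketch of what a caller might see in both modes. The module name, the sample text, and the exact key names (`start_timestamp_seconds`, `end_timestamp_seconds`, taken from the names floated in this thread) are assumptions for illustration, not the merged API.

```elixir
defmodule TimestampShapeSketch do
  # Hypothetical output with timestamps enabled: one chunk per segment.
  def with_timestamps do
    %{
      chunks: [
        %{text: " Hello there.", start_timestamp_seconds: 0.0, end_timestamp_seconds: 1.5},
        %{text: " How are you?", start_timestamp_seconds: 1.5, end_timestamp_seconds: 3.0}
      ]
    }
  end

  # Hypothetical output with timestamps disabled: a single chunk whose
  # start and end are nil, as described in the comment above.
  def without_timestamps do
    %{
      chunks: [
        %{
          text: " Hello there. How are you?",
          start_timestamp_seconds: nil,
          end_timestamp_seconds: nil
        }
      ]
    }
  end

  # Both shapes share the :chunks key, so one function can handle either;
  # only the per-chunk clause differs.
  def describe(%{chunks: chunks}) do
    Enum.map(chunks, fn
      %{start_timestamp_seconds: nil, end_timestamp_seconds: nil, text: text} ->
        String.trim(text)

      %{start_timestamp_seconds: from, end_timestamp_seconds: to, text: text} ->
        "[#{from}s-#{to}s] #{String.trim(text)}"
    end)
  end
end
```

Because the top-level shape stays the same whether or not timestamps are requested, the caller only needs to distinguish `nil` from numeric values inside each chunk.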
If we will have `timestamps: :words`, maybe this should be `timestamps: :sentences`?
Should we still return the text if we are computing the chunks? It may be that we are building the text only to never use it. I also see that the chunks and the text differ slightly when it comes to spacing, but I assume that's easy to post-process.
What if we always return chunks and we have a function called `BBB.Audio.chunks_to_string`?
Always returning chunks may help make it consistent with streams too. I am fine if you want to postpone this decision until we have streaming.
It is not really sentences, the model outputs timestamps whenever it feels like. It could be `timestamps: :segments`, just a bit vague?
I wasn't sure, but thinking about streaming I am leaning towards that. FWIW the post-processing is just join + trim, so it's fine to leave this up to the user.
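For reference, the "join + trim" post-processing could be sketched roughly as follows. The module and function names are hypothetical, and the sketch assumes each chunk is a map with a `:text` key whose value may carry leading whitespace.

```elixir
defmodule ChunksSketch do
  # Hypothetical helper: concatenate the chunk texts in order,
  # then trim surrounding whitespace from the combined string.
  def chunks_to_string(chunks) do
    chunks
    |> Enum.map_join(& &1.text)
    |> String.trim()
  end
end
```

With chunks such as `[%{text: " Hello there."}, %{text: " How are you?"}]`, this yields a single string without the leading space.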
:segments is good. Agreed on everything else too!
Updated, I will remove `:text` later with streaming :)