I'm wondering whether people preprocess their audio before sending it to Whisper. I have sections that may contain music, but Whisper just assigns repeated words to them (like "If If If ...") with high probability (I'm using mlx-community/whisper-large-v3-turbo at the moment).
Do people 'clean' the music out of the clips first, or use a separate method to decide whether someone is speaking? Or are there specific settings (that I don't know about) which would handle this for me?
Also, how accurate do people find the timestamps from this version? I've seen people report timestamp-accuracy issues with other Whisper large-v3 models, but in the limited tests I've done so far they look pretty decent (though that might just be luck or the specific example).
My settings are:
output = mlx_whisper.transcribe(audio_file, path_or_hf_repo="mlx-community/whisper-large-v3-turbo", language="en", fp16=True, word_timestamps=True)
I don't know how common it is to run a separate VAD in front of Whisper, but it's probably a good idea in general since VAD is a lot cheaper to run (rough sketch below, after the parameter descriptions). You can also try tuning some of the transcription parameters to see if that helps:
no_speech_threshold: float
If the no_speech probability is higher than this value AND the average log probability
over sampled tokens is below `logprob_threshold`, consider the segment as silent
And:
hallucination_silence_threshold: Optional[float]
When word_timestamps is True, skip silent periods longer than this threshold (in seconds)
when a possible hallucination is detected
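For example, a minimal sketch of passing those to `mlx_whisper.transcribe` (the threshold values here are just starting points to experiment with, not recommendations):

```python
import mlx_whisper

output = mlx_whisper.transcribe(
    audio_file,
    path_or_hf_repo="mlx-community/whisper-large-v3-turbo",
    language="en",
    fp16=True,
    word_timestamps=True,
    # A segment is treated as silent when its no-speech probability is above
    # no_speech_threshold AND its average token log-probability is below
    # logprob_threshold.
    no_speech_threshold=0.6,
    logprob_threshold=-1.0,
    # With word_timestamps=True, skip silent gaps longer than this (seconds)
    # when a possible hallucination is detected.
    hallucination_silence_threshold=2.0,
)
```

And for the separate-VAD route, a rough sketch using Silero VAD to cut out the non-speech regions before transcribing (the torch.hub loading pattern is from the silero-vad README; `speech_only.wav` is just a scratch file name I picked):

```python
import torch
import mlx_whisper

# Load Silero VAD (downloads the model on first use; needs torch/torchaudio).
model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
get_speech_timestamps, save_audio, read_audio, VADIterator, collect_chunks = utils

wav = read_audio(audio_file, sampling_rate=16000)
speech_ts = get_speech_timestamps(wav, model, sampling_rate=16000)

# Keep only the detected speech regions and transcribe those.
speech_only = collect_chunks(speech_ts, wav)
save_audio("speech_only.wav", speech_only, sampling_rate=16000)

output = mlx_whisper.transcribe(
    "speech_only.wav",
    path_or_hf_repo="mlx-community/whisper-large-v3-turbo",
    language="en",
    word_timestamps=True,
)
```

One caveat with cutting the audio like this: the timestamps you get back are relative to the trimmed file, so if you need them on the original timeline you'd have to map them back using `speech_ts`.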