I'm wondering whether people preprocess their audio before sending it to Whisper. I have sections that may contain music, but Whisper just assigns repeated words to them (like "If If If ...") with high probability (I'm using mlx-community/whisper-large-v3-turbo at the moment).
Do people 'clean' the music out of the clips first, or use a separate method to decide whether someone is speaking? Or are there specific settings (that I don't know about) which would handle this for me?
Also, how accurate do people find the timestamps from this version? I've seen people report timestamp-accuracy issues with other Whisper large-v3 models, but in the limited tests I've done so far they look pretty decent (though that might just be luck or the specific example).
My settings are:
output = mlx_whisper.transcribe(audio_file, path_or_hf_repo="mlx-community/whisper-large-v3-turbo", language="en", fp16=True, word_timestamps=True)
I don't know how common it is to run a separate VAD in front of Whisper, but it's probably a good idea in general since VAD is a lot cheaper to run (rough sketch below, after the parameter descriptions). You can also try tuning some of the transcription parameters to see if that helps:
no_speech_threshold: float
If the no_speech probability is higher than this value AND the average log probability
over sampled tokens is below `logprob_threshold`, consider the segment as silent
And:
hallucination_silence_threshold: Optional[float]
When word_timestamps is True, skip silent periods longer than this threshold (in seconds)
when a possible hallucination is detected
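For example, a minimal sketch of passing those to `mlx_whisper.transcribe` (the threshold values here are just starting points to experiment with, not recommendations):

```python
import mlx_whisper

output = mlx_whisper.transcribe(
    audio_file,
    path_or_hf_repo="mlx-community/whisper-large-v3-turbo",
    language="en",
    fp16=True,
    word_timestamps=True,
    # A segment is treated as silent when its no-speech probability is above
    # no_speech_threshold AND its average token log-probability is below
    # logprob_threshold.
    no_speech_threshold=0.6,
    logprob_threshold=-1.0,
    # With word_timestamps=True, skip silent gaps longer than this (seconds)
    # when a possible hallucination is detected.
    hallucination_silence_threshold=2.0,
)
```

And for the separate-VAD route, a rough sketch using Silero VAD to cut out the non-speech regions before transcribing (the torch.hub loading pattern is from the silero-vad README; `speech_only.wav` is just a scratch file name I picked):

```python
import torch
import mlx_whisper

# Load Silero VAD (downloads the model on first use; needs torch/torchaudio).
model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
get_speech_timestamps, save_audio, read_audio, VADIterator, collect_chunks = utils

wav = read_audio(audio_file, sampling_rate=16000)
speech_ts = get_speech_timestamps(wav, model, sampling_rate=16000)

# Keep only the detected speech regions and transcribe those.
speech_only = collect_chunks(speech_ts, wav)
save_audio("speech_only.wav", speech_only, sampling_rate=16000)

output = mlx_whisper.transcribe(
    "speech_only.wav",
    path_or_hf_repo="mlx-community/whisper-large-v3-turbo",
    language="en",
    word_timestamps=True,
)
```

One caveat with cutting the audio like this: the timestamps you get back are relative to the trimmed file, so if you need them on the original timeline you'd have to map them back using `speech_ts`.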