Whisper transcribing music sounds #1101

Open
aaa3334 opened this issue Nov 9, 2024 · 1 comment

aaa3334 commented Nov 9, 2024

I am wondering whether people preprocess their audio before sending it to Whisper. Some of my clips have sections that may contain music, and Whisper seems to just assign words to them (like "If If If ...") with high probability (I am using mlx-community/whisper-large-v3-turbo right now).
Do people 'clean' music out of the clips first? Or use a separate method to determine whether someone is speaking? Or are there specific settings (that I don't know about) which would do this for me?

How accurate do people find the timestamps from this version, too? I can see people have had issues with timestamp accuracy from other Whisper large-v3 models, but in the limited tests I have done so far they look pretty decent (though that might just be luck or the specific example).

My settings are:
output = mlx_whisper.transcribe(audio_file, path_or_hf_repo="mlx-community/whisper-large-v3-turbo", language="en", fp16=True, word_timestamps=True)

awni (Member) commented Nov 12, 2024

I don't know how common it is to run a separate VAD (voice activity detection) step with Whisper. It's probably a good idea in general because VAD can be a lot less expensive than transcription. But you can also try tuning some transcription parameters to see if that helps:

    no_speech_threshold: float
        If the no_speech probability is higher than this value AND the average log probability
        over sampled tokens is below `logprob_threshold`, consider the segment as silent
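
The rule in that docstring can be sketched as a small post-filter over the returned segments. This is only an illustration, not the library's own code: it assumes each segment in the transcription result is a dict carrying `no_speech_prob` and `avg_logprob` keys, the helper names `is_silent` and `drop_silent_segments` are made up for this example, and the default values of 0.6 and -1.0 are assumed to match the library defaults.

```python
def is_silent(no_speech_prob, avg_logprob,
              no_speech_threshold=0.6, logprob_threshold=-1.0):
    """Apply the docstring rule: silent only if BOTH conditions hold."""
    return (no_speech_prob > no_speech_threshold
            and avg_logprob < logprob_threshold)


def drop_silent_segments(segments, no_speech_threshold=0.6,
                         logprob_threshold=-1.0):
    """Keep only the segments that the rule does not flag as silent.

    `segments` is assumed to be a list of dicts with `no_speech_prob`
    and `avg_logprob` keys (hypothetical input shape for this sketch).
    """
    return [
        seg for seg in segments
        if not is_silent(seg["no_speech_prob"], seg["avg_logprob"],
                         no_speech_threshold, logprob_threshold)
    ]


# Made-up example values, not real model output.
example_segments = [
    {"text": "hello there", "no_speech_prob": 0.1, "avg_logprob": -0.3},
    {"text": "If If If", "no_speech_prob": 0.9, "avg_logprob": -1.5},
    {"text": "confident text", "no_speech_prob": 0.9, "avg_logprob": -0.2},
]
kept = drop_silent_segments(example_segments)
```

Note that the third segment survives even though its `no_speech_prob` is high, because its average log probability is above the threshold. That may be why confidently hallucinated words over music are not dropped by this rule alone.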

And:

    hallucination_silence_threshold: Optional[float]
        When word_timestamps is True, skip silent periods longer than this threshold (in seconds)
        when a possible hallucination is detected
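
As a rough sketch of how these parameters could be passed alongside the settings from the original post: the helper names below (`build_transcribe_kwargs`, `transcribe_with_tuning`) are hypothetical, and the threshold values are just starting points to experiment with, not recommendations. `mlx_whisper` is imported inside the function so the keyword arguments can be inspected without the package installed.

```python
MODEL_REPO = "mlx-community/whisper-large-v3-turbo"


def build_transcribe_kwargs(no_speech_threshold=0.6,
                            logprob_threshold=-1.0,
                            hallucination_silence_threshold=2.0):
    """Collect keyword arguments for mlx_whisper.transcribe.

    The threshold values are illustrative assumptions; tune them on
    your own audio.
    """
    return {
        "path_or_hf_repo": MODEL_REPO,
        "language": "en",
        "fp16": True,
        # hallucination_silence_threshold only applies with word timestamps.
        "word_timestamps": True,
        "no_speech_threshold": no_speech_threshold,
        "logprob_threshold": logprob_threshold,
        "hallucination_silence_threshold": hallucination_silence_threshold,
    }


def transcribe_with_tuning(audio_file, **overrides):
    """Run transcription with the tuned parameters (requires mlx_whisper)."""
    import mlx_whisper  # lazy import: only needed when actually transcribing

    kwargs = build_transcribe_kwargs(**overrides)
    return mlx_whisper.transcribe(audio_file, **kwargs)
```

For example, `transcribe_with_tuning(audio_file, no_speech_threshold=0.4)` would try a stricter silence check while keeping everything else the same.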
