You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
I am trying to prepare a dataset for whisper fine-tuning , and I have a lot of small segment clip , most of them less than 6 seconds, I read the paper, but didn’t understand this paragraph:
“ When a final transcript segment is only partially included in the current 30- second audio chunk, we predict only its start time token for the segment when in timestamp mode, to indicate that the subsequent decoding should be performed on an audio window aligned with that time, otherwise we truncate the audio to not include the segment”
So when should I add the final segment if it is partially included in the current 30-second chunk, and when should I truncate the chunk without it, and if I added it how to extract only relevant transcription?
I am trying to prepare a dataset for whisper fine-tuning , and I have a lot of small segment clip , most of them less than 6 seconds, I read the paper, but didn’t understand this paragraph:
“ When a final transcript segment is only partially included in the current 30- second audio chunk, we predict only its start time token for the segment when in timestamp mode, to indicate that the subsequent decoding should be performed on an audio window aligned with that time, otherwise we truncate the audio to not include the segment”
So when should I add the final segment if it is partially included in the current 30-second chunk, and when should I truncate the chunk without it, and if I added it how to extract only relevant transcription?
To make it clear:
assume that every window is 30 seconds, how to get the correct relevant transcription of the partially included segments?
Anyone could help?
The text was updated successfully, but these errors were encountered: