fix: grouping audio frames causes memory problems and is maybe slightly slower? #846
base: master
Conversation
I am unsure if the gc call is still necessary. I'll let you decide.
@IvanovCosmin, hello. I believe that memory will not increase over time in the `_group_frames` function, because memory is released after calling `fifo.read()`:

```python
def _group_frames(frames, num_samples=None):
    fifo = av.audio.fifo.AudioFifo()

    for frame in frames:
        frame.pts = None  # Ignore timestamp check.
        fifo.write(frame)

        if num_samples is not None and fifo.samples >= num_samples:
            yield fifo.read()

    if fifo.samples > 0:
        yield fifo.read()
```

Additionally, I think that grouping frames into a consistent size (500,000 samples) helps standardize the processing pipeline, making the next steps (resampling, ...) more stable and avoiding errors with long audio inputs.
It does not increase over time; it just increases too much for too long. I think that PyAV does something weird here. I follow your judgement, but this is not what happens in practice, and it does not work well in production environments.
From my test with device cuda and model large-v3, memory increase is listed below:
Log for audio3.m4a:
It can be seen that the memory increase might depend on the input (files with both an audio and a video stream show a higher memory increase), and the memory increase for long videos is negligible. Besides, in your example, if we don't use `_group_frames`, it takes 182 loops to process all the frames, whereas using that function takes only 3 loops.
It doesn't really seem faster.
In our testing we found that this grouping allocates around 100 bytes per audio sample, even for incredibly small videos.
For example, an 8-second, 1.36 MB video allocates 18.9 MB (I think it should be around 33 MB; unsure why it is only half).
It doesn't require much imagination to compute what would happen on really big videos.
This will trigger memory limits pretty constantly on production deployments, and single-handedly makes faster-whisper unsuitable for production systems.
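As a sanity check on those numbers, the expected figure can be reproduced with quick arithmetic. The 44.1 kHz sample rate below is an assumption (the thread does not state the source's rate); only the ~100 bytes/sample and the 8-second duration come from the comment above:

```python
sample_rate = 44_100     # Hz; assumed source rate, not stated in the thread
duration_s = 8           # clip length from the comment above
bytes_per_sample = 100   # allocation per sample measured in the comment above

estimated_mb = duration_s * sample_rate * bytes_per_sample / 1e6
print(estimated_mb)  # prints 35.28
```

That lands near the ~33 MB the author expected, and the observed 18.9 MB is roughly half of it, consistent with the "unsure why it is only half" remark (e.g. if only one of two channels ends up buffered).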
Before my change:
After my change:
The performance difference is this on a few hundred requests:
I won't die on the performance hill.