Parallelization opportunities #120

Open
Shnatsel opened this issue Oct 30, 2024 · 2 comments

@Shnatsel
Contributor

Part of the gap observed in #119 is due to webp-animation using multi-threading. libwebp also has a multi-threading option for still images. But right now image-webp is single-threaded.

I think there are some opportunities for easy wins from parallelization.

Decoding frames in parallel

Each frame is contained in its own ANMF chunk which specifies the length, so we should be able to read the raw bytes and spawn the decoding on a thread pool quite easily.
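WebP is a RIFF container, so pulling the raw bytes of one ANMF chunk out of the stream is just a matter of reading the fourcc and the little-endian 32-bit payload size. A minimal sketch (the `read_chunk` name is made up here, and this is not image-webp's actual reader, which does more validation):

```rust
use std::io::{self, Read};

/// Read one RIFF chunk: fourcc, little-endian u32 payload size, payload.
/// The returned payload could then be handed to a worker thread for decoding.
fn read_chunk<R: Read>(r: &mut R) -> io::Result<([u8; 4], Vec<u8>)> {
    let mut fourcc = [0u8; 4];
    r.read_exact(&mut fourcc)?;
    let mut len_bytes = [0u8; 4];
    r.read_exact(&mut len_bytes)?;
    let len = u32::from_le_bytes(len_bytes) as usize;
    let mut payload = vec![0u8; len];
    r.read_exact(&mut payload)?;
    // RIFF chunks are padded to an even number of bytes
    if len % 2 == 1 {
        let mut pad = [0u8; 1];
        r.read_exact(&mut pad)?;
    }
    Ok((fourcc, payload))
}
```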

Complication: this may dramatically increase memory usage on machines with many cores, or whenever decoding outpaces compositing and uncomposited frames pile up. We will need some sort of backpressure mechanism to prevent that: for example, allow at most X frames or at most Y megapixels of decoded output in flight at once. We know all the frame dimensions up front, so that's not too hard to enforce, but it still adds complexity to the code compared to just slapping .par_iter() on it.
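The "at most Y megapixels in flight" cap could be a small condvar-guarded counter that decoder threads acquire from before decoding and the compositor releases back into. A sketch under assumed names (`PixelBudget` is hypothetical, not anything in image-webp):

```rust
use std::sync::{Condvar, Mutex};

/// Hypothetical budget capping how many pixels of decoded-but-uncomposited
/// frames may be in flight at once.
struct PixelBudget {
    remaining: Mutex<u64>,
    freed: Condvar,
}

impl PixelBudget {
    fn new(limit: u64) -> Self {
        Self { remaining: Mutex::new(limit), freed: Condvar::new() }
    }

    /// Block until `pixels` fit under the budget, then reserve them.
    /// Called by a decoder thread before it starts on a frame.
    fn acquire(&self, pixels: u64) {
        let mut remaining = self.remaining.lock().unwrap();
        while *remaining < pixels {
            remaining = self.freed.wait(remaining).unwrap();
        }
        *remaining -= pixels;
    }

    /// Return `pixels` to the budget once a frame has been composited.
    fn release(&self, pixels: u64) {
        *self.remaining.lock().unwrap() += pixels;
        self.freed.notify_all();
    }
}
```

Since the ANMF header gives each frame's dimensions before decoding, the decoder knows how many pixels to `acquire` up front.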

Compositing frames in parallel

In the general case compositing is inherently sequential: it requires the previous frame to be fully composited before the next one can start. However, each row in the image can be processed independently, so we should be able to just use .par_chunks_mut() on the buffer and composite each row separately.

The profile in #119 shows that we're spending a lot of time just doing alpha blending, so this could translate to significant performance gains.
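The row-parallel split could look like the following sketch. The blend formula here is a generic non-premultiplied source-over, purely illustrative rather than image-webp's actual compositing math, and std scoped threads stand in for rayon to keep the sketch dependency-free; with rayon the outer function collapses to `dst.par_chunks_mut(row_len).zip(src.par_chunks(row_len))`:

```rust
/// Illustrative source-over alpha blend of one RGBA row (not the exact
/// formula image-webp uses).
fn blend_row(dst: &mut [u8], src: &[u8]) {
    for (d, s) in dst.chunks_exact_mut(4).zip(src.chunks_exact(4)) {
        let a = s[3] as u32;
        for c in 0..3 {
            d[c] = ((s[c] as u32 * a + d[c] as u32 * (255 - a)) / 255) as u8;
        }
        d[3] = d[3].max(s[3]);
    }
}

/// Composite each row independently. Spawning one OS thread per row is
/// wasteful in practice; rayon's work-stealing pool via `par_chunks_mut`
/// would amortize that cost.
fn composite_rows(dst: &mut [u8], src: &[u8], row_len: usize) {
    std::thread::scope(|scope| {
        for (d, s) in dst.chunks_mut(row_len).zip(src.chunks(row_len)) {
            scope.spawn(move || blend_row(d, s));
        }
    });
}
```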

Still images?

libwebp also has optional multi-threading support for still images. I've benchmarked dwebp compared to dwebp -mt on a lossy image and only saw a 20% improvement in performance. The single-threaded bottleneck in #71 also makes it seem like we won't get much out of single-image parallelism. I don't think it's worth the trouble for static images.

@Shnatsel
Contributor Author

Shnatsel commented Nov 5, 2024

I think I've overcomplicated parallelizing animations. Having just two threads, one for decoding and one for compositing, is going to get us nearly all of the achievable speedup.

Since compositing is inherently sequential, decoding many frames at the same time wouldn't improve performance. Frames are consumed one by one anyway, so if we cap the number of frames held in memory at once, the pipeline fills up to that cap and then processes one frame at a time regardless.

And compositing in parallel isn't very useful either, because compositing only needs to keep pace with decoding; beyond that, the decoding thread becomes the bottleneck and we gain nothing.
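The two-thread split could be as simple as a bounded channel between a decoder thread and the compositor, with the channel capacity doubling as the backpressure cap. A dependency-free sketch with placeholder work (`run_pipeline` and its stand-in decode/composite steps are hypothetical):

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

/// Two-stage pipeline sketch: a decoder thread feeds a bounded channel,
/// the calling thread composites. `sync_channel(cap)` blocks the sender
/// once `cap` frames are queued, giving us backpressure for free.
fn run_pipeline(num_frames: usize, cap: usize) -> Vec<u64> {
    let (tx, rx) = sync_channel::<Vec<u8>>(cap);
    let decoder = thread::spawn(move || {
        for i in 0..num_frames {
            // stand-in for decoding one ANMF frame
            tx.send(vec![i as u8; 4]).unwrap();
        }
        // tx is dropped here, which ends the receiver's iteration
    });
    let mut out = Vec::new();
    for frame in rx {
        // stand-in for compositing the frame onto the canvas
        out.push(frame.iter().map(|&b| u64::from(b)).sum());
    }
    decoder.join().unwrap();
    out
}
```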

@fintelia
Contributor

fintelia commented Nov 5, 2024

if we cap the amount of frames held in memory at the same time the pipeline is going to fill up to that amount and then process one frame at a time anyway.

This isn't true. If decoding is 10x slower than compositing, then once the pipeline fills up we pull one frame to composite and start decoding one more frame. But nine additional frames will be pulled from the pipeline for compositing before that first additional decode finishes, so the buffer drains instead of staying full.
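This can be seen in a back-of-envelope model (my own toy arithmetic, not measurements): assume decoding takes 10 time units per frame, compositing 1, and the pipeline starts with `buffer` pre-decoded frames. After the buffer drains, the compositor waits on the decoder, so a deeper buffer only shifts when the drain happens; steady-state throughput stays one frame per decode interval.

```rust
/// Toy model: the decoder emits frame f at time (f - buffer) * decode
/// (frames 1..=buffer are assumed pre-decoded at time 0), and the
/// compositor takes `composite` units per frame. Returns the time at
/// which the last frame is composited.
fn finish_time(frames: u64, buffer: u64, decode: u64, composite: u64) -> u64 {
    let mut t = 0u64; // time at which the compositor becomes free
    for f in 1..=frames {
        let ready = if f <= buffer { 0 } else { (f - buffer) * decode };
        // the compositor starts a frame when it is free AND the frame is ready
        t = t.max(ready) + composite;
    }
    t
}
```

With 20 frames, decode = 10, composite = 1, a buffer of 5 finishes at t = 151 and a buffer of 10 at t = 101: doubling the buffer only saves the head start, while the tail of the run is decode-bound either way.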
