Memory optimizations to allow bigger images #2015
Closed
I realized that my hardware could generate bigger images than it was actually able to. With these changes I went from a limit of 1024x1024 to 2184x2184 on my 2GB VRAM GPU, which is ~4x bigger.
Now that we have the LCM and only need 5 steps, it is reasonable to generate bigger images, and the only limitation was OOM errors.
List of changes:
- **TAESD PREVIEW** - The TAESD preview uses the GPU by default, and an OOM in the preview stops the generation even when there is enough memory for the generation process itself. The first change is to ignore an OOM in the preview and simply not show the preview in that case. The second change is the command line option `--preview-cpu`, which runs the preview on the CPU, so you can get TAESD previews even with low VRAM and big sizes.
- **SPLIT ATTENTION** - Split attention allows bigger images, but it limited the number of steps to a fixed amount that was not enough in many cases. Another issue is that the step counts were powers of 2, while a step count must be a divisor of `q.shape[1]`. For sizes like 512x512, 1024x1024, or 2048x2048 a power of 2 is fine, but for other sizes the divisors of `q.shape[1]` are not powers of two, so the code skipped the correct divisor and hit an OOM even though a different number of steps would have worked. In other situations the estimated memory requirement is bigger than the correct divisor allows, which also skips it. To solve this I added the command line option `--memory-estimation-multiplier`, so we can adjust the value of the `modifier` that already exists to fit our needs, and even disable the estimation by setting it to zero. The other change is to compute the next step count by finding the next divisor instead of multiplying by 2. With these changes it is unlikely anyone will get an OOM in this part of the code again if they adjust the memory multiplier properly when needed; based on my tests, you will hit an OOM error in other parts of the code first, in places that cannot be split anyway.
- **VAE ENCODE/DECODE** - The last thing limiting generation was the VAE encoding/decoding. It already has a tiled fallback in case of OOM, but it runs with only the default tile size, and it often still gets an OOM even when a smaller tile size would solve the problem. Now it first tries the default tile size as before, and on OOM during the tiled process it retries with a smaller tile size. I set the lower limit to the smallest value possible; it is unlikely to get there unless someone is generating an extreme batch size for their hardware, but if it does, it will still work. Even this lowest tile size has reasonable quality for such extreme tiling, which is better than an OOM that makes you lose all the images.
The default values keep the same behavior as before, so for those who don't have OOM issues everything works the same way.