
ComfyUI/Flux memory utilization when loading model ? #4318

Open
snoil411 opened this issue Aug 11, 2024 · 19 comments
Labels
User Support: A user needs help with something, probably not a bug.

Comments

@snoil411

Your question

First time ComfyUI user coming from Automatic1111. I've had no issues using SD, SDXL and SD3 with ComfyUI, but I haven't managed to get Flux working due to memory issues. I've read about a lot of people having similar issues but am confused about the following.

I have 32 GB RAM and 16 GB VRAM (AMD card).
I started with flux1Dev_v10.safetensors and t5xxl_fp16.safetensors, as I read many people were successful with the same hardware I have.

The model sits there 'loading' for roughly 5 minutes; during that time RAM completely fills to 99%, and after that HDD utilization sits at 100% until it loads. After it loads the clips and VAE, during sampling it spits out the 'not enough VRAM' error that I've seen many people get but so far haven't seen a solution for.

I tried using the Schnell model instead + t5xxl_fp8_e4m3fn (i.e. half the size) but get the same thing. It takes 500+ seconds to load the 11 GB model, my RAM usage goes to 100%, and the HDD again sits at 100% utilization. Then I'm told again there's not enough VRAM. QUESTION: why does the Schnell model utilize so much RAM? By comparison, I load SD3 (stableDiffusion3SD3_sd3MediumInclT5XXL), which is roughly the same size at 10.5 GB, in less than a minute and it renders fine.

Cheers....

Logs

No response

Other

No response

snoil411 added the User Support label on Aug 11, 2024
@ltdrdata
Collaborator

FLUX fp8 model size is 11 GB without T5.
FLUX fp16 model size is 24 GB without T5.

And the T5 fp8 is 4 GB+; fp16 is 9 GB+.

Make sure you are using fp8.
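A rough back-of-envelope check of these sizes, assuming ~12B parameters for the Flux transformer and ~4.7B for the T5-XXL encoder (approximate figures, not stated in this thread), is simply parameter count times bytes per weight:

```python
# Approximate checkpoint size = parameter count x bytes per weight.
# Parameter counts below are approximations, not exact figures.
def approx_size_gb(n_params: float, bytes_per_weight: float) -> float:
    return n_params * bytes_per_weight / 1e9

for name, n_params in [("FLUX (~12B)", 12e9), ("T5-XXL (~4.7B)", 4.7e9)]:
    print(f"{name}: fp16 ~{approx_size_gb(n_params, 2):.0f} GB, fp8 ~{approx_size_gb(n_params, 1):.0f} GB")
# FLUX (~12B): fp16 ~24 GB, fp8 ~12 GB
# T5-XXL (~4.7B): fp16 ~9 GB, fp8 ~5 GB
```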

@JorgeR81

JorgeR81 commented Aug 12, 2024

I have 32 GB RAM and I can use Flux.
But I also run out of RAM when the model is loading, so my system needs to use the page file, leading to SSD activity.

When VRAM is full, the system offloads to RAM.
And when RAM is full, it offloads to the page file (on your SSD).

Maybe you need to change some settings on your page file.
#4226

Even the fp8 version of Flux needs more than 32 GB, because it's upcast to fp32 during loading and then back down to fp8.
#4239

Recently there was a commit that adds support for Flux to be upcast to fp16 instead.
But it still needs more than 32 GB (not sure why?)
(and it also slows down my inference speeds, unfortunately).
8115d8c

Also, NF4 support was added today to ComfyUI, via a custom node, so we can use even smaller NF4 model versions.
I haven't tried it yet, but it seems it still needs the same amount of VRAM.
comfyanonymous/ComfyUI_bitsandbytes_NF4#9 (comment)

But I think support for these formats will probably be optimized in the future, so that we'll need fewer system resources to use Flux.
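A minimal sketch (not ComfyUI's actual loading code) of why upcasting an fp8 checkpoint through fp32 during loading spikes RAM well above the size of the file on disk:

```python
# Hypothetical illustration: converting fp8 weights via an fp32 intermediate
# temporarily holds extra copies in RAM (4 bytes/weight for the fp32 temp).
# Requires a recent PyTorch with float8 dtypes (2.1+).
import torch

def upcast_then_downcast(state_dict: dict) -> dict:
    out = {}
    for name, tensor in state_dict.items():
        fp32_tmp = tensor.to(torch.float32)            # temporary fp32 copy
        out[name] = fp32_tmp.to(torch.float8_e4m3fn)   # final fp8 copy
        # peak RAM per tensor ~ original + fp32 temp + fp8 result
    return out
```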

@kakachiex2

I have 6 GB of VRAM (RTX260) and I can use Flux, but at the cost of waiting 1 hour for the model to load. This behavior started when I updated ComfyUI; before, it took 20 to 30 minutes to load.

@ltdrdata
Collaborator

ltdrdata commented Aug 12, 2024

I have 32 GB RAM and I can use Flux. But I also run out of RAM when the model is loading, so my system needs to use the page file, leading to SSD activity.

When VRAM is full, the system offloads to RAM. And when RAM is full, it offloads to the page file (on your SSD).

Maybe you need to change some settings on your page file. #4226

Even the fp8 version of Flux needs more than 32 GB, because it's upcast to fp32 during loading and then back down to fp8. #4239

Recently there was a commit that adds support for Flux to be upcast to fp16 instead. But it still needs more than 32 GB (not sure why?) (and it also slows down my inference speeds, unfortunately). 8115d8c

Also, NF4 support was added today to ComfyUI, via a custom node, so we can use even smaller NF4 model versions. I haven't tried it yet, but it seems it still needs the same amount of VRAM. comfyanonymous/ComfyUI_bitsandbytes_NF4#9 (comment)

But I think support for these formats will probably be optimized in the future, so that we'll need fewer system resources to use Flux.

The issue is said to be a limitation of the safetensors library.
To address this, an update has been made to load the TextEncoder into VRAM when there is sufficient VRAM available.
This is expected to alleviate the temporary RAM shortage that causes swapping.
Try updating ComfyUI and testing again.

5c69cde
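For reference, the safetensors library can load a file straight onto a CUDA device, which is the general idea behind loading the text encoder into VRAM instead of staging it fully in system RAM. A hedged sketch (the 10 GB threshold and file name are just examples; "cuda" applies to NVIDIA/ROCm builds of PyTorch, not DirectML):

```python
import torch
from safetensors.torch import load_file

# Crude "is there enough VRAM?" check; the 10 GB threshold is an arbitrary example.
free_bytes, total_bytes = torch.cuda.mem_get_info()
device = "cuda" if free_bytes > 10 * 1024**3 else "cpu"

# Loading directly to the GPU avoids holding a full extra copy in system RAM.
state_dict = load_file("t5xxl_fp8_e4m3fn.safetensors", device=device)
```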

@JorgeR81

JorgeR81 commented Aug 12, 2024

The issue is said to be a limitation of the safetensors library.
To address this, an update has been made to load the TextEncoder into VRAM when there is sufficient VRAM available.
This is expected to alleviate the temporary RAM shortage that causes swapping.
Try updating ComfyUI and testing again.

5c69cde

Still the same.
The page file is still needed when the model is loading (I tried the fp8 checkpoint version).
And about the same loading and sampling (sec/it) times.

I never ran out of RAM when changing the prompt while using the fp8 models.
I only had that issue with the fp16 models.

Also, when using the UNET loader in fp8 mode, I run out of RAM even before it starts loading the t5 encoder.

@JorgeR81

I have 6 GB of VRAM (RTX260) and I can use Flux, but at the cost of waiting 1 hour for the model to load. This behavior started when I updated ComfyUI; before, it took 20 to 30 minutes to load.

You have some sort of issue here.

My loading time is about 1 minute.
I have a SATA SSD.

@snoil411
Author

FLUX fp8 model size is 11 GB without T5. FLUX fp16 model size is 24 GB without T5.
And the T5 fp8 is 4 GB+; fp16 is 9 GB+.
Make sure you are using fp8.

As said in the OP, I first tried fp16 then fp8, but as someone else said below, fp8 is upcast to fp32 during loading, so I assume this is what causes my issue.

Even the fp8 version of Flux needs more than 32 GB, because it's upcast to fp32 during loading and then back down to fp8.
#4239

I didn't know this, so I'm guessing this is the issue for the RAM filling up.

My other question then is what could cause the VRAM issue(s) (or how/what can I do to identify the issue)? After it is downcast back to fp8 and the clip models and VAE are loaded, I still get 'not enough VRAM' issues (I have 16 GB). The guy above me has only 6 GB of VRAM (RTX260), compared to my 16 GB, and is able to use Flux after the model finally loads.

@JorgeR81

My other question then is what could cause the VRAM issue(s) (or how/what can I do to identify the issue)? After it is downcast back to fp8 and the clip models and VAE are loaded, I still get 'not enough VRAM' issues (I have 16 GB). The guy above me has only 6 GB of VRAM (RTX260), compared to my 16 GB, and is able to use Flux after the model finally loads.

Yeah, that may be a different issue.
I can also run Flux with 8 GB VRAM (GTX 1070).
When VRAM is full, the system should offload to RAM.
And once the Flux model is done loading, it should be in fp8, so you should have RAM available.
Have you looked at your task manager when you try to generate an image?
#4226 (comment)
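If watching Task Manager is awkward, a small helper along these lines can log RAM and VRAM at key points (assumes psutil is installed; torch.cuda only reports VRAM on NVIDIA/ROCm builds, not DirectML):

```python
import psutil
import torch

def memory_snapshot(tag: str = "") -> None:
    # System RAM via psutil, VRAM via torch.cuda (when available).
    ram = psutil.virtual_memory()
    line = f"[{tag}] RAM: {ram.used / 1024**3:.1f} / {ram.total / 1024**3:.1f} GB"
    if torch.cuda.is_available():
        free_b, total_b = torch.cuda.mem_get_info()
        line += f" | VRAM: {(total_b - free_b) / 1024**3:.1f} / {total_b / 1024**3:.1f} GB"
    print(line)

memory_snapshot("after model load")
memory_snapshot("during sampling")
```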

@snoil411
Author

snoil411 commented Aug 12, 2024

And once the Flux model is done loading, it should be in fp8, so you should have RAM available.

According to task manager, RAM stays full once the model is loaded? Despite it being in fp8?

When VRAM is full, the system should offload to RAM.

I'd assume that's my problem then? My RAM remains full even though there should be some available after the model is downcast to fp8, so VRAM now cannot be offloaded to RAM?

That still leaves the question as to why RAM remains full after the model is loaded and downcast.

Out of curiosity, how much VRAM would the Schnell and Dev models occupy, respectively, if you had unlimited VRAM to play around with?

EDIT** I was thinking of trying another large non-Flux model to see if I could get the same results/errors. I'm only familiar with SD, and the biggest is SD3 (works flawlessly). Are there any other LARGE 20 GB+ models that are likely to fill up my RAM and VRAM?

[attached screenshot: comfymem]

@JorgeR81

I'd assume that's my problem then? My RAM remains full even though there should be some available after the model is downcast to fp8, so VRAM now cannot be offloaded to RAM?

Yeah, I also think that's the problem.
My RAM usage drops to about 20 GB after the fp8 model is done loading.

Out of curiosity, how much VRAM would the Schnell and Dev models occupy, respectively, if you had unlimited VRAM to play around with?

The Schnell and Dev models are the same size.
The difference is between the fp8 and fp16 versions of them.
The T5 encoder also has fp8 and fp16 versions.

FLUX fp8 model size is 11 GB without T5.
FLUX fp16 model size is 24 GB without T5.

And the T5 fp8 is 4 GB+; fp16 is 9 GB+.

@snoil411
Author

My RAM usage drops to about 20 GB after the fp8 model is done loading.

What is your VRAM usage when this happens? And is this with the fp16 or fp8 models?

Just trying to sum up the requirements and see how it varies from person to person.
If anyone else wants to include their RAM + VRAM usage once all the models are loaded and an image is generating, please do.

I was grasping at straws, and though I doubt it will have an effect on memory (I have no idea :) ), I thought I'd use ROCm instead of DirectML. But PyTorch + ROCm 6 is only available on Linux, so I'm off to do an Ubuntu install on an empty SSD.
Even if I can't get Flux working, SD speeds are supposed to be quite a bit faster on Linux.

The Schnell and Dev models are the same size.

Yeah, I got a bit confused; I downloaded different fp8 Schnell models from different sources. One was 11 GB, one was 17 GB.

@JorgeR81

What is your VRAM usage when this happens? And is this with the fp16 or fp8 models?

  • With fp8, while generating (KSampler), I use about 14+7 GB of RAM+VRAM. When idle, it's about 20+1 GB.

  • With fp16, while generating, I use about 20+7 GB of RAM+VRAM. When idle, it's about 26+1 GB.

So VRAM usage is the same with fp8 or fp16.
I think the VRAM is filled as much as possible during generation (8 GB in my case), and the rest goes to RAM.
Ideally the whole model should fit in VRAM.
The more you need to keep in RAM, the slower the generation will be.

so I'm off to do an Ubuntu install on an empty SSD

Are you on Linux or Windows right now?
If you can try ComfyUI on Linux, you probably should.
I have read here that ComfyUI has better support for AMD GPUs on Linux.

One was 11 GB, one was 17 GB

These are both fp8.
The 11 GB one uses the UNet loader node, and the t5 encoder is loaded from another node.
The 17 GB one includes the t5 encoder and uses the default checkpoint loader node.

The workflows are in the example images:
https://comfyanonymous.github.io/ComfyUI_examples/flux/
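A toy illustration of that split (not ComfyUI's actual allocator): as much of the model as fits goes to VRAM, minus some headroom for activations, and the remainder stays in RAM:

```python
# Sizes in GB are approximate; reserved_gb is a guessed headroom for activations.
def split_model(model_gb: float, vram_gb: float, reserved_gb: float = 1.0) -> tuple[float, float]:
    in_vram = min(model_gb, max(vram_gb - reserved_gb, 0.0))
    in_ram = model_gb - in_vram
    return in_vram, in_ram

print(split_model(11.0, 8.0))   # fp8 Flux on an 8 GB card  -> (7.0, 4.0)
print(split_model(11.0, 16.0))  # fp8 Flux on a 16 GB card -> (11.0, 0.0)
```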

@snoil411
Author

Are you on Linux or Windows right now?
If you can try ComfyUI on Linux, you probably should.
I have read here that ComfyUI has better support for AMD GPUs on Linux.

All my above issues were on Windows, but I had it set up on Linux in about 30 minutes this time, with ROCm and PyTorch. I had SOME unrelated issues getting ROCm to work with an integrated GPU (7900XT), but it was an easy fix.

fp16 Flux is now working. It's amazing how much faster the models load. A 1024x1024 image generates in less than 2 minutes, including loading the models.

Still a mystery as to why I had/have issues on Windows, but I guess I'll stick to Linux for AI.

Thanks for the help and explanations.

@BigBanje

My 3090 can no longer do anything with Flux, as I experience extreme bottlenecking after some recent update to ComfyUI.

I can't even use fp8 anymore... I was previously using LoRAs with fp16 + hires fix, no issue. Now I can't make a single creation at all.

Everything is updated, so I'm just going to (once again) go through the process of downgrading my ComfyUI until I find a version that isn't broken...


@JorgeR81

@BigBanje, I think you meant to post on the larger thread. 
This issue is probably related.
But @comfyanonymous asked yesterday for users to post their system specs and exact workflow, if the issue persists.

#4271 (comment)

@ltdrdata
Collaborator

My 3090 can no longer do anything with Flux, as I experience extreme bottlenecking after some recent update to ComfyUI.

I can't even use fp8 anymore... I was previously using LoRAs with fp16 + hires fix, no issue. Now I can't make a single creation at all.

Everything is updated, so I'm just going to (once again) go through the process of downgrading my ComfyUI until I find a version that isn't broken...

If you could identify which commit started causing this issue, it would be helpful for debugging.

@jslegers

My 3090 can no longer do anything with Flux, as I experience extreme bottlenecking after some recent update to ComfyUI.
I can't even use fp8 anymore... I was previously using LoRAs with fp16 + hires fix, no issue. Now I can't make a single creation at all.

I noticed that UNETLoader.load_unet takes a lot more memory since the most recent changes when loading a FLUX transformer UNet with weight_dtype fp8_e4m3fn.

Before the changes I could stay under 12 GB total VRAM usage when loading an fp8_e4m3fn version of flux1-schnell after first loading the t5xxl text encoder (given a minor tweak to unet_offload_device; see #4319).

After the changes, I run into the 16 GB memory limit when the FLUX transformer UNet is loaded.

See also #4341, #4343 & #4338
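Outside of ComfyUI's own offload logic, the plain-PyTorch version of this idea is to move the text encoder to the CPU and release cached VRAM before the large transformer goes onto the GPU. A hedged sketch (the function name is hypothetical, not ComfyUI's unet_offload_device mechanism):

```python
import torch

def make_room_for_unet(text_encoder: torch.nn.Module) -> None:
    # Keep the text encoder resident in system RAM for the next prompt,
    # then hand freed VRAM back to the allocator before loading the UNet.
    text_encoder.to("cpu")
    torch.cuda.empty_cache()
```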

@Francklin9999

I have an i7-12700H CPU with 16 GB RAM and a 3070 Ti with 8 GB VRAM (16 GB shared memory), and I run out of memory with both my CPU and GPU. When running on my GPU I get a 'ran out of memory' error, and on my CPU it just gets killed. Does anyone know how I can still run it so I can generate images? I don't care if it takes an hour.

@JorgeR81

JorgeR81 commented Aug 20, 2024

I have an i7-12700H CPU with 16 GB RAM and a 3070 Ti with 8 GB VRAM (16 GB shared memory), and I run out of memory with both my CPU and GPU. When running on my GPU I get a 'ran out of memory' error, and on my CPU it just gets killed. Does anyone know how I can still run it so I can generate images? I don't care if it takes an hour.

This custom node allows you to use Flux and the t5 text encoder in smaller formats.
https://github.com/city96/ComfyUI-GGUF

They will use less RAM.
I'm able to stay below my 32 GB RAM limit.
Not sure about 16 GB RAM though.

The Q4 formats are below 8 GB in size.
I tried Q4_1, but the new Q4_K should have better quality.
I haven't tried the smaller text encoder formats yet, because those have just been added.
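As a rough size estimate for these quantized variants, assuming ~12B parameters and approximate effective bits per weight for each GGUF quant type:

```python
# Effective bits-per-weight values below are approximations for GGUF quant types.
N_PARAMS = 12e9

for fmt, bits in [("fp16", 16), ("fp8", 8), ("Q8_0", 8.5), ("Q4_K", 4.5)]:
    print(f"{fmt}: ~{N_PARAMS * bits / 8 / 1e9:.1f} GB")
# fp16: ~24.0 GB, fp8: ~12.0 GB, Q8_0: ~12.8 GB, Q4_K: ~6.8 GB
```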
