Support FLUX series models #1445

Open
ddpasa opened this issue Aug 2, 2024 · 69 comments

Comments

@ddpasa

ddpasa commented Aug 2, 2024

These models have just been released and appear to be amazing. Links below:

Blog from fal.ai: https://blog.fal.ai/flux-the-largest-open-sourced-text2img-model-now-available-on-fal/

Huggingface: https://huggingface.co/black-forest-labs

There is a schnell version and a dev version.

@Oliverkuien

Totally agree!

@LazyCat420

Is it possible to fine-tune the model on a 3090, or do we have to use a LoRA due to the size?

@ThereforeGames

ThereforeGames commented Aug 3, 2024

I'm wondering if image gen models would benefit from the sophisticated quantization methods that are popular in the LLM space, like GGUF. Any ongoing research in this area?

Apparently some folks have trained LoRAs on quantized LLMs to good effect, e.g. https://old.reddit.com/r/LocalLLaMA/comments/13q8zjc/how_much_why_does_quantization_negatively_affect/
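
For what it's worth, the post-training quantization idea already maps onto image models fairly directly. Here's a rough, untested sketch of 8-bit-quantizing the Flux transformer before inference, assuming a recent diffusers (FluxPipeline / FluxTransformer2DModel) plus optimum-quanto; the model id, dtype and prompt are just illustrative, not an official recipe:

```python
# Rough sketch: quantize the Flux transformer's weights to 8-bit with optimum-quanto,
# then run the normal diffusers pipeline around it. Not an official recipe.
import torch
from diffusers import FluxPipeline, FluxTransformer2DModel
from optimum.quanto import freeze, qfloat8, quantize

model_id = "black-forest-labs/FLUX.1-schnell"  # or FLUX.1-dev

transformer = FluxTransformer2DModel.from_pretrained(
    model_id, subfolder="transformer", torch_dtype=torch.bfloat16
)
quantize(transformer, weights=qfloat8)  # swap Linear weights for 8-bit versions
freeze(transformer)                     # materialize the quantized weights

pipe = FluxPipeline.from_pretrained(
    model_id, transformer=transformer, torch_dtype=torch.bfloat16
)
pipe.enable_model_cpu_offload()         # keep peak VRAM down on consumer cards

image = pipe("a street sign in the rain", num_inference_steps=4).images[0]
image.save("quantized_test.png")
```

Whether the heavier GGUF-style schemes transfer just as well to DiT-style image models is exactly the open question above.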

@leonary

leonary commented Aug 3, 2024

I totally agree. Since SD3 may not be able to fit even a slightly larger dataset due to problems with the model itself (regardless of the training script: SimpleTuner, sd-scripts, OneTrainer), I'd recommend stopping development of the SD3 training scripts. I did a simple test with Flux-dev, and its capabilities are completely superior to SD3's. Here are some examples:
[example images attached]
It's worth pointing out that this is the first model I've seen that can correctly draw the position of the umbrella handle relative to the canopy. The text prompted for the road sign was "iiilllllbddbwW"; although the model didn't draw it correctly, I haven't seen any other model that can draw it correctly either.

@dill-shower

dill-shower commented Aug 4, 2024

> I totally agree. Since SD3 may not be able to fit even a slightly larger dataset due to problems with the model itself (regardless of the training script: SimpleTuner, sd-scripts, OneTrainer), I'd recommend stopping development of the SD3 training scripts. I did a simple test with Flux-dev, and its capabilities are completely superior to SD3's. Here are some examples:

I strongly disagree. While the SD3 Medium model has certain drawbacks, it possesses a crucial advantage that FLUX lacks: its weights are publicly available. In contrast, FLUX only provides access to the base model's weights through an API, with no indication or information suggesting they plan to make it open-source. The models that are publicly accessible are derived through distillation of the base model; they are truncated, incomplete, and practically unsuitable for further training. It only makes sense to train the model we weren't given, as fine-tuning the distilled models would require roughly the same effort as training from scratch, if not more. Even the SDXL model was superior in this regard.

Calling it open-source is akin to labeling GPT-4o as open-source simply because we were given GPT-3 weights and the ability to fine-tune it. I'm concerned that we'll be wasting time that could be better spent studying SD3, debugging and optimizing its training script. SD3 has more potential, and Stability AI has promised to eventually release all models, including their weights, as open-source. This makes SD3 a more promising avenue for our efforts

@leonary

leonary commented Aug 5, 2024

> I strongly disagree. While the SD3 Medium model has certain drawbacks, it possesses a crucial advantage that FLUX lacks: its weights are publicly available. In contrast, FLUX only provides access to the base model's weights through an API, with no indication or information suggesting they plan to make it open-source. The models that are publicly accessible are derived through distillation of the base model; they are truncated, incomplete, and practically unsuitable for further training. It only makes sense to train the model we weren't given, as fine-tuning the distilled models would require roughly the same effort as training from scratch, if not more. Even the SDXL model was superior in this regard.

Hello, the weights for the Flux series models have been released, including the dev version and the schnell version. The weights for the Pro version have not been released and can only be accessed via API, but the performance gap between the dev and Pro versions is not significant, and both appear to have surpassed SD3. You can find their weights here:
flux_dev
flux_schnell
Diffusers has initial support for LoRA training with Flux, which you can find here:
diffusers
SimpleTuner has initial compatibility with Flux LoRA training in its scripts, which you can find here:
SimpleTuner
ComfyUI now supports Flux, with initial LoRA support, which you can find here:
ComfyUI

@dill-shower

> Hello, the weights for the Flux series models have been released, including the dev version and the schnell version.

Please read this https://blog.fal.ai/flux-the-largest-open-sourced-text2img-model-now-available-on-fal/
Dev and schnell were obtained by distilling the Pro weights. It is possible to create LoRAs for them, and they will work. But full model training is practically impossible because of this.

@ddpasa
Author

ddpasa commented Aug 5, 2024

> Hello, the weights for the Flux series models have been released, including the dev version and the schnell version.

> Please read this https://blog.fal.ai/flux-the-largest-open-sourced-text2img-model-now-available-on-fal/ Dev and schnell were obtained by distilling the Pro weights. It is possible to create LoRAs for them, and they will work. But full model training is practically impossible because of this.

It should be possible to fine-tune distilled models.

@dill-shower

dill-shower commented Aug 5, 2024

> It should be possible to fine-tune distilled models.

Why should it? I just did a quick search for information about training SDXL Turbo, and it turns out it was also obtained through distillation from the base model. There are tons of such models on Civitai, but they're all created by merging SDXL Turbo with something else. I couldn't find a single one obtained through fine-tuning. The only relevant post I came across was a complaint on Reddit about how training SDXL Turbo produces very poor results. As I expected. https://www.reddit.com/r/StableDiffusion/comments/18l2qp0/sdxl_turbo_fine_tunemerging/

@ddpasa
Author

ddpasa commented Aug 5, 2024

> It should be possible to fine-tune distilled models.

> Why should it? I just did a quick search for information about training SDXL Turbo, and it turns out it was also obtained through distillation from the base model. There are tons of such models on Civitai, but they're all created by merging SDXL Turbo with something else. I couldn't find a single one obtained through fine-tuning. The only relevant post I came across was a complaint on Reddit about how training SDXL Turbo produces very poor results. As I expected. https://www.reddit.com/r/StableDiffusion/comments/18l2qp0/sdxl_turbo_fine_tunemerging/

That is because the training code for Turbo was never released and nobody wrote one. It's not fundamentally impossible.

@bghira

bghira commented Aug 7, 2024

Even training schnell with a LoRA or a full tune is fine. They're just big models and require the use of LoRA with quantised base weights, but Kohya should probably wait for the bugs in Quanto to be worked out before trying to integrate it; right now it makes a mess of the model state dict keys.
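
To make "LoRA on top of quantised base weights" concrete, here's a minimal hand-rolled sketch of the idea (my own toy illustration, not how sd-scripts or Quanto actually implement it): the base linear layer stays frozen (in practice it would be a quantized layer), and only the low-rank adapter receives gradients.

```python
# Concept sketch: freeze the (quantized) base weights, train only a LoRA adapter on top.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():        # base weights stay frozen
            p.requires_grad_(False)
        self.lora_down = nn.Linear(base.in_features, rank, bias=False)
        self.lora_up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_up.weight)     # adapter starts as a no-op
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.lora_up(self.lora_down(x)) * self.scale

# Toy usage: only the adapter parameters receive gradients.
layer = LoRALinear(nn.Linear(64, 64), rank=4, alpha=1.0)
layer(torch.randn(2, 64)).sum().backward()
print([n for n, p in layer.named_parameters() if p.grad is not None])
# -> ['lora_down.weight', 'lora_up.weight']
```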

@BenDes21

BenDes21 commented Aug 7, 2024

@kohya-ss Training scripts released : https://github.com/XLabs-AI/x-flux

@bghira

bghira commented Aug 7, 2024

Those are pretty minimal; e.g. they don't implement cosmap/logit-norm or any of the SD3 training details, and are just about the same as the cloneofsimo/minRF implementation. In fact it's basically identical - the interesting thing there is probably their ControlNet training implementation details.

@FurkanGozukara

diffusers scripts arrived

https://github.com/huggingface/diffusers/blob/main/examples/dreambooth/README_flux.md

@ddpasa
Author

ddpasa commented Aug 9, 2024

> diffusers scripts arrived
> https://github.com/huggingface/diffusers/blob/main/examples/dreambooth/README_flux.md

@FurkanGozukara , you are amazing as usual!

@FurkanGozukara

@ddpasa thanks

A pull request has already arrived :D

https://github.com/kohya-ss/sd-scripts/pull/1374/files/da4d0fe0165b3e0143c237de8cf307d53a9de45a..36b2e6fc288c57f496a061e4d638f5641c32c9ea

@cyan2k

cyan2k commented Aug 10, 2024

> It should be possible to fine-tune distilled models.

> Why should it? I just did a quick search for information about training SDXL Turbo, and it turns out it was also obtained through distillation from the base model. There are tons of such models on Civitai, but they're all created by merging SDXL Turbo with something else. I couldn't find a single one obtained through fine-tuning. The only relevant post I came across was a complaint on Reddit about how training SDXL Turbo produces very poor results. As I expected. https://www.reddit.com/r/StableDiffusion/comments/18l2qp0/sdxl_turbo_fine_tunemerging/

Do you guys also love it when someone is so confidently incorrect?

My flux finetune is coming along very nicely. Huge upgrade compared to SDXL and Pony, and also way more trainable than SD3 Medium. It's literally impossible to add NSFW to SD3 Medium because of the complete lack of NSFW content in its training data. No finetuner is going to finish SAI's pathetic job. Nobody is ever going to create any kind of content for SD3 when you can create better results for the same money with Flux. So yeah, RIP.

Flux seems to have seen plenty of NSFW images; it was just filtered and dropped out via captioning. So the context and knowledge already exist in the latent space, and it only needs to... well, get fine-tuned.

So, yeah f*ck SD3. Pyro's NSFW model goes FLUX.

@protector131090

> It should be possible to fine-tune distilled models.
>
> Why should it? I just did a quick search for information about training SDXL Turbo, and it turns out it was also obtained through distillation from the base model. There are tons of such models on Civitai, but they're all created by merging SDXL Turbo with something else. I couldn't find a single one obtained through fine-tuning. The only relevant post I came across was a complaint on Reddit about how training SDXL Turbo produces very poor results. As I expected. https://www.reddit.com/r/StableDiffusion/comments/18l2qp0/sdxl_turbo_fine_tunemerging/
>
> Do you guys also love it when someone is so confidently incorrect?
>
> My flux finetune is coming along very nicely. Huge upgrade compared to SDXL and Pony, and also way more trainable than SD3 Medium. It's literally impossible to add NSFW to SD3 Medium because of the complete lack of NSFW content in its training data. No finetuner is going to finish SAI's pathetic job. Nobody is ever going to create any kind of content for SD3 when you can create better results for the same money with Flux. So yeah, RIP.
>
> Flux seems to have seen plenty of NSFW images; it was just filtered and dropped out via captioning. So the context and knowledge already exist in the latent space, and it only needs to... well, get fine-tuned.
>
> So, yeah f*ck SD3. Pyro's NSFW model goes FLUX.

What are you talking about? I trained 3.0 for 30 minutes and it can generate NSFW just fine. NSFW link: https://imgur.com/a/sd-30-test-G7G7G6u

@dill-shower

> It should be possible to fine-tune distilled models.
>
> Why should it? I just did a quick search for information about training SDXL Turbo, and it turns out it was also obtained through distillation from the base model. There are tons of such models on Civitai, but they're all created by merging SDXL Turbo with something else. I couldn't find a single one obtained through fine-tuning. The only relevant post I came across was a complaint on Reddit about how training SDXL Turbo produces very poor results. As I expected. https://www.reddit.com/r/StableDiffusion/comments/18l2qp0/sdxl_turbo_fine_tunemerging/
>
> Do you guys also love it when someone is so confidently incorrect?
>
> My flux finetune is coming along very nicely. Huge upgrade compared to SDXL and Pony, and also way more trainable than SD3 Medium. It's literally impossible to add NSFW to SD3 Medium because of the complete lack of NSFW content in its training data. No finetuner is going to finish SAI's pathetic job. Nobody is ever going to create any kind of content for SD3 when you can create better results for the same money with Flux. So yeah, RIP.
>
> Flux seems to have seen plenty of NSFW images; it was just filtered and dropped out via captioning. So the context and knowledge already exist in the latent space, and it only needs to... well, get fine-tuned.
>
> So, yeah f*ck SD3. Pyro's NSFW model goes FLUX.

> What are you talking about? I trained 3.0 for 30 minutes and it can generate NSFW just fine. NSFW link: https://imgur.com/a/sd-30-test-G7G7G6u

Someone just reads too much Reddit and similar places, where everyone is convinced that if a model wasn't trained on NSFW it will never be able to create such things. How people managed to create anime, furry and all the rest of the models for SDXL, nobody knows. Lost technology.

In all seriousness, there's nothing stopping SD3 from learning to create any NSFW content, and even worse. Due to the more efficient architecture, training does not require as much GPU overhead as SDXL.

I don't understand why everyone is so crazy about this FLUX, and why my comment saying that it has no weights and is accessible only via API got downvoted.

@cyan2k

cyan2k commented Aug 11, 2024

> Someone just reads too much Reddit and similar places, where everyone is convinced that if a model wasn't trained on NSFW it will never be able to create such things.

We (a group of SDXL finetuners) spent like 5k bucks making NSFW in SD3 work, but a model that can't even render women lying in grass is so lobotomized that re-introducing NSFW takes immense resources, in the ballpark of SAI's training infrastructure. No hobby finetuner is going to pay for that. Nobody is going to pay for that if they can get way better results for a fraction of the cost with FLUX.

It's not hard to understand. It took 20 bucks to teach FLUX NSFW concepts. $5k vs $20 is pretty clear cut.

> How people managed to create anime, furry and all the rest of the models for SDXL, nobody knows. Lost technology.

Well, it seems that you don't know the basics of how training such models works and how self-organisation of embeddings in the latent space works. LAION, the data corpus of SDXL, is full of furry and anime shit. SD3's data corpus has exactly 0 NSFW images in it. And you honestly have difficulty understanding why one is trainable and the other isn't? Then you're on the wrong board.

Please stop talking about things you don't have a clue about.

Also, FLUX is runnable locally and the weights are public, so I don't even know what "it has no weights and is accessible only via API" even means.

@dill-shower

dill-shower commented Aug 11, 2024

> We (a group of SDXL finetuners) spent like 5k bucks making NSFW in SD3 work, but a model that can't even render women lying in grass is so lobotomized

Stability AI has promised to release the 3.1 model soon, and they promised to fix this problem in it. You were in too much of a hurry.

> Well, it seems that you don't know the basics of how training such models works and how self-organisation of embeddings in the latent space works. LAION, the data corpus of SDXL, is full of furry and anime shit

When SDXL came out, Reddit wrote the same things about it that they are now writing about SD3: that it didn't use NSFW content in training, so NSFW training is impossible; "it's a terrible model, Stability AI killed their reputation by refusing to train on NSFW content, we can't use it, we're staying on SD1.5"... Just like they wrote about SD2... Let's wait a year and find out that there was NSFW in the SD3 dataset but it was removed from the SD4 dataset, so we stay on SD3 and boycott the new model...

> Also, FLUX is runnable locally and the weights are public

Please give me a link to download the Flux Pro model.

@D3voz

D3voz commented Aug 12, 2024

People are using SimpleTuner for Flux LoRA creation. Unfortunately it has no Windows support. Waiting for kohya ss :) Flux dev is so much better than SD3 💯

@kohya-ss
Owner

> People are using SimpleTuner for Flux LoRA creation. Unfortunately it has no Windows support. Waiting for kohya ss :) Flux dev is so much better than SD3 💯

Now sd3 branch supports FLUX.1 dev LoRA training experimentally :)
https://github.com/kohya-ss/sd-scripts/tree/sd3

@leonary

leonary commented Aug 12, 2024

> Stability AI has promised to release the 3.1 model soon, and they promised to fix this problem in it. You were in too much of a hurry.

If SD3.1 could achieve the performance of Flux Dev while allowing training and sharing, and if the machine costs required for fine-tuning are lower than those of Flux Dev, I would be very willing to use SD3.1. However, given the performance of SD3 8b and the licensing of the SD3 series, I am pessimistic about this possibility.

@leonary

leonary commented Aug 12, 2024

> Now sd3 branch supports FLUX.1 dev LoRA training experimentally :) https://github.com/kohya-ss/sd-scripts/tree/sd3

Thank you for your excellent work. The fine-tuning results from sd-scripts with Flux have fully met my expectations, and its performance is on par with SimpleTuner.

Additionally, is there any plan to support Flux in some of the LoRA processing scripts? These scripts could help the community more quickly develop models like "detail enhancer."

@Tophness

Tophness commented Aug 12, 2024

> People are using SimpleTuner for Flux LoRA creation. Unfortunately it has no Windows support. Waiting for kohya ss :) Flux dev is so much better than SD3 💯

> Now sd3 branch supports FLUX.1 dev LoRA training experimentally :) https://github.com/kohya-ss/sd-scripts/tree/sd3

Will this work for the NF4 model that was released yesterday?
Up to 4x speedups, reduced vram, increased quality.

https://civitai.com/models/638572/nf4-flux1
lllyasviel/stable-diffusion-webui-forge#981

@bghira

bghira commented Aug 12, 2024

You don't need an A100 for Flux. IMO kohya should release sooner rather than keep trying to add a million features. You can train on 16 GB VRAM without any quantisation at all.

@Tophness

Tophness commented Aug 12, 2024

> You don't need an A100 for Flux. IMO kohya should release sooner rather than keep trying to add a million features. You can train on 16 GB VRAM without any quantisation at all.

It did in other trainers such as your own, but yeah apparently not anymore.

The NF4 model is far superior though and more accessible for inference.
FP8 used to be virtually unusable on my 4080 because it'd take about 5-10 mins for 1 overquantized generation since it overloads my shared memory, and now it's <1 min for outputs that look on par with Pro.
Don't really wanna waste a week training an FP8 model that's already obsolete and can't be used by most people.

@bghira

bghira commented Aug 12, 2024

It's not like that at all, though. FP8 is fine, especially in PyTorch 2.4. You can read back through the comments in this issue to see.

@bghira

bghira commented Aug 12, 2024

also, NF4 is definitely not "on par with Pro" 🤪

@pyros-sd-models

pyros-sd-models commented Aug 14, 2024

I trained cfg back into flux.dev with a 6 hour LoRA accidentally.

LoRA ≠ full finetune

Are you speedrunning "how often can I be wrong in one thread"?

In the LLM world, "making a LoRA" and "fine-tuning" are essentially synonymous. For example, the most popular library for fine-tuning LLMs, https://github.com/unslothai/unsloth, only mentions fine-tunes, but it actually produces LoRAs. Crazy, right?

Somehow, only the Stable Diffusion community feels the need to differentiate between the two.

Mathematically, it doesn't really matter whether you change the weights directly in the model or through an adapter.

I always cringe when I read, "I prefer fine-tuning and then extracting a LoRA to directly training a LoRA," especially since Dora is a thing now. There's absolutely no reason to do a complete and expensive fine-tune if your goal is to create a LoRA.

But, well, the SD community is pretty big on bro-science. It won't take long until someone tries to convince me that fine-tuning > LoRA or some nonsense, because in their anecdotal N=2 experiment, one image was subjectively nicer than the other, and it just so happened to be a full moon.
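
The "mathematically it doesn't really matter" point is easy to verify numerically: a LoRA is just a low-rank delta that can be folded into the base weight, W' = W + (alpha/r)·B·A. A tiny self-contained check (toy shapes, nothing Flux-specific):

```python
# Toy demonstration that a LoRA update is equivalent to editing the weight matrix directly:
# W' = W + (alpha / r) * (up @ down), so merging the adapter reproduces the adapted layer.
import torch

torch.manual_seed(0)
d, r, alpha = 64, 8, 8.0
W = torch.randn(d, d)            # frozen base weight
down = torch.randn(r, d) * 0.01  # LoRA "down" projection (r x d)
up = torch.randn(d, r) * 0.01    # LoRA "up" projection (d x r)
x = torch.randn(4, d)

adapter_out = x @ W.T + (alpha / r) * (x @ down.T @ up.T)  # base + adapter path
merged_W = W + (alpha / r) * (up @ down)                   # fold the delta into the weights
merged_out = x @ merged_W.T

print(torch.allclose(adapter_out, merged_out, atol=1e-5))  # True
```

Extracting a LoRA from a full fine-tune is just running this in reverse (a low-rank approximation of the weight difference), which is why the two routes end up so close in practice.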

@b-7777777

b-7777777 commented Aug 15, 2024

The latest update to the sd3 branch script made this smooth sailing for 12 GB VRAM Flux LoRA training. I've got a 4-hour timer on 1600 steps at the recommended settings, and it's using at most 7 GB of my 12 GB GPU. BLESS YOU KOHYA!!! <3

@D3voz

D3voz commented Aug 15, 2024

For some reason, my Flux LoRA trained with kohya ss runs extremely slowly when used in ComfyUI. I get 200 s/it with the kohya Flux LoRA, while I get 1.8 s/it using other Flux LoRAs from Civitai.

@leonary

leonary commented Aug 16, 2024

> For some reason, my Flux LoRA trained with kohya ss runs extremely slowly when used in ComfyUI. I get 200 s/it with the kohya Flux LoRA, while I get 1.8 s/it using other Flux LoRAs from Civitai.

Is the slow inference possibly due to insufficient VRAM or RAM? You could try training LoRA with a lower dim value, or consider using a machine with more VRAM and RAM for inference. The fine-tuning results of Kohya's script are quite good; this is my result of fine-tuning Flux using Kohya's script, and there hasn't been any slowdown in inference speed.

[example image attached]

@D3voz

D3voz commented Aug 16, 2024

Yes, using my kohya-trained LoRA eats all my VRAM in fp8; I tried on 16 GB and 20 GB VRAM machines. Using other LoRAs takes around 12 GB of VRAM. I might have done something wrong. I did use 512 resolution, which gave me around 3 s/it during training on a 4060 Ti 16 GB. I used this: accelerate launch --mixed_precision bf16 --num_cpu_threads_per_process 1 flux_train_network.py --pretrained_model_name_or_path "E:/ComfyUI/models/checkpoints/flux1-dev.safetensors" --clip_l "E:/ComfyUI/models/clip/clip_l.safetensors" --t5xxl "E:/ComfyUI/models/clip/t5xxl_fp16.safetensors" --ae "E:/ComfyUI/models/vae/ae.safetensors" --cache_latents_to_disk --save_model_as safetensors --sdpa --persistent_data_loader_workers --max_data_loader_n_workers 2 --seed 42 --gradient_checkpointing --mixed_precision bf16 --save_precision bf16 --network_module networks.lora_flux --network_dim 4 --optimizer_type adafactor --optimizer_args "relative_step=False" "scale_parameter=False" "warmup_init=False" --learning_rate 1e-4 --network_train_unet_only --cache_text_encoder_outputs --cache_text_encoder_outputs_to_disk --fp8_base --highvram --max_train_epochs 100 --save_every_n_epochs 10 --sample_every_n_steps 200 --sample_prompts "D:/github/ee/pt.txt" --sample_sampler "euler" --dataset_config "D:/github/ee/fx.toml" --output_dir "D:/github/kohya_ss/outputs" --output_name flux-lora-name --timestep_sampling sigmoid --model_prediction_type raw --guidance_scale 1.0 --loss_type l2 (it might have changed a little if it gave an error, but this is what I saved in my text file to copy-paste again)

@leonary

leonary commented Aug 16, 2024

> Yes, using my kohya-trained LoRA eats all my VRAM in fp8; I tried on 16 GB and 20 GB VRAM machines. Using other LoRAs takes around 12 GB of VRAM.

That's strange because a LoRA with a dim value of 4 shouldn't consume more VRAM. Assuming that most of the LoRAs on Civitai are currently fine-tuned with Simple Tuner, and considering that both Simple Tuner and SD scripts are only training the Unet, the VRAM requirements should be similar if the file sizes are comparable. Therefore, the slowdown in inference speed is likely due to cached data in VRAM and RAM that hasn't been cleared. You could try restarting your computer to fully clear the VRAM and RAM, or use some ComfyUI plugins to unload the cached data from RAM and VRAM, and then test the inference speed of SD scripts' LoRA again.

@bghira

bghira commented Aug 16, 2024

LoRAs are fused at inference time in ComfyUI; there should be no slowdown.

@D3voz

D3voz commented Aug 16, 2024

For me it just goes out of memory when I use my LoRA trained with kohya. LoRA rank 4, alpha 1.

@leonary

leonary commented Aug 16, 2024

> For me it just goes out of memory when I use my LoRA trained with kohya. LoRA rank 4, alpha 1.

Would you mind sharing this lora? I would like to see how much video memory this lora would occupy on my machine.

@D3voz

D3voz commented Aug 16, 2024

Unfortunately the LoRA is a person LoRA (an AI person, but still not mine), so I cannot send it. I will train a celeb LoRA again tomorrow and will send the link. The only thing I can think of is that I cropped the images to 512x512, as I am using 512 resolution for training. It is sad, because the sampling images during training were very good and close to the actual look. Here, the first one is a training sampling image and the second one is a training image.
[comparison images attached]
Edit: The LoRA worked when used with Forge, loading the single fp8 version of the model. It takes just around 12 GB of VRAM. Forge does patch the LoRA; I am not sure if Comfy also needs that. The quality of the LoRA is very good, but the face likeness is not as strong as in the images generated during training sampling (maybe a Forge thing).

@ssube

ssube commented Aug 20, 2024

@D3voz the high memory usage with some LoRAs has been reported in the ComfyUI repo: comfyanonymous/ComfyUI#4343 . The slow inference is specific to Windows and the way it uses shared GPU memory, on Linux it simply runs out of memory (or since the recent updates, partially loads the LoRA with undefined results).

@DarkAlchy

Long thread, and I am late to the party. I want no part of SD3 or SAI, but do we now have other options we can train on, as good as Flux, that are truly open? I, along with a lot of lawyers, don't like their licence, but we are all waiting for clarification from them (if it comes). I have a 4090, and training at batch size 1 may be the thing that makes me throw in the towel on all this. A LoRA taking 90 minutes to train, without even the CLIP (which I badly need for what I do), is far too long. If this is the future, then I would rather rot in antiquity.

@ddpasa
Author

ddpasa commented Aug 21, 2024

> Long thread, and I am late to the party. I want no part of SD3 or SAI, but do we now have other options we can train on, as good as Flux, that are truly open? I, along with a lot of lawyers, don't like their licence, but we are all waiting for clarification from them (if it comes). I have a 4090, and training at batch size 1 may be the thing that makes me throw in the towel on all this. A LoRA taking 90 minutes to train, without even the CLIP (which I badly need for what I do), is far too long. If this is the future, then I would rather rot in antiquity.

SD3 is a terrible model with a very complicated legal mess of a license. Given how much better Flux.1 is, I see no reason to waste any of my time on SD3. Stability AI is a huge mess right now, and it's really good that new models are coming onto the scene.

@DarkAlchy

> Long thread, and I am late to the party. I want no part of SD3 or SAI, but do we now have other options we can train on, as good as Flux, that are truly open? I, along with a lot of lawyers, don't like their licence, but we are all waiting for clarification from them (if it comes). I have a 4090, and training at batch size 1 may be the thing that makes me throw in the towel on all this. A LoRA taking 90 minutes to train, without even the CLIP (which I badly need for what I do), is far too long. If this is the future, then I would rather rot in antiquity.

> SD3 is a terrible model with a very complicated legal mess of a license. Given how much better Flux.1 is, I see no reason to waste any of my time on SD3. Stability AI is a huge mess right now, and it's really good that new models are coming onto the scene.

I agree. I thought there was something else as I am not going back to SAI if I can help it.

@bghira

bghira commented Aug 21, 2024

flux and sd3 have the same license, but both are unenforceable anyway. just have fun and do less drama.

@DarkAlchy

You know, there are adults who make a living and don't wish to spend it all on lawyers and court costs defending themselves. It isn't about just popping out some waifu/husbando.

@bghira

bghira commented Aug 21, 2024

What does that have to do with this thread? Act like an adult if you are one. Some of us are researchers who don't care how open a model is and can make a living regardless.

@cosmicoxytocin

cosmicoxytocin commented Aug 21, 2024

> You know, there are adults who make a living and don't wish to spend it all on lawyers and court costs defending themselves. It isn't about just popping out some waifu/husbando.

I can assure you, researchers in the field of image-synthesis (outside Medical) are not moral busybodies opposed to waifus and husbandos.
Regardless, your opinion is irrelevant to the discussion. Go write a tumblr blog about it, if you must.

@bash-j

bash-j commented Aug 22, 2024

Traceback (most recent call last):
  File "/home/mikey/kohya_ss/sd-scripts/finetune/prepare_buckets_latents.py", line 286, in <module>
    main(args)
  File "/home/mikey/kohya_ss/sd-scripts/finetune/prepare_buckets_latents.py", line 89, in main
    vae = model_util.load_vae(args.model_name_or_path, weight_dtype)
  File "/home/mikey/kohya_ss/sd-scripts/library/model_util.py", line 1304, in load_vae
    converted_vae_checkpoint = convert_ldm_vae_checkpoint(vae_sd, vae_config)
  File "/home/mikey/kohya_ss/sd-scripts/library/model_util.py", line 429, in convert_ldm_vae_checkpoint
    new_checkpoint["quant_conv.weight"] = vae_state_dict["quant_conv.weight"]

What is this quant_conv.weight it is trying to find in the VAE? These are keys from the SDXL VAE, no? Not Flux. I can't see it in the file. It also looks like it's ignoring the path I provided to the VAE file and is trying to load it from the Flux model file.

@popovidis

Flux would be awesome

@kohya-ss
Owner

> What is this quant_conv.weight it is trying to find in the VAE? These are keys from the SDXL VAE, no? Not Flux. I can't see it in the file. It also looks like it's ignoring the path I provided to the VAE file and is trying to load it from the Flux model file.

Sorry, prepare_buckets_latents.py doesn't support FLUX yet.

@bghira

bghira commented Aug 22, 2024

4 hours for 1600 steps is really, really slow. You can rent a $0.20/hr 4090 and train 1600 steps in one hour, for less than $1. It probably costs you more per kWh than it would to rent a cloud GPU:

Epoch 5/8, Steps:  99%|██████████▊| 9946/10000 [11:43:08<02:46,  3.09s/it, lr=6e-5, step_loss=0.384]

training on 10x 3090s for $2.20/hr. total = $25

thanks runpod.

@hablaba

hablaba commented Aug 22, 2024

I tried to see if I can train Lora with Prodigy and it appears to work… weirdly only using ~17GB of VRAM, so not much more than when I was using AdamW8bit. That seems… wrong. Is there anything I’m missing on why Prodigy would “work” but potentially not be doing what I expect?

@hablaba

hablaba commented Aug 23, 2024

Well, I can confirm training with Prodigy works great. I'm just using the same settings I'd use for SDXL: d_coef 2, betas 0.9,0.999, weight_decay 0.01, no warmup steps. I was able to get a great LoRA of my dog at 1000 steps in about 35 minutes on a 4090 (~2.2 s/it). All other settings are the same as the default recommendation in the sd3 readme (except LR is 1, of course). I also used dim 16 and alpha 16 to match, to remove any scaling. It only used 17 GB of VRAM.

I got better results than my attempted training with AdamW8bit (constant schedule) for 3000 steps at 1e-4 and 1e-3. Those runs seemed both undertrained (sometimes produced other unrelated things) and overfit (realistic style when prompting line art or comic book style).
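
In case it saves someone some digging, those settings translate into sd-scripts arguments roughly as below. This is my best guess at the mapping, based on the full command posted earlier in this thread; the paths are placeholders and the exact flag set may differ on the current sd3 branch:

```
accelerate launch flux_train_network.py \
  --pretrained_model_name_or_path /path/to/flux1-dev.safetensors \
  --clip_l /path/to/clip_l.safetensors --t5xxl /path/to/t5xxl_fp16.safetensors \
  --ae /path/to/ae.safetensors --dataset_config /path/to/dataset.toml \
  --network_module networks.lora_flux --network_dim 16 --network_alpha 16 \
  --optimizer_type prodigy \
  --optimizer_args "d_coef=2" "betas=0.9,0.999" "weight_decay=0.01" \
  --learning_rate 1.0 --lr_scheduler constant --lr_warmup_steps 0 \
  --network_train_unet_only --cache_text_encoder_outputs --gradient_checkpointing \
  --mixed_precision bf16 --save_precision bf16 --max_train_steps 1000
```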

@markrmiller

markrmiller commented Aug 28, 2024

Just a note for anyone else who has been trying this or is just starting: at least for me, if I go with full bf16 and the fused backward pass option, I get messed-up hands and then distorted body parts or bodies very quickly, regardless of learning rate. Without full bf16 and the fused optimizer option, I don't get that. Surprisingly, the latter also appears to use less VRAM, certainly not more, and is faster. Of course, keep in mind things are changing, but that's been my experience over the last few days.

This is for a full fine-tune, by the way, not a LoRA. I have not tried a LoRA yet with this repo.

@oovm

oovm commented Sep 1, 2024

Can you add a script to quantize a checkpoint (BF16) to NF4?
Many users do not have powerful hardware, and I hope my model can be used by more people.
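
In the meantime, the closest workaround I'm aware of is loading the BF16 transformer as NF4 via diffusers' bitsandbytes integration. This is only a sketch (it assumes a recent diffusers and bitsandbytes; the paths and the save step are illustrative, not an official sd-scripts tool):

```python
# Sketch: load the BF16 Flux transformer as NF4 (4-bit) via diffusers' bitsandbytes
# integration, then save the quantized weights. Assumes recent diffusers + bitsandbytes;
# ids/paths are placeholders, and this is not an official conversion script.
import torch
from diffusers import BitsAndBytesConfig, FluxTransformer2DModel

nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

transformer = FluxTransformer2DModel.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    subfolder="transformer",
    quantization_config=nf4_config,
    torch_dtype=torch.bfloat16,
)

# Re-saving pre-quantized 4-bit weights may itself need a recent diffusers; otherwise
# users can simply load with the same quantization_config on their side.
transformer.save_pretrained("flux1-dev-transformer-nf4")
```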

@wogam

wogam commented Sep 1, 2024

> I tried to see if I can train Lora with Prodigy and it appears to work… weirdly only using ~17GB of VRAM, so not much more than when I was using AdamW8bit. That seems… wrong. Is there anything I’m missing on why Prodigy would “work” but potentially not be doing what I expect?

What settings are you using to get such low VRAM usage? AdamW8bit training with 512px images is using almost 24 GB for me.

Edit: if you have any caption dropout, the text encoder outputs can't be cached, so the text encoders are loaded into GPU memory during training, which leads to the high GPU usage.

@FurkanGozukara

> I tried to see if I can train Lora with Prodigy and it appears to work… weirdly only using ~17GB of VRAM, so not much more than when I was using AdamW8bit. That seems… wrong. Is there anything I’m missing on why Prodigy would “work” but potentially not be doing what I expect?

> What settings are you using to get such low VRAM usage? AdamW8bit training with 512px images is using almost 24 GB for me.

> Edit: if you have any caption dropout, the text encoder outputs can't be cached, so the text encoders are loaded into GPU memory during training, which leads to the high GPU usage.

With Adafactor and 512px I go as low as 7.5 GB: https://youtu.be/nySGu12Y05k

@iamrohitanshu

iamrohitanshu commented Sep 3, 2024

@wogam "The training can be done with 12GB VRAM GPUs with Adafactor optimizer, --split_mode and train_blocks=single options." according to the readme file.
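
For completeness, that readme note maps onto the training command roughly like this (a sketch with placeholder paths, adapted from the full command posted earlier in the thread; check the sd3 branch readme for the exact current flags):

```
accelerate launch flux_train_network.py \
  --pretrained_model_name_or_path /path/to/flux1-dev.safetensors \
  --clip_l /path/to/clip_l.safetensors --t5xxl /path/to/t5xxl_fp16.safetensors \
  --ae /path/to/ae.safetensors --dataset_config /path/to/dataset.toml \
  --network_module networks.lora_flux --network_dim 4 \
  --optimizer_type adafactor \
  --optimizer_args "relative_step=False" "scale_parameter=False" "warmup_init=False" \
  --split_mode --network_args "train_blocks=single" \
  --fp8_base --gradient_checkpointing --cache_text_encoder_outputs \
  --network_train_unet_only --mixed_precision bf16 --save_precision bf16
```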
