Fix multi-node environment training and accelerator related codes + skip file check option #1246
base: main
Conversation
I hate that no one has tested this before?
fix bug from wandb fix
Thank you for this! I didn't use multi-node training, but this seems to be good.
Hi @kohya-ss, it's @GrigoryEvko here. My dev branch (a bit outdated) with these updates is here: dev...evkogs:sd-scripts:dev. I didn't try to save the training state with this PR, so maybe #1340 is required as well. I can test and create a new PR against the latest dev to merge into, but it would be most useful to merge flux, sd3, and this PR into dev first; I can help a bit with those too.
The Accelerator setup (and related code) used a loop with an explicit local process index check instead of a global process index check, which caused multi-node training to hang forever.
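The failure mode can be sketched with a toy model (the function and the rank layout below are illustrative, not sd-scripts code). With 2 nodes of 2 GPUs each, global ranks 0–3 map to local ranks (0, 1, 0, 1), so an `index == 0` guard on the local index admits one process per node instead of exactly one process overall:

```python
def ranks_entering_block(world_size: int, gpus_per_node: int, use_local_index: bool):
    """Return the global ranks that pass an `index == 0` guard."""
    entering = []
    for rank in range(world_size):
        local_rank = rank % gpus_per_node  # index of this process on its own node
        index = local_rank if use_local_index else rank
        if index == 0:
            entering.append(rank)
    return entering

# Buggy guard: one process *per node* enters (global ranks 0 and 2). Any collective
# operation inside the guarded block then waits for ranks 1 and 3, which never
# arrive, so the whole job hangs.
print(ranks_entering_block(4, 2, use_local_index=True))

# Fixed guard: only the single global main process enters, as the rest of the
# code expects.
print(ranks_entering_block(4, 2, use_local_index=False))
```

In Accelerate terms this corresponds to gating on `accelerator.is_local_main_process` where `accelerator.is_main_process` was intended, which is harmless on a single node (the two coincide) and only breaks once a second node appears.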
After struggling with the code for weeks, the Slurm batch script now works for multi-node training, at least for sdxl train_network and sdxl train (finetune).
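For reference, a multi-node Slurm launch for `accelerate` typically looks something like the sketch below. This is a hedged illustration, not the script used for the log above: the node/GPU counts, port, script name, and config file are placeholders to adjust for your cluster.

```shell
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --gpus-per-node=8

# First hostname in the allocation serves as the rendezvous address.
MAIN_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)

srun accelerate launch \
  --multi_gpu \
  --num_machines "$SLURM_NNODES" \
  --num_processes $((SLURM_NNODES * 8)) \
  --machine_rank "$SLURM_NODEID" \
  --main_process_ip "$MAIN_ADDR" \
  --main_process_port 29500 \
  sdxl_train_network.py --config_file config.toml
```

With one `srun` task per node, `accelerate launch` spawns the per-GPU processes itself; `--machine_rank` must differ per node, which is why it is read from the per-task `SLURM_NODEID` rather than hard-coded.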
Success log:
Also, the

skip_file_existence_check = true

option is added to skip the file verification step at training start. It should only be enabled when all files are known to be usable, since it bypasses the os.path.exists() check for every file.
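The behaviour the option describes can be sketched as follows (the function name and surrounding shape are hypothetical, not the actual sd-scripts code; only the option name and the `os.path.exists()` check come from the PR description):

```python
import os

def verify_image_files(image_paths, skip_file_existence_check=False):
    """Return the usable paths, optionally skipping the on-disk existence check."""
    if skip_file_existence_check:
        # Trust the dataset metadata: every listed file is assumed to exist.
        return list(image_paths)
    # Default behaviour: keep only entries that pass os.path.exists().
    return [p for p in image_paths if os.path.exists(p)]
```

Skipping the check saves one stat() call per file, which can noticeably shorten startup for datasets with millions of images on a network filesystem, at the cost of failing later (mid-training) if a listed file is actually missing.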