flux lora training does not start #1589

Open
nim00e opened this issue Sep 10, 2024 · 3 comments

nim00e commented Sep 10, 2024

```
running training / 学習開始
num train images * repeats / 学習画像の数×繰り返し回数: 3
num reg images / 正則化画像の数: 0
num batches per epoch / 1epochのバッチ数: 3
num epochs / epoch数: 250
batch size per device / バッチサイズ: 1
gradient accumulation steps / 勾配を合計するステップ数 = 1
total optimization steps / 学習ステップ数: 750
steps: 0%| | 0/750 [00:00<?, ?it/s]2024-09-10 15:23:27 INFO unet dtype: torch.float16, device: cpu train_network.py:1046
INFO text_encoder [0] dtype: torch.float16, device: cpu train_network.py:1052
INFO text_encoder [1] dtype: torch.float16, device: cpu train_network.py:1052

epoch 1/250
INFO epoch is incremented. current_epoch: 0, epoch: 1 train_util.py:668
Traceback (most recent call last):
File "/workspace/nima_workspace/kohya_ss/sd-scripts/flux_train_network.py", line 519, in
trainer.train(args)
File "/workspace/nima_workspace/kohya_ss/sd-scripts/train_network.py", line 1141, in train
noise_pred, target, timesteps, huber_c, weighting = self.get_noise_pred_and_target(
File "/workspace/nima_workspace/kohya_ss/sd-scripts/flux_train_network.py", line 380, in get_noise_pred_and_target
model_pred = unet(
File "/workspace/miniconda3/envs/kohya/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/workspace/miniconda3/envs/kohya/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/workspace/nima_workspace/kohya_ss/sd-scripts/library/flux_models.py", line 1008, in forward
img = self.img_in(img)
File "/workspace/miniconda3/envs/kohya/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/workspace/miniconda3/envs/kohya/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/workspace/miniconda3/envs/kohya/lib/python3.10/site-packages/torch/nn/modules/linear.py", line 117, in forward
return F.linear(input, self.weight, self.bias)
RuntimeError: mat1 and mat2 must have the same dtype, but got Float and Half
steps: 0%| | 0/750 [02:24<?, ?it/s]
Traceback (most recent call last):
File "/workspace/miniconda3/envs/kohya/bin/accelerate", line 8, in
sys.exit(main())
File "/workspace/miniconda3/envs/kohya/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
args.func(args)
File "/workspace/miniconda3/envs/kohya/lib/python3.10/site-packages/accelerate/commands/launch.py", line 1106, in launch_command
simple_launcher(args)
File "/workspace/miniconda3/envs/kohya/lib/python3.10/site-packages/accelerate/commands/launch.py", line 704, in simple_launcher
raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/workspace/miniconda3/envs/kohya/bin/python', '/workspace/nima_workspace/kohya_ss/sd-scripts/flux_train_network.py', '--config_file', '/workspace/nima_workspace/kohya_ss/experiments/test_lora/img_2/model/config_lora-20240910-152248.toml']' returned non-zero exit status 1.
```

The training does not start and exits with the above error.

I see that torch.cuda.is_available() returns True, so why is the device for the text encoder and unet cpu?
Any help is appreciated.
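
For what it's worth, the RuntimeError at the bottom of the traceback is just a dtype mismatch at a Linear layer: the input tensor reaches `img_in` in float32 while the model weights are float16. A minimal standalone sketch (not using the kohya-ss code, shapes made up) that reproduces the same message:

```python
import torch
import torch.nn as nn

layer = nn.Linear(8, 8).half()  # weights in float16, like "unet dtype: torch.float16" in the log
x = torch.randn(2, 8)           # input left in float32
layer(x)                        # RuntimeError: mat1 and mat2 must have the same dtype, but got Float and Half
```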

kohya-ss (Owner) commented

Your accelerate config seems to be set to CPU. Please run `accelerate config` again to use the GPU.
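
A quick way to check what Accelerate resolves to (just a sketch, assuming it is run inside the same kohya conda environment) is to instantiate an `Accelerator` and print its device:

```python
import torch
from accelerate import Accelerator

print(torch.cuda.is_available())  # reported as True in your case
acc = Accelerator()
print(acc.device)                 # should be cuda:0; "cpu" means the config still forces CPU
print(acc.mixed_precision)        # "no", "fp16" or "bf16", depending on the config
```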

nim00e (Author) commented Sep 12, 2024

I set the accelerate config to GPU and gave [all] for the GPU IDs, but I am still facing this issue.

kohya-ss (Owner) commented

```
steps: 0%| | 0/750 [00:00<?, ?it/s]2024-09-10 15:23:27 INFO unet dtype: torch.float16, device: cpu train_network.py:1046
```

This line shows the U-Net (DiT) is on the CPU. Did the output of this line change?
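
It may also be worth looking at the saved config directly (an assumption on my side, not something from the log): by default Accelerate writes it to `~/.cache/huggingface/accelerate/default_config.yaml`, so a quick check from Python is:

```python
from pathlib import Path

# Default location of the Accelerate config (may differ if HF_HOME is set)
cfg = Path.home() / ".cache" / "huggingface" / "accelerate" / "default_config.yaml"
print(cfg.read_text())  # look for "use_cpu: true" or a missing/empty gpu_ids entry
```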
