Hi all,
I am following the instructions to install PixArt-Sigma on my local Linux server. At step 1.3, "You are ready to train!", I am now getting the following error:
python3 -m torch.distributed.launch --nproc_per_node=1 --master_port=12345 train_scripts/train.py configs/pixart_sigma_config/PixArt_sigma_xl2_img512_internalms.py --load-from output/pretrained_models/PixArt-Sigma-XL-2-512-MS.pth --work-dir output/your_first_pixart-exp --debug
/home/test/.local/lib/python3.10/site-packages/torch/distributed/launch.py:181: FutureWarning: The module torch.distributed.launch is deprecated
and will be removed in future. Use torchrun.
Note that --use-env is set by default in torchrun.
If your script expects `--local-rank` argument to be set, please
change it to read from `os.environ['LOCAL_RANK']` instead. See
https://pytorch.org/docs/stable/distributed.html#launch-utility for
further instructions
warnings.warn(
/home/test/.local/lib/python3.10/site-packages/mmcv/__init__.py:20: UserWarning: On January 1, 2023, MMCV will release v2.0.0, in which it will remove components related to the training process and add a data transformation module. In addition, it will rename the package names mmcv to mmcv-lite and mmcv-full to mmcv. See https://github.com/open-mmlab/mmcv/blob/master/docs/en/compatibility.md for more details.
warnings.warn(
2024-08-18 10:48:52,906 - PixArt - INFO - Distributed environment: MULTI_GPU Backend: nccl
Num processes: 1
Process index: 0
Local process index: 0
Device: cuda:0
Mixed precision type: fp16
2024-08-18 10:48:52,941 - PixArt - INFO - Config:
data_root = 'pixart-sigma-toy-dataset'
data = dict(
    type='InternalDataMSSigma',
    root='InternData',
    image_list_json=['data_info.json'],
    transform='default_train',
    load_vae_feat=False,
    load_t5_feat=False)
image_size = 512
train_batch_size = 2
eval_batch_size = 16
use_fsdp = False
valid_num = 0
fp32_attention = True
model = 'PixArtMS_XL_2'
aspect_ratio_type = 'ASPECT_RATIO_512'
multi_scale = True
pe_interpolation = 1.0
qk_norm = False
kv_compress = False
kv_compress_config = dict(sampling=None, scale_factor=1, kv_compress_layer=[])
num_workers = 10
train_sampling_steps = 1000
visualize = True
deterministic_validation = False
eval_sampling_steps = 500
model_max_length = 300
lora_rank = 4
num_epochs = 10
gradient_accumulation_steps = 1
grad_checkpointing = True
gradient_clip = 0.01
gc_step = 1
auto_lr = dict(rule='sqrt')
validation_prompts = [
    'dog',
    'portrait photo of a girl, photograph, highly detailed face, depth of field',
    'Self-portrait oil painting, a beautiful cyborg with golden hair, 8k',
    'Astronaut in a jungle, cold color palette, muted colors, detailed, 8k',
    'A photo of beautiful mountain with realistic sunset and blue lake, highly detailed, masterpiece'
]
optimizer = dict(
    type='CAMEWrapper',
    lr=2e-05,
    weight_decay=0.0,
    eps=(1e-30, 1e-16),
    betas=(0.9, 0.999, 0.9999))
lr_schedule = 'constant'
lr_schedule_args = dict(num_warmup_steps=1000)
save_image_epochs = 1
save_model_epochs = 5
save_model_steps = 2500
sample_posterior = True
mixed_precision = 'fp16'
scale_factor = 0.13025
ema_rate = 0.9999
tensorboard_mox_interval = 50
log_interval = 1
cfg_scale = 4
mask_type = 'null'
num_group_tokens = 0
mask_loss_coef = 0.0
load_mask_index = False
vae_pretrained = 'output/pretrained_models/pixart_sigma_sdxlvae_T5_diffusers/vae'
load_from = 'output/pretrained_models/PixArt-Sigma-XL-2-512-MS.pth'
resume_from = None
snr_loss = False
real_prompt_ratio = 0.5
class_dropout_prob = 0.1
work_dir = 'output/your_first_pixart-exp'
s3_work_dir = None
micro_condition = False
seed = 43
skip_step = 0
loss_type = 'huber'
huber_c = 0.001
num_ddim_timesteps = 50
w_max = 15.0
w_min = 3.0
ema_decay = 0.95
image_list_json = ['data_info.json']
2024-08-18 10:48:52,941 - PixArt - INFO - World_size: 1, seed: 43
2024-08-18 10:48:52,941 - PixArt - INFO - Initializing: DDP for training
You are using the default legacy behaviour of the <class 'transformers.models.t5.tokenization_t5.T5Tokenizer'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thouroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████| 2/2 [00:12<00:00, 6.28s/it]
Traceback (most recent call last):
File "/home/test/AI/PixArt-sigma/train_scripts/train.py", line 359, in <module>
args.pipeline_load_from, subfolder="text_encoder", torch_dtype=torch.float16).to(accelerator.device)
File "/home/test/.local/lib/python3.10/site-packages/transformers/modeling_utils.py", line 2460, in to
return super().to(*args, **kwargs)
File "/home/test/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1145, in to
return self._apply(convert)
File "/home/test/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 797, in _apply
module._apply(fn)
File "/home/test/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 797, in _apply
module._apply(fn)
File "/home/test/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 797, in _apply
module._apply(fn)
[Previous line repeated 4 more times]
File "/home/test/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 820, in _apply
param_applied = fn(param)
File "/home/test/.local/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1143, in convert
return t.to(device, dtype if t.is_floating_point() or t.is_complex() else None, non_blocking)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 80.00 MiB (GPU 0; 7.78 GiB total capacity; 7.17 GiB already allocated; 69.38 MiB free; 7.18 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 8550) of binary: /usr/bin/python3
Traceback (most recent call last):
File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/home/test/.local/lib/python3.10/site-packages/torch/distributed/launch.py", line 196, in <module>
main()
File "/home/test/.local/lib/python3.10/site-packages/torch/distributed/launch.py", line 192, in main
launch(args)
File "/home/test/.local/lib/python3.10/site-packages/torch/distributed/launch.py", line 177, in launch
run(args)
File "/home/test/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
elastic_launch(
File "/home/test/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/test/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 250, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
train_scripts/train.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-08-18_10:49:19
host : test-MD34764
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 8550)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
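As an aside, the FutureWarning at the top of the log says torch.distributed.launch is deprecated in favour of torchrun. If I read it correctly, the equivalent invocation would be the one below (this is just my guess at the translation; it should not change the out-of-memory behaviour, only silence the warning):

# Same script and arguments as above, launched via torchrun instead of torch.distributed.launch
torchrun --nproc_per_node=1 --master_port=12345 \
    train_scripts/train.py configs/pixart_sigma_config/PixArt_sigma_xl2_img512_internalms.py \
    --load-from output/pretrained_models/PixArt-Sigma-XL-2-512-MS.pth \
    --work-dir output/your_first_pixart-exp --debug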
For context, the GPU is an NVIDIA RTX 2060 Super with 8 GB of VRAM.
I understand this is a memory-allocation issue, but I am not sure how to resolve it, and (at least in my mind) it should not happen in the first place :)
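For what it's worth, the traceback shows the OOM happens while train.py moves the text encoder onto the GPU (the .to(accelerator.device) call at line 359), and the error message itself suggests experimenting with PYTORCH_CUDA_ALLOC_CONF. This is what I was planning to try next; the max_split_size_mb value of 128 is just a guess on my part, not something from the repo docs:

# My own guess: try to reduce allocator fragmentation, as hinted by the error message,
# then re-run the original command unchanged.
PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128 \
python3 -m torch.distributed.launch --nproc_per_node=1 --master_port=12345 \
    train_scripts/train.py configs/pixart_sigma_config/PixArt_sigma_xl2_img512_internalms.py \
    --load-from output/pretrained_models/PixArt-Sigma-XL-2-512-MS.pth \
    --work-dir output/your_first_pixart-exp --debug

I also see that the config has load_vae_feat / load_t5_feat flags (both False) and train_batch_size = 2. Would precomputing those features or dropping the batch size to 1 make a difference here, or is 8 GB simply too little for the T5 text encoder in fp16?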
Can anyone help, please?
Thanks a lot in advance.