
Fix multi-node environment training and accelerator related codes + skip file check option #1246

Open · wants to merge 15 commits into main

Conversation

@aria1th commented Apr 8, 2024

The Accelerator setup and related code looped with an explicit local-process-index check instead of a (global) process-index check, which caused multi-node training to hang forever.
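As a rough sketch of the pattern (the actual sd-scripts code differs, and load_model() is only a placeholder): each process takes its turn loading the model, but gating the turn on accelerator.local_process_index only works on a single node, since local indices repeat on every node; gating on the global accelerator.process_index also works for multi-node.

from accelerate import Accelerator

accelerator = Accelerator()

# Staggered loading: one process loads at a time to limit disk/RAM pressure.
for pi in range(accelerator.num_processes):
    # Broken on multi-node: `pi == accelerator.local_process_index` matches one
    # process per *node* (and never matches for pi >= GPUs-per-node), so the
    # processes fall out of sync.
    if pi == accelerator.process_index:  # global index: exactly one process per turn
        load_model()  # placeholder for the actual model loading
    accelerator.wait_for_everyone()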

After struggling with the code for weeks, the following Slurm batch script works for multi-node training, at least for sdxl_train_network and sdxl_train (fine-tuning).

#!/bin/bash
#SBATCH --job-name=multinode
#SBATCH --output=O-%x.%j
#SBATCH --error=E-%x.%j
#SBATCH --partition=<PARTITION>
#SBATCH --nodes=3                   # number of nodes
#SBATCH --gres=gpu:4              # number of GPUs per node
#SBATCH --time=72:00:00             # maximum execution time (HH:MM:SS)
#SBATCH --cpus-per-gpu=16
#SBATCH --qos=<QOS_NAME>


######################
### Set environment ###
######################
# Activate your Python environment
conda init
conda activate kohya
unset LD_LIBRARY_PATH
# Change to the directory containing your script
cd ~/large_train/sd-scripts
gpu_count=$(scontrol show job $SLURM_JOB_ID | grep -oP 'TRES=.*?gpu=\K(\d+)' | head -1)
######################
#### Set network #####
######################
head_node_ip=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
PORT=29508 # set this to unused port
######################
export TORCH_DISTRIBUTED_DEBUG=INFO
export NCCL_DEBUG=INFO
export NCCL_P2P_DISABLE=1 # Disable P2P for a general multi-node setup
# export NCCL_DEBUG=INFO
# export NCCL_DEBUG_SUBSYS=ALL
# export TORCH_DISTRIBUTED_DEBUG=INFO
#######################
export NCCL_ASYNC_ERROR_HANDLING=0
# export NCCL_DEBUG=INFO
# export NCCL_DEBUG_SUBSYS=COLL
# export NCCL_SOCKET_NTHREADS=1
# export NCCL_NSOCKS_PERTHREAD=1
# export CUDA_LAUNCH_BLOCKING=1
#######################
echo "SLURM_JOB_NODELIST is $SLURM_JOB_NODELIST"
node_name=$(echo $SLURM_JOB_NODELIST | sed 's/node-list: //' | cut -d, -f1)
MASTER_ADDR=$(getent ahosts $node_name | head -n 1 | awk '{print $1}')

export SCRIPT="$HOME/large_train/sd-scripts/sdxl_train.py"
export SCRIPT_ARGS=" \
    --config_file $HOME/train_config_a6000_multinode.toml"

# for each nodes, set machine_rank int and launch
for node in $(scontrol show hostnames $SLURM_JOB_NODELIST); do
    export RANK=0
    export LOCAL_RANK=0
    export WORLD_SIZE=$gpu_count
    export MASTER_ADDR=$head_node_ip
    export MASTER_PORT=$PORT
    # 0-based index of this node in the node list, used as machine_rank
    export NODE_RANK=$(($(scontrol show hostnames $SLURM_JOB_NODELIST | grep -n -F -x "$node" | cut -d: -f1) - 1))
    export LAUNCHER="/home/usr/miniconda3/envs/kohya/bin/accelerate launch \
        --num_processes $gpu_count \
        --num_machines $SLURM_NNODES \
        --rdzv_backend c10d \
        --main_process_ip $head_node_ip \
        --main_process_port $PORT \
        --machine_rank $NODE_RANK"
    echo "node: $node, rank: $RANK, local_rank: $LOCAL_RANK, world_size: $WORLD_SIZE, master_addr: $MASTER_ADDR, master_port: $MASTER_PORT, node_rank: $NODE_RANK"
    CMD="$LAUNCHER $SCRIPT $SCRIPT_ARGS"
    srun --nodes=1 --ntasks=1 --ntasks-per-node=1 --nodelist=$node $CMD &
done

wait

Success log:

2024-04-08 16:48:47 INFO     Accelerator prepared at cuda:1 / process index : 4, local process index : 1   sdxl_train.py:203
                    INFO     Waiting for everyone / 他のプロセスを待機中                                     sdxl_train.py:204
2024-04-08 16:48:48 INFO     Accelerator prepared at cuda:0 / process index : 4, local process index : 0   sdxl_train.py:203
                    INFO     Waiting for everyone / 他のプロセスを待機中                                     sdxl_train.py:204
...
2024-04-08 16:48:50 INFO     All processes are ready / すべてのプロセスが準備完了                            sdxl_train.py:206
                    INFO     loading model for process 1 3/4                                               sdxl_train_util.py:28
...
2024-04-08 16:49:01 INFO     model loaded for all processes 0 2 /4                                         sdxl_train_util.py:56

steps:   0%|          | 1/643744 [02:04<22266:39:32, 124.52s/it]
steps:   0%|          | 1/643744 [02:04<22266:50:43, 124.52s/it, avr_loss=0.0703]
steps:   0%|          | 2/643744 [02:15<12092:54:30, 67.63s/it, avr_loss=0.0703] 
steps:   0%|          | 2/643744 [02:15<12092:58:25, 67.63s/it, avr_loss=0.0918]
steps:   0%|          | 3/643744 [02:25<8695:59:50, 48.63s/it, avr_loss=0.0918] 
steps:   0%|          | 3/643744 [02:25<8696:01:52, 48.63s/it, avr_loss=0.0919]

....

Also, a skip_file_existence_check = true option is added to skip the file-verification step at the start of training.

This should only be enabled when all files are known to be usable, since it bypasses the os.path.exists() check for every file.
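Conceptually, the option behaves like the following sketch (the function and argument names here are illustrative, not the exact sd-scripts code):

import os

def check_dataset_files(image_paths, skip_file_existence_check=False):
    if skip_file_existence_check:
        # Trust the caller: assume every listed file exists and is readable.
        return list(image_paths)
    # Default: verify each file on disk, which can be slow for large datasets
    # on network storage.
    return [p for p in image_paths if os.path.exists(p)]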

@kohya-ss (Owner) commented Apr 8, 2024

Thank you for this! I don't use multi-node training myself, but this looks good.

@evkogs commented Aug 11, 2024

Hi @kohya-ss, it's @GrigoryEvko here.
I used this PR on 3 nodes of 8 A100s each two months ago; it works fine and can be merged.
I feel it is even more relevant for FLUX model training now than it was before.

My dev branch (a bit outdated) with these updates is here: dev...evkogs:sd-scripts:dev

I didn't try saving the training state with this PR; maybe #1340 is required as well.

I can test and create a new PR against the latest dev to merge into, but it would be most useful to merge flux, sd3, and this PR into dev; I can help a bit with those too.
