
Fix multi-node environment training and accelerator related codes + skip file check option #1246

Open · wants to merge 15 commits into main

Conversation

@aria1th commented Apr 8, 2024

The Accelerator setup and related code looped with an explicit local-process-index check instead of a (global) process-index check, which caused multi-node training to hang forever.
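As a rough sketch of the pattern (the actual sd-scripts code differs, and load_model() is only a placeholder): each process takes its turn loading the model, but gating the turn on accelerator.local_process_index only works on a single node, since local indices repeat on every node; gating on the global accelerator.process_index also works for multi-node.

from accelerate import Accelerator

accelerator = Accelerator()

# Staggered loading: one process loads at a time to limit disk/RAM pressure.
for pi in range(accelerator.num_processes):
    # Broken on multi-node: `pi == accelerator.local_process_index` matches one
    # process per *node* (and never matches for pi >= GPUs-per-node), so the
    # processes fall out of sync.
    if pi == accelerator.process_index:  # global index: exactly one process per turn
        load_model()  # placeholder for the actual model loading
    accelerator.wait_for_everyone()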

After struggling with the code for weeks, the following Slurm batch script works for multi-node training, at least for sdxl_train_network and sdxl_train (fine-tuning).

#!/bin/bash
#SBATCH --job-name=multinode
#SBATCH --output=O-%x.%j
#SBATCH --error=E-%x.%j
#SBATCH --partition=<PARTITION>
#SBATCH --nodes=3                   # number of nodes
#SBATCH --gres=gpu:4              # number of GPUs per node
#SBATCH --time=72:00:00             # maximum execution time (HH:MM:SS)
#SBATCH --cpus-per-gpu=16
#SBATCH --qos=<QOS_NAME>


######################
### Set environment ###
######################
# Activate your Python environment
conda init
conda activate kohya
unset LD_LIBRARY_PATH
# Change to the directory containing your script
cd ~/large_train/sd-scripts
gpu_count=$(scontrol show job $SLURM_JOB_ID | grep -oP 'TRES=.*?gpu=\K(\d+)' | head -1)
######################
#### Set network #####
######################
head_node_ip=$(scontrol show hostnames $SLURM_JOB_NODELIST | head -n 1)
PORT=29508 # set this to unused port
######################
export TORCH_DISTRIBUTED_DEBUG=INFO
export NCCL_DEBUG=INFO
export NCCL_P2P_DISABLE=1 # Disable P2P for a general multi-node setup
# export NCCL_DEBUG=INFO
# export NCCL_DEBUG_SUBSYS=ALL
# export TORCH_DISTRIBUTED_DEBUG=INFO
#######################
export NCCL_ASYNC_ERROR_HANDLING=0
# export NCCL_DEBUG=INFO
# export NCCL_DEBUG_SUBSYS=COLL
# export NCCL_SOCKET_NTHREADS=1
# export NCCL_NSOCKS_PERTHREAD=1
# export CUDA_LAUNCH_BLOCKING=1
#######################
echo "SLURM_JOB_NODELIST is $SLURM_JOB_NODELIST"
node_name=$(echo $SLURM_JOB_NODELIST | sed 's/node-list: //' | cut -d, -f1)
MASTER_ADDR=$(getent ahosts $node_name | head -n 1 | awk '{print $1}')

export SCRIPT="$HOME/large_train/sd-scripts/sdxl_train.py"
export SCRIPT_ARGS=" \
    --config_file $HOME/train_config_a6000_multinode.toml"

# for each nodes, set machine_rank int and launch
for node in $(scontrol show hostnames $SLURM_JOB_NODELIST); do
    export RANK=0
    export LOCAL_RANK=0
    export WORLD_SIZE=$gpu_count
    export MASTER_ADDR=$head_node_ip
    export MASTER_PORT=$PORT
    # 0-based index of this node in the node list, used as machine_rank
    export NODE_RANK=$(($(scontrol show hostnames $SLURM_JOB_NODELIST | grep -n -F -x "$node" | cut -d: -f1) - 1))
    export LAUNCHER="/home/usr/miniconda3/envs/kohya/bin/accelerate launch \
        --num_processes $gpu_count \
        --num_machines $SLURM_NNODES \
        --rdzv_backend c10d \
        --main_process_ip $head_node_ip \
        --main_process_port $PORT \
        --machine_rank $NODE_RANK"
    echo "node: $node, rank: $RANK, local_rank: $LOCAL_RANK, world_size: $WORLD_SIZE, master_addr: $MASTER_ADDR, master_port: $MASTER_PORT, node_rank: $NODE_RANK"
    CMD="$LAUNCHER $SCRIPT $SCRIPT_ARGS"
    srun --nodes=1 --ntasks=1 --ntasks-per-node=1 --nodelist=$node $CMD &
done

wait

Success log:

2024-04-08 16:48:47 INFO     Accelerator prepared at cuda:1 / process index : 4, local process index : 1   sdxl_train.py:203
                    INFO     Waiting for everyone / 他のプロセスを待機中                                     sdxl_train.py:204
2024-04-08 16:48:48 INFO     Accelerator prepared at cuda:0 / process index : 4, local process index : 0   sdxl_train.py:203
                    INFO     Waiting for everyone / 他のプロセスを待機中                                     sdxl_train.py:204
...
2024-04-08 16:48:50 INFO     All processes are ready / すべてのプロセスが準備完了                            sdxl_train.py:206
                    INFO     loading model for process 1 3/4                                               sdxl_train_util.py:28
...
2024-04-08 16:49:01 INFO     model loaded for all processes 0 2 /4                                         sdxl_train_util.py:56

steps:   0%|          | 1/643744 [02:04<22266:39:32, 124.52s/it]
steps:   0%|          | 1/643744 [02:04<22266:50:43, 124.52s/it, avr_loss=0.0703]
steps:   0%|          | 2/643744 [02:15<12092:54:30, 67.63s/it, avr_loss=0.0703] 
steps:   0%|          | 2/643744 [02:15<12092:58:25, 67.63s/it, avr_loss=0.0918]
steps:   0%|          | 3/643744 [02:25<8695:59:50, 48.63s/it, avr_loss=0.0918] 
steps:   0%|          | 3/643744 [02:25<8696:01:52, 48.63s/it, avr_loss=0.0919]

....

Also, a skip_file_existence_check = true option is added to skip the file-verification step at the start of training.

This should only be enabled when all files are known to be usable, since it bypasses the os.path.exists() check for every file.
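Conceptually, the option behaves like the following sketch (the function and argument names here are illustrative, not the exact sd-scripts code):

import os

def check_dataset_files(image_paths, skip_file_existence_check=False):
    if skip_file_existence_check:
        # Trust the caller: assume every listed file exists and is readable.
        return list(image_paths)
    # Default: verify each file on disk, which can be slow for large datasets
    # on network storage.
    return [p for p in image_paths if os.path.exists(p)]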

@kohya-ss (Owner) commented Apr 8, 2024

Thank you for this! I don't use multi-node training myself, but this looks good.

@evkogs commented Aug 11, 2024

Hi @kohya-ss, it's @GrigoryEvko here.
I used this PR on 3 nodes of 8 A100s each two months ago; it works fine and can be merged.
I feel it is even more relevant for FLUX model training now than it was before.

My dev branch (a bit outdated) with these updates is here: dev...evkogs:sd-scripts:dev

I didn't try saving the training state with this PR; maybe #1340 is required as well.

I can test and create a new PR against the latest dev to merge into, but it would be most useful to merge flux, sd3, and this PR into dev; I can help a bit with those too.
