Commit

update README
kohya-ss committed Sep 2, 2024
1 parent 4f6d915 commit 6abacf0
Showing 1 changed file with 14 additions and 6 deletions.
20 changes: 14 additions & 6 deletions README.md
@@ -184,7 +184,7 @@ Options are almost the same as LoRA training. The difference is `--full_bf16`, `

`--blockwise_fused_optimizers` enables the fusing of the optimizer step into the backward pass for each block. This is similar to `--fused_backward_pass`. Any optimizer can be used, but Adafactor is recommended for memory efficiency. `--blockwise_fused_optimizers` cannot be used with `--fused_backward_pass`. Stochastic rounding is not supported for now.

`--double_blocks_to_swap` and `--single_blocks_to_swap` specify the number of double blocks and single blocks to swap. The default is None (no swap). These options must be combined with `--fused_backward_pass` or `--blockwise_fused_optimizers`. `--double_blocks_to_swap` can be combined with `--single_blocks_to_swap`. The recommended maximum number of blocks to swap is 9 for double blocks and 18 for single blocks. Please see the next chapter for details.

`--cpu_offload_checkpointing` offloads gradient checkpointing to the CPU. This reduces VRAM usage by about 2GB.
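To make the combination concrete, here is a rough sketch of how these options might be passed to `flux_train.py`. The model paths, dataset config, learning rate, and epoch count are illustrative placeholders rather than values taken from this README; the required arguments should follow the fine-tuning example earlier in the README.

```bash
# illustrative single-GPU fine-tuning command: fuse optimizer steps per block
# and swap 6 double / 18 single blocks between GPU and CPU
accelerate launch --mixed_precision bf16 --num_cpu_threads_per_process 1 flux_train.py \
  --pretrained_model_name_or_path flux1-dev.safetensors \
  --clip_l clip_l.safetensors --t5xxl t5xxl_fp16.safetensors --ae ae.safetensors \
  --dataset_config dataset.toml --output_dir outputs --output_name flux-ft \
  --save_model_as safetensors --sdpa --gradient_checkpointing \
  --mixed_precision bf16 --save_precision bf16 --full_bf16 \
  --learning_rate 5e-5 --max_train_epochs 4 \
  --optimizer_type adafactor \
  --optimizer_args "relative_step=False" "scale_parameter=False" "warmup_init=False" \
  --blockwise_fused_optimizers \
  --double_blocks_to_swap 6 --single_blocks_to_swap 18 \
  --cpu_offload_checkpointing
```

`--fused_backward_pass` could be used instead of `--blockwise_fused_optimizers` (the two cannot be combined), and `--double_blocks_to_swap` can be raised up to the recommended maximum of 9.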

@@ -198,24 +198,32 @@ The learning rate and the number of epochs are not optimized yet. Please adjust

#### Key Features for FLUX.1 fine-tuning

1. Technical details of double/single block swap:
- Reduce memory usage by transferring the double and single blocks of FLUX.1 from the GPU to the CPU when they are not needed.
- During the forward pass, the weights of blocks that have finished their calculation are transferred to the CPU, and the weights of blocks about to be calculated are transferred to the GPU.
- The same applies to the backward pass, but in reverse order. The gradients remain on the GPU.
- Since the transfer between CPU and GPU takes time, training will be slower.
- `--double_blocks_to_swap` and `--single_blocks_to_swap` specify the number of blocks to swap. For example, `--double_blocks_to_swap 6` swaps 6 blocks at each step of training, while the remaining 13 blocks stay on the GPU.
- About 640MB of memory can be saved per double block, and about 320MB per single block. For example, swapping the recommended maximum of 9 double blocks and 18 single blocks saves roughly 9 × 640MB + 18 × 320MB ≈ 11.5GB.

2. Sample Image Generation:
- Sample image generation during training is now supported.
- If `--cache_latents` is specified, the prompts are cached and reused for generation, so changing the prompts during training will not affect the generated images.
- Specify options such as `--sample_prompts` and `--sample_every_n_epochs` (see the sample prompt sketch after this list).
- Note: Generation will be very slow when `--split_mode` is specified.

3. Experimental Memory-Efficient Saving:
- `--mem_eff_save` option can further reduce memory consumption during model saving (about 22GB).
- This is a custom implementation and may cause unexpected issues. Use with caution.

4. T5XXL Token Length Control:
- Added `--t5xxl_max_token_length` option to specify the maximum token length of T5XXL.
- Default is 512 in dev and 256 in schnell models.

5. Multi-GPU Training Support:
- Note: `--double_blocks_to_swap` and `--single_blocks_to_swap` cannot be used in multi-GPU training.

6. Disable mmap Load for Safetensors:
- The `--disable_mmap_load_safetensors` option now works in `flux_train.py`.
- Speeds up model loading during training in WSL2.
- Effective in reducing memory usage when loading models during multi-GPU training (see the multi-GPU launch sketch after this list).
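A rough sketch of how sample generation during training might be set up. The prompt file name and its contents are made up for illustration; only `--sample_prompts` and `--sample_every_n_epochs` are the options described above.

```bash
# write a prompt file, one prompt per line (file name and prompts are illustrative)
cat > sample_prompts.txt <<'EOF'
a photo of a cat sitting on a windowsill
a watercolor painting of a mountain lake at sunrise
EOF

# then add these options to the training command shown earlier:
#   --sample_prompts sample_prompts.txt --sample_every_n_epochs 1
```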
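And a rough sketch of a multi-GPU launch with mmap loading disabled. The process count and the script arguments are illustrative; the block swap options from the single-GPU sketch are deliberately omitted because they cannot be used with multiple GPUs.

```bash
# multi-GPU launch: no --double_blocks_to_swap / --single_blocks_to_swap
accelerate launch --multi_gpu --num_processes 2 --mixed_precision bf16 \
  --num_cpu_threads_per_process 1 flux_train.py \
  --pretrained_model_name_or_path flux1-dev.safetensors \
  --clip_l clip_l.safetensors --t5xxl t5xxl_fp16.safetensors --ae ae.safetensors \
  --dataset_config dataset.toml --output_dir outputs --output_name flux-ft \
  --save_model_as safetensors --sdpa --gradient_checkpointing --full_bf16 \
  --optimizer_type adafactor \
  --optimizer_args "relative_step=False" "scale_parameter=False" "warmup_init=False" \
  --disable_mmap_load_safetensors
```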