Commit: Improve doc

michaelbenayoun committed May 3, 2024
1 parent 1f7dec4 commit 17ebda6
Showing 2 changed files with 36 additions and 32 deletions.
5 changes: 3 additions & 2 deletions docs/source/guides/distributed_training.mdx
@@ -18,8 +18,9 @@ But there is a caveat: each Neuron core is an independent data-parallel worker b
To alleviate that, `optimum-neuron` supports parallelism features enabling you to harness the full power of your Trainium instance:

1. [ZeRO-1](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/tutorials/training/zero1_gpt2.html): an optimization of data parallelism that shards the optimizer state (which usually represents half of the memory needed on the device) across the data-parallel ranks.
2. [Tensor Parallelism](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/neuronx-distributed/tensor_parallelism_overview.html): a technique that shards each of your model's matrix multiplications along a given axis (row or column) across multiple devices. It is also known as intra-layer model parallelism. The number of devices to shard your parameters on is called the `tensor_parallel_size`.
3. [Sequence Parallelism](https://arxiv.org/pdf/2205.05198.pdf): an optimization on top of Tensor Parallelism that shards the activations along the sequence axis outside of the tensor-parallel regions. It saves memory by sharding the activations.
4. [Pipeline Parallelism](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/neuronx-distributed/pipeline_parallelism_overview.html): a technique that shards the model's block layers across multiple devices. It is also known as inter-layer model parallelism. The number of devices to shard your layers on is called the `pipeline_parallel_size`.


The good news is that it is possible to combine those techniques, and `optimum-neuron` makes it very easy!
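
For instance, a configuration combining ZeRO-1 with tensor parallelism could be sketched as below. This is a minimal sketch, not a definitive recipe: the argument names (`zero_1`, `tensor_parallel_size`, `pipeline_parallel_size`) and the values used are assumptions to check against the `NeuronTrainingArguments` reference.

```python
# A minimal sketch, assuming NeuronTrainingArguments exposes the parallelism
# knobs under these names; the values are illustrative, not a recommendation.
from optimum.neuron import NeuronTrainingArguments

training_args = NeuronTrainingArguments(
    output_dir="output",
    bf16=True,
    zero_1=True,               # ZeRO-1: shard the optimizer state across data-parallel ranks
    tensor_parallel_size=8,    # tensor parallelism: shard matrix multiplications across 8 Neuron cores
    pipeline_parallel_size=1,  # pipeline parallelism disabled in this example
    per_device_train_batch_size=1,
)
```
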
63 changes: 33 additions & 30 deletions docs/source/training_tutorials/finetune_llm.mdx
@@ -21,24 +21,26 @@ This tutorial will teach you how to fine-tune open source LLMs like [Llama 3](ht

You will learn how to:

1. [Setup AWS Environment](#1-setup-aws-environment)
2. [Load and prepare the dataset](#2-load-and-prepare-the-dataset)
3. [Fine-tune Llama on AWS Trainium using the `NeuronTrainer`](#3-fine-tune-llama-on-aws-trainium-using-the-neurontrainer)
4. [Launch Training](#4-launch-training)
5. [Evaluate and test fine-tuned Llama model](#5-evaluate-and-test-fine-tuned-llama-model)


<Tip>

While we will use `Llama-3 8B` in this tutorial, it is completely possible to use other models, simply by switching the `model_id`.
For instance, it is possible to fine-tune:

- Mistral models, such as [Mistral 7b (`mistralai/Mistral-7B-Instruct-v0.2`)](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2)
- Llama-2 models, such as [Llama-2 7b (`meta-llama/Llama-2-7b-hf`)](https://huggingface.co/meta-llama/Llama-2-7b-hf)

And many others!

</Tip>

## 1. Setup AWS Environment

Before starting this tutorial, you will need to:

@@ -47,18 +49,16 @@ Before starting this tutorial, you will need to:
```bash
huggingface-cli login --token YOUR_TOKEN
```
3. Make sure you have access to the model. Some open source models are gated, meaning that users need to apply to the model owner to be able to use the model weights. Here we will be training Llama-3 8B, for which there are two possibilities:
* Official gated repo: [`meta-llama/Meta-Llama-3-8B`](https://huggingface.co/meta-llama/Meta-Llama-3-8B)
* Un-gated repo: [`NousResearch/Meta-Llama-3-8B`](https://huggingface.co/NousResearch/Meta-Llama-3-8B)
4. Clone the Optimum Neuron repository, which contains the [complete script](https://github.com/huggingface/optimum-neuron/blob/main/docs/source/training_tutorials/finetune_llm.py) described in this tutorial.
```bash
git clone https://github.com/huggingface/optimum-neuron.git
```

_Note: There is also a notebook version of this tutorial [here](https://github.com/huggingface/optimum-neuron/blob/main/notebooks/text-generation/llama2-7b-fine-tuning.ipynb)._

## 2. Load and prepare the dataset

@@ -106,7 +106,6 @@ The following function `pack_dataset` takes a `dataset` and a `chunk_length` and
```python
from functools import partial
from itertools import chain

# empty list to save remainder from batches to use in next batch
remainder = {"input_ids": [], "attention_mask": [], "token_type_ids": []}
```
@@ -196,22 +195,22 @@ lm_dataset.save_to_disk(dataset_path)

Normally you would use the **[Trainer](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.Trainer)** and **[TrainingArguments](https://huggingface.co/docs/transformers/main/en/main_classes/trainer#transformers.TrainingArguments)** to fine-tune PyTorch-based transformer models.

But together with AWS, we have developed the [`~optimum.neuron.NeuronTrainer`] to improve performance, robustness, and ease of use when training on Trainium instances. It can be used as a 1-to-1 replacement for the `Trainer`.

When it comes to distributed training on AWS Trainium, there are a few things we need to take care of. Since Llama-3 8B is a big model, it will not fit on a single Neuron core; that is why we support different distributed-training strategies:

1. [ZeRO-1](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/tutorials/training/zero1_gpt2.html): an optimization of data parallelism that shards the optimizer state (which usually represents half of the memory needed on the device) across the data-parallel ranks.
2. [Tensor Parallelism](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/neuronx-distributed/tensor_parallelism_overview.html): a technique that shards each of your model's matrix multiplications along a given axis (row or column) across multiple devices. It is also known as intra-layer model parallelism. The number of devices to shard your parameters on is called the `tensor_parallel_size`.
3. [Sequence Parallelism](https://arxiv.org/pdf/2205.05198.pdf): an optimization on top of Tensor Parallelism that shards the activations along the sequence axis outside of the tensor-parallel regions. It saves memory by sharding the activations.
4. [Pipeline Parallelism](https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/neuronx-distributed/pipeline_parallelism_overview.html): a technique that shards the model's block layers across multiple devices. It is also known as inter-layer model parallelism. The number of devices to shard your layers on is called the `pipeline_parallel_size`.

<Tip>

If you want to know more about distributed training you can take a look at the [documentation](https://huggingface.co/docs/optimum-neuron/guides/distributed_training).

</Tip>

Here, since we want to fine-tune an 8B model, we will not need to use pipeline parallelism.
Our training code will look as follows:

```python
@@ -236,33 +235,33 @@ trainer = Trainer(
# Start training
trainer.train()

trainer.save_model()  # saves the tokenizer too for easy upload
```

The key points here are:

- We use the `lazy_load_for_parallelism` context manager to lazily load the model. This will not load the full model weights on each worker, but instead only load the required weights. **This is much more memory efficient, and often mandatory to use.**
- We use the [`~optimum.neuron.NeuronTrainer`] to perform training. It will take the lazily loaded model, along with the `training_args`, which are an instance of [`~optimum.neuron.NeuronTrainingArguments`], and will handle all the parallelization and training on the Neuron cores (see the condensed sketch below).
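
Putting these points together, a condensed sketch of the whole setup could look as follows. The model id, the dataset path, and the hyperparameter values are illustrative assumptions, and the exact argument names should be checked against the full `finetune_llm.py` script.

```python
# A condensed sketch, not the exact tutorial script: names and values below are
# assumptions chosen for illustration.
from datasets import load_from_disk
from transformers import AutoModelForCausalLM, AutoTokenizer

from optimum.neuron import NeuronTrainer as Trainer
from optimum.neuron import NeuronTrainingArguments as TrainingArguments
from optimum.neuron.distributed import lazy_load_for_parallelism

model_id = "meta-llama/Meta-Llama-3-8B"  # or the un-gated NousResearch mirror
tokenizer = AutoTokenizer.from_pretrained(model_id)
lm_dataset = load_from_disk("lm_dataset")  # the packed dataset prepared in step 2

training_args = TrainingArguments(
    output_dir="dolly_llama",
    bf16=True,
    tensor_parallel_size=8,  # shard the model weights over 8 Neuron cores
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    num_train_epochs=3,
)

# Lazily load the model so that each worker only materializes the shards it needs.
with lazy_load_for_parallelism(tensor_parallel_size=training_args.tensor_parallel_size):
    model = AutoModelForCausalLM.from_pretrained(model_id)

trainer = Trainer(
    model=model,
    args=training_args,
    tokenizer=tokenizer,
    train_dataset=lm_dataset,
)

# Start training
trainer.train()

trainer.save_model()  # saves the tokenizer too for easy upload
```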

## 4. Launch Training

We prepared a script called [finetune_llm.py](https://github.com/huggingface/optimum-neuron/blob/main/docs/source/training_tutorials/finetune_llm.py) that sums up everything mentioned in this tutorial.

<Tip>

This script, and everything presented in this tutorial, is inspired by our official example training script for causal language modeling fine-tuning, [run_clm.py](https://github.com/huggingface/optimum-neuron/blob/main/examples/language-modeling/run_clm.py).

Feel free to take `finetune_llm.py` or `run_clm.py` and adapt them to your own needs.

</Tip>

### Precompilation

When training models on AWS Trainium, we first need to compile our model with our training arguments. This compilation step can take a while.

To overcome this, we added a [model cache repository](https://huggingface.co/docs/optimum-neuron/guides/cache_system), which allows us to use precompiled models and configurations from the Hugging Face Hub and skip the compilation step. But be careful: every change in the config will lead to a new compilation, which could result in some cache misses.

_Note: If your configuration is not cached, please open an issue on [GitHub](https://github.com/huggingface/optimum-neuron/issues); we are happy to include it._

We already precompiled the config for this training, meaning you can either skip the cell below or rerun it: it will only take a few minutes since it reuses the cached configuration.

@@ -291,7 +290,9 @@ _Note: Compiling without a cache can take ~40 minutes. It will also create dummy
```bash
rm -rf dolly_llama
```
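
For reference, the precompilation launch can be sketched as follows: the training command from the next section is simply wrapped with `neuron_parallel_compile`, and the script arguments shown are placeholders that should match the ones of your actual run.

```bash
# A sketch of the precompilation run; the script arguments are placeholders.
MALLOC_ARENA_MAX=64 neuron_parallel_compile torchrun --nproc_per_node=32 finetune_llm.py \
  --per_device_train_batch_size 1 \
  --gradient_accumulation_steps 16
```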

### Actual Training

After the compilation is done, we can start our training with a similar command; we just need to remove the use of `neuron_parallel_compile`.

We will use `torchrun` to launch our training script. `torchrun` is a tool that automatically distributes a PyTorch training job across multiple accelerators. We can pass the number of accelerators as the `nproc_per_node` argument alongside our hyperparameters.

@@ -316,10 +317,12 @@ MALLOC_ARENA_MAX=64 torchrun --nproc_per_node=32 finetune_llm.py \
```bash
  --gradient_accumulation_steps 16
```

That's it, we successfully trained Llama-3 8B on AWS Trainium. <!-- TODO: update the time here. --> The training for 3 epochs on dolly (15k samples) took 43:24 minutes, where the raw training time was only 31:46 minutes. This leads to a cost of ~$15.5 for the end-to-end training on the trn1.32xlarge instance. Not bad!

But before we can share and test our model, we need to consolidate its weights. Since we used Tensor Parallelism during training, the model weights are sharded across the different workers, and only sharded checkpoints are saved during training.

### Consolidate the Checkpoint

The Optimum CLI provides a way of doing that very easily via the `optimum-cli neuron consolidate` command:

```bash
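# Illustrative invocation: the checkpoint and output paths below are assumptions,
# not the exact values from the tutorial. This merges the sharded checkpoints
# written under dolly_llama/ into a single set of model weights.
optimum-cli neuron consolidate dolly_llama dolly_llama
```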
