add llama example (#1382)
* add llama example

* lint

* more lint

* introduce use_peft flag

* update readme

* address comments

---------

Co-authored-by: Prathik Rao <[email protected]@orttrainingdev8.d32nl1ml4oruzj4qz3bqlggovf.px.internal.cloudapp.net>
prathikr and Prathik Rao authored Sep 19, 2023
1 parent 89d08c4 commit 7fc27f6
Showing 3 changed files with 841 additions and 0 deletions.
54 changes: 54 additions & 0 deletions examples/onnxruntime/training/text-classification/README.md
@@ -16,6 +16,60 @@ limitations under the License.

# Text classification

By running the script [`run_classification.py`](https://github.com/huggingface/optimum/blob/main/examples/onnxruntime/training/text-classification/run_classification.py),
you can leverage the [`ONNX Runtime`](https://github.com/microsoft/onnxruntime) accelerator to fine-tune models from the
[Hugging Face Hub](https://huggingface.co/models) for text classification tasks.


__The following example uses the training acceleration features powered by ONNX Runtime.__


### ONNX Runtime Training

The following example fine-tunes [meta-llama/Llama-2-7b-hf](https://huggingface.co/meta-llama/Llama-2-7b-hf) on the [Amazon Reviews Dataset](https://huggingface.co/datasets/amazon_reviews_multi).

```bash
torchrun --nproc_per_node=NUM_GPUS_YOU_HAVE run_classification.py \
--model_name_or_path meta-llama/Llama-2-7b-hf \
--dataset_name amazon_reviews_multi \
--dataset_config_name en \
--shuffle_train_dataset \
--metric_name accuracy \
--text_column_name 'review_title,review_body,product_category' \
--text_column_delimiter ' ' \
--label_column_name stars \
--do_train \
--do_eval \
--fp16 \
--max_seq_length 128 \
--per_device_train_batch_size 16 \
--learning_rate 2e-5 \
--num_train_epochs 1 \
--deepspeed zero_stage_2.json \
--use_peft \
--output_dir /tmp/ort-llama-2/
```

### Performance

We obtain the following results for [meta-llama/Llama-2-7b-hf](https://huggingface.co/meta-llama/Llama-2-7b-hf) with mixed-precision training, LoRA, and ZeRO Stage 2 under the PyTorch and ONNX Runtime backends. The experiment was run
for 10 epochs on 8 NVIDIA V100 GPUs:

| Model                    | Backend      | Runtime (s) | Train samples/s |
|--------------------------|--------------|-------------|-----------------|
| meta-llama/Llama-2-7b-hf | PyTorch      | 17035.9055  | 117.399         |
| meta-llama/Llama-2-7b-hf | ONNX Runtime | 15532.2403  | 128.764         |

The gains of ONNX Runtime over PyTorch are as follows:

| Model                    | Latency gain | Throughput gain |
|--------------------------|--------------|-----------------|
| meta-llama/Llama-2-7b-hf | 8.83%        | 9.68%           |
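The percentages above follow directly from the runtime and throughput table: latency gain is the relative reduction in total runtime, and throughput gain is the relative increase in train samples/s. A quick sketch of the arithmetic, with the values copied from the table:

```shell
# latency gain    = (pt_runtime - ort_runtime) / pt_runtime
# throughput gain = (ort_samples - pt_samples) / pt_samples
awk 'BEGIN {
  pt_runtime = 17035.9055; ort_runtime = 15532.2403   # total runtime, seconds
  pt_samples = 117.399;    ort_samples = 128.764      # train samples/s
  printf "latency gain: %.2f%%\n", (pt_runtime - ort_runtime) / pt_runtime * 100
  printf "throughput gain: %.2f%%\n", (ort_samples - pt_samples) / pt_samples * 100
}'
# latency gain: 8.83%
# throughput gain: 9.68%
```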

#### DeepSpeed

[zero_stage_2.json](https://github.com/huggingface/optimum/blob/main/examples/onnxruntime/training/text-classification/zero_stage_2.json) is an example DeepSpeed config file that enables ZeRO Stage 2 optimization (partitioning of optimizer states and gradients across data-parallel workers) for training meta-llama/Llama-2-7b-hf. More information can be found in [DeepSpeed's official repo](https://github.com/microsoft/DeepSpeed).
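For illustration only (this is not the contents of the repository's `zero_stage_2.json`), a minimal ZeRO Stage 2 config typically looks like the sketch below; the `"auto"` values let the `Trainer` fill in settings from its own command-line arguments:

```json
{
  "fp16": {
    "enabled": "auto"
  },
  "zero_optimization": {
    "stage": 2,
    "overlap_comm": true,
    "contiguous_gradients": true
  },
  "gradient_accumulation_steps": "auto",
  "train_micro_batch_size_per_gpu": "auto"
}
```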

## GLUE Tasks

By running the script [`run_glue.py`](https://github.com/huggingface/optimum/blob/main/examples/onnxruntime/training/text-classification/run_glue.py),
