- Follow steps in README.md
- Training script in 2.2 Run this recipe for T5

If running on AzureML,

    cd huggingface/script
    python hf-ort.py --gpu_cluster_name <gpu_cluster_name> --hf_model t5-large --run_config ort

If running locally,

    cd huggingface/script
    python hf-ort.py --hf_model t5-large --run_config ort --process_count <process_count> --local_run

| Run configuration | PyTorch (samples/sec) | ORTModule (samples/sec) | Gain |
|---|---|---|---|
| fp16 | 262.34 | 329.02 | 25.4% |
| fp16 with deepspeed stage 1 | 253.25 | 277.47 | 9.6% |
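
The Gain column is consistent with the relative throughput increase of ORTModule over PyTorch. The snippet below is a minimal sketch of that arithmetic using the table values; the function name `relative_gain` is hypothetical and not part of the recipe.

```python
def relative_gain(baseline: float, improved: float) -> float:
    """Percentage throughput increase of `improved` over `baseline`."""
    return (improved - baseline) / baseline * 100.0


# (PyTorch, ORTModule) samples/sec, copied from the table
results = {
    "fp16": (262.34, 329.02),
    "fp16 with deepspeed stage 1": (253.25, 277.47),
}

# Round to one decimal place, matching the precision used in the table
gains = {
    name: round(relative_gain(pytorch, ortmodule), 1)
    for name, (pytorch, ortmodule) in results.items()
}

for name, gain in gains.items():
    print(f"{name}: {gain}%")
```
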

The throughput numbers in the table are the average samples/sec over 10 runs on ND40rs_v2 VMs (V100 32G x 8) with CUDA 11, the stable release onnxruntime_training-1.8.0%2Bcu111-cp36-cp36m-manylinux2014_x86_64.whl, and a batch size of 16. A CUDA 10.2 option is also available through the --use_cu102 flag. Please check the dependency details in the Dockerfile.

We report the stable_train_samples_per_second metric from the log, which discards the first step because that step includes setup time. Note that ORTModule takes some time for its initial setup, so with a small --max_steps value the total run time for ORTModule may be longer than for PyTorch. That said, if you simply want fine-tuning to finish faster, set --max_steps to a smaller value.

Lastly, we do not recommend running this recipe on NC series VMs, which use an older architecture (K80).
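
To illustrate why a small --max_steps can favor PyTorch in total run time, the sketch below estimates the step count at which ORTModule's higher throughput pays back its extra setup time. The throughput values come from the fp16 row of the table. The extra setup time (60 seconds) and the samples per step (16 per GPU x 8 GPUs) are assumptions chosen only for illustration, not measurements from this recipe, and the helper names are hypothetical.

```python
def total_seconds(setup_s: float, steps: int, samples_per_step: int,
                  samples_per_sec: float) -> float:
    """Rough wall-clock estimate: one-time setup plus steady-state training."""
    return setup_s + steps * samples_per_step / samples_per_sec


def break_even_steps(extra_setup_s: float, samples_per_step: int,
                     baseline_sps: float, faster_sps: float) -> float:
    """Steps after which the faster backend recovers its extra setup cost."""
    seconds_saved_per_step = samples_per_step * (1 / baseline_sps - 1 / faster_sps)
    return extra_setup_s / seconds_saved_per_step


PYTORCH_SPS = 262.34      # fp16, from the table
ORTMODULE_SPS = 329.02    # fp16, from the table
EXTRA_SETUP_S = 60.0      # ASSUMPTION: hypothetical extra ORTModule setup time
SAMPLES_PER_STEP = 16 * 8 # ASSUMPTION: batch size 16 per GPU on 8 GPUs

steps = break_even_steps(EXTRA_SETUP_S, SAMPLES_PER_STEP, PYTORCH_SPS, ORTMODULE_SPS)
print(f"Estimated break-even at about {steps:.0f} steps")
```

Under these assumptions, runs shorter than the break-even point finish sooner with PyTorch and longer runs finish sooner with ORTModule. Measure the actual setup time on your own hardware before relying on such an estimate.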