Run Instruction

Follow the steps in README.md.
Training script in 2.2 Run this recipe for T5

If running on AzureML,

cd huggingface/script
python hf-ort.py --gpu_cluster_name <gpu_cluster_name> --hf_model t5-large --run_config ort

If running locally,

cd huggingface/script
python hf-ort.py --hf_model t5-large --run_config ort --process_count <process_count> --local_run

Performance Comparison

Run configuration	PyTorch	ORTModule	Gain
fp16	262.34	329.02	25.4%
fp16 with deepspeed stage 1	253.25	277.47	9.6%
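
For reference, the Gain column can be reproduced from the two throughput columns. The snippet below is a minimal sketch that assumes Gain is the relative improvement of ORTModule over PyTorch, which matches the reported values; the function name `relative_gain` is illustrative and not part of the recipe.

```python
def relative_gain(baseline: float, candidate: float) -> float:
    """Percentage throughput improvement of candidate over baseline."""
    return (candidate - baseline) / baseline * 100.0


# Throughput in samples/sec, copied from the table: (PyTorch, ORTModule).
results = {
    "fp16": (262.34, 329.02),
    "fp16 with deepspeed stage 1": (253.25, 277.47),
}

for config, (pytorch, ortmodule) in results.items():
    # Prints the gain rounded to one decimal place, as in the table.
    print(f"{config}: {relative_gain(pytorch, ortmodule):.1f}%")
```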

These numbers are the average samples/sec over 10 runs on ND40rs_v2 VMs (V100 32G x 8) with CUDA 11, the stable release onnxruntime_training-1.8.0+cu111-cp36-cp36m-manylinux2014_x86_64.whl, and a batch size of 16. A CUDA 10.2 option is also available through the --use_cu102 flag. Please check the Dockerfile for dependency details.

The metric we report is stable_train_samples_per_second from the log, which discards the first step because it includes setup time. Note that ORTModule takes some time for its initial setup, so a small --max_steps value may give ORTModule a longer total run time than PyTorch. That said, if you want fine-tuning to finish sooner, set --max_steps to a smaller value.

Lastly, we do not recommend running this recipe on NC series VMs, which use an older GPU architecture (K80).
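
To illustrate why discarding the first step matters, here is a minimal sketch of a throughput calculation that skips it. This is not the script's actual implementation: the helper `stable_samples_per_second`, the per-step timings, and the use of a single batch size are all assumptions made for the example.

```python
def stable_samples_per_second(step_seconds, batch_size):
    """Throughput over all steps except the first (hypothetical helper)."""
    if len(step_seconds) < 2:
        raise ValueError("need at least two steps to discard the first")
    stable = step_seconds[1:]  # drop the first step, which includes setup time
    return batch_size * len(stable) / sum(stable)


# Made-up timings: a slow first step followed by three steady steps.
step_seconds = [30.0, 0.5, 0.5, 0.5]
batch_size = 16

stable = stable_samples_per_second(step_seconds, batch_size)
overall = batch_size * len(step_seconds) / sum(step_seconds)
print(f"stable: {stable:.2f} samples/sec, including first step: {overall:.2f}")
```

With only a few steps, the setup cost dominates the overall figure, which is why a small --max_steps value can make total run time look worse even when steady-state throughput is higher.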
