
# t5_11

Houses our model example of fine-tuning an 11B T5 with FSDP to create a world-class grammar checker.

To get going:

```shell
pip install -r requirements.txt
```

A large and a small dataset are already present in the project (`grammar_train.csv` = small, `gtrain_150K.csv` = large).
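For reference, loading one of these CSVs into (input, target) pairs might look like the sketch below. The column names are an assumption, not taken from the repo — check the actual header of `grammar_train.csv` before relying on them:

```python
import csv

def load_pairs(path):
    """Read (input, target) grammar-correction pairs from a CSV.

    The "input" and "target" column names are hypothetical; inspect the
    real CSV header before using this.
    """
    with open(path, newline="") as f:
        return [(row["input"], row["target"]) for row in csv.DictReader(f)]
```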

To baseline your environment or this model (adjust `--nproc_per_node` to equal your GPU count):

```shell
torchrun --nnodes=1 --nproc_per_node=8 --rdzv_id=101 --rdzv_endpoint="localhost:5679" main_benchmark.py
```

On an A100 (p4d.24xlarge) you should expect to see:

*(benchmark_t5 results image)*

To train with `mp.spawn`:

```shell
python main.py
```
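The spawn-based launch means `main.py` itself forks one worker process per GPU, each identified by its rank. Stripped of torch specifics, the pattern looks roughly like this stdlib stand-in (it uses a process pool and the fork start method to stay self-contained; `torch.multiprocessing.spawn` creates the processes itself and passes each worker its rank the same way):

```python
import multiprocessing as mp

def worker(rank, world_size):
    # In the real trainer each rank would call dist.init_process_group,
    # wrap the model in FSDP, and run its shard of the training loop.
    return f"rank {rank}/{world_size}"

def run_workers(world_size=2):
    # stdlib stand-in for torch.multiprocessing.spawn: one process per rank
    ctx = mp.get_context("fork")
    with ctx.Pool(world_size) as pool:
        return pool.starmap(worker, [(r, world_size) for r in range(world_size)])
```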

Or, better, with torchrun:

```shell
torchrun --nnodes=1 --nproc_per_node=8 --rdzv_id=101 --rdzv_endpoint="localhost:5679" main_elastic.py
```
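With the torchrun launch, each worker process finds its place in the job through the environment variables torchrun exports (`RANK`, `WORLD_SIZE`, `LOCAL_RANK`). A script can read them like this; the single-process fallback values are a guess at how a script might also support plain `python main.py`:

```python
import os

def dist_env():
    """Read the rendezvous info torchrun exports for each worker.

    The defaults (rank 0, world size 1) let the same script run
    single-process when launched without torchrun.
    """
    return {
        "rank": int(os.environ.get("RANK", 0)),
        "world_size": int(os.environ.get("WORLD_SIZE", 1)),
        "local_rank": int(os.environ.get("LOCAL_RANK", 0)),
    }
```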

You can control the model size, dataset size, batch size, etc., in `config/defaults.py`.
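As a rough illustration, such a defaults file often boils down to a dataclass of knobs like the one below. Every field name here is hypothetical — open `config/defaults.py` for the options the repo actually exposes:

```python
from dataclasses import dataclass

@dataclass
class TrainConfig:
    # Hypothetical knobs; see config/defaults.py for the real ones.
    model_name: str = "t5-11b"               # which T5 checkpoint to fine-tune
    dataset_file: str = "grammar_train.csv"  # small set; swap for gtrain_150K.csv
    batch_size: int = 4
    epochs: int = 1
```

Overriding a default is then just a constructor argument, e.g. `TrainConfig(dataset_file="gtrain_150K.csv", batch_size=8)`.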