Missing packages when running the "Supervised Fine-Tuning of Llama 3 8B on one AWS Trainium instance" sample #720

Open
yahavb opened this issue Oct 17, 2024 · 2 comments
Labels
bug Something isn't working

Comments

yahavb (Contributor) commented Oct 17, 2024

System Info

PyTorch 1.13.1 with NeuronX Training and HuggingFace transformers
Neuron 2.18.0
Python 3.10 (py310)
DLC 763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-training-neuronx:1.13.1-transformers4.36.2-neuronx-py310-sdk2.18.0-ubuntu20.04

Who can help?

@michaelbenayoun @JingyaHuang

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction (minimal, reproducible, runnable)

The precompilation step in https://huggingface.co/docs/optimum-neuron/en/training_tutorials/sft_lora_finetune_llm fails with many missing packages. Is there a specific DLC we can use?

Expected behavior

The tutorial runs successfully. The "Fine-tune and Test Llama-3 8B on AWS Trainium" tutorial works without issue with the same settings.

yahavb added the bug (Something isn't working) label on Oct 17, 2024
michaelbenayoun (Member) commented:

Do you have the names of the missing packages, by any chance?

yahavb (Contributor, Author) commented Oct 18, 2024

docker run -it --privileged  -v /home/ec2-user:/home/ubuntu/ 763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-training-neuronx:1.13.1-transformers4.36.2-neuronx-py310-sdk2.18.0-ubuntu20.04 bash

apt-get update 
...
pip install --upgrade pip
....
pip3 install peft trl
...
git clone https://github.com/huggingface/optimum-neuron.git
cd optimum-neuron
pip3 install .
....
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
neuronx-cc 2.13.66.0+6dfecc895 requires protobuf<3.20, but you have protobuf 3.20.3 which is incompatible.
....
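For what it's worth, the protobuf conflict reported above can usually be worked around by re-pinning protobuf below 3.20 after installing optimum-neuron. This is only a sketch of a possible workaround, and it assumes a 3.19.x protobuf still satisfies the rest of the stack in this image:

pip3 install "protobuf<3.20"
python -c "import google.protobuf; print(google.protobuf.__version__)"   # expected to print a 3.19.x version if the pin took effect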
#!/bin/bash
set -ex

export NEURON_FUSE_SOFTMAX=1
export NEURON_RT_ASYNC_EXEC_MAX_INFLIGHT_REQUESTS=3
export MALLOC_ARENA_MAX=64
export NEURON_CC_FLAGS="--model-type=transformer --distribution-strategy=llm-training --enable-saturate-infinity --cache_dir=/home/ubuntu/cache_dir_neuron/"

PROCESSES_PER_NODE=8

NUM_EPOCHS=1
TP_DEGREE=2
PP_DEGREE=1
BS=1
GRADIENT_ACCUMULATION_STEPS=8
LOGGING_STEPS=1
MODEL_NAME="meta-llama/Meta-Llama-3-8B"
OUTPUT_DIR=output-$SLURM_JOB_ID

if [ "$NEURON_EXTRACT_GRAPHS_ONLY" = "1" ]; then
    MAX_STEPS=$((LOGGING_STEPS + 5))
else
    MAX_STEPS=-1
fi


XLA_USE_BF16=1 neuron_parallel_compile torchrun --nproc_per_node $PROCESSES_PER_NODE docs/source/training_tutorials/sft_lora_finetune_llm.py \
  --model_id $MODEL_NAME \
  --num_train_epochs $NUM_EPOCHS \
  --do_train \
  --learning_rate 5e-5 \
  --warmup_ratio 0.03 \
  --max_steps $MAX_STEPS \
  --per_device_train_batch_size $BS \
  --per_device_eval_batch_size $BS \
  --gradient_accumulation_steps $GRADIENT_ACCUMULATION_STEPS \
  --gradient_checkpointing true \
  --bf16 \
  --zero_1 false \
  --tensor_parallel_size $TP_DEGREE \
  --pipeline_parallel_size $PP_DEGREE \
  --logging_steps $LOGGING_STEPS \
  --save_total_limit 1 \
  --output_dir $OUTPUT_DIR \
  --lr_scheduler_type "constant" \
  --overwrite_output_dir
....
+ export NEURON_FUSE_SOFTMAX=1
+ NEURON_FUSE_SOFTMAX=1
+ export NEURON_RT_ASYNC_EXEC_MAX_INFLIGHT_REQUESTS=3
+ NEURON_RT_ASYNC_EXEC_MAX_INFLIGHT_REQUESTS=3
+ export MALLOC_ARENA_MAX=64
+ MALLOC_ARENA_MAX=64
+ export 'NEURON_CC_FLAGS=--model-type=transformer --distribution-strategy=llm-training --enable-saturate-infinity --cache_dir=/home/ubuntu/cache_dir_neuron/'
+ NEURON_CC_FLAGS='--model-type=transformer --distribution-strategy=llm-training --enable-saturate-infinity --cache_dir=/home/ubuntu/cache_dir_neuron/'
+ PROCESSES_PER_NODE=8
+ NUM_EPOCHS=1
+ TP_DEGREE=2
+ PP_DEGREE=1
+ BS=1
+ GRADIENT_ACCUMULATION_STEPS=8
+ LOGGING_STEPS=1
+ MODEL_NAME=meta-llama/Meta-Llama-3-8B
+ OUTPUT_DIR=output-
+ '[' '' = 1 ']'
+ MAX_STEPS=-1
+ XLA_USE_BF16=1
+ neuron_parallel_compile torchrun --nproc_per_node 8 docs/source/training_tutorials/sft_lora_finetune_llm.py --model_id meta-llama/Meta-Llama-3-8B --num_train_epochs 1 --do_train --learning_rate 5e-5 --warmup_ratio 0.03 --max_steps -1 --per_device_train_batch_size 1 --per_device_eval_batch_size 1 --gradient_accumulation_steps 8 --gradient_checkpointing true --bf16 --zero_1 false --tensor_parallel_size 2 --pipeline_parallel_size 1 --logging_steps 1 --save_total_limit 1 --output_dir output- --lr_scheduler_type constant --overwrite_output_dir
Traceback (most recent call last):
  File "/usr/local/bin/neuron_parallel_compile", line 5, in <module>
    from optimum.neuron.utils.neuron_parallel_compile import main
  File "/usr/local/lib/python3.10/site-packages/optimum/neuron/utils/neuron_parallel_compile.py", line 8, in <module>
    from torch_neuronx.parallel_compile.neuron_parallel_compile import LOGGER as torch_neuronx_logger
ModuleNotFoundError: No module named 'torch_neuronx.parallel_compile'
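A quick way to narrow this down is to check whether the torch-neuronx build inside the DLC ships the parallel_compile submodule at all; this is a diagnostic sketch, assuming pip metadata in the image is intact (the module path appears to differ between the SDK version in this DLC and the one newer optimum-neuron expects):

pip3 show torch-neuronx
python - <<'EOF'
# Show where torch_neuronx is imported from and whether the
# parallel_compile submodule is discoverable in that installation.
import importlib.util
import torch_neuronx
print("torch_neuronx loaded from:", torch_neuronx.__file__)
spec = importlib.util.find_spec("torch_neuronx.parallel_compile")
print("torch_neuronx.parallel_compile found:", spec is not None)
EOF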

I tried to grab the neuron drivers:

echo 'deb https://apt.repos.neuron.amazonaws.com jammy main' > /etc/apt/sources.list.d/neuron.list
wget -qO - https://apt.repos.neuron.amazonaws.com/GPG-PUB-KEY-AMAZON-AWS-NEURON.PUB | apt-key add - && apt-get update
apt-get install -y aws-neuronx-collectives=2.* aws-neuronx-runtime-lib=2.* aws-neuronx-tools=2.*
echo "export PATH=/opt/aws/neuron/bin:\$PATH" >> /root/.bashrc
PATH="${PATH}:/opt/aws/neuron/bin"

and python -c "import torch_neuronx" now runs without errors, but that did not help.
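One detail worth noting: the DLC tag ends in ubuntu20.04 (focal), while the repo line above targets jammy (22.04). If the base image really is 20.04, the focal variant of the Neuron apt repo would look like this (a sketch; the remaining install commands stay the same):

echo 'deb https://apt.repos.neuron.amazonaws.com focal main' > /etc/apt/sources.list.d/neuron.list
wget -qO - https://apt.repos.neuron.amazonaws.com/GPG-PUB-KEY-AMAZON-AWS-NEURON.PUB | apt-key add - && apt-get update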

I then removed neuron_parallel_compile and got:
...
Traceback (most recent call last):
  File "/home/ubuntu/optimum-neuron/docs/source/training_tutorials/sft_lora_finetune_llm.py", line 11, in <module>
    from optimum.neuron import NeuronHfArgumentParser as HfArgumentParser
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 883, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/usr/local/lib/python3.10/site-packages/optimum/neuron/__init__.py", line 18, in <module>
    from .trainers import Seq2SeqTrainiumTrainer, TrainiumTrainer
  File "/usr/local/lib/python3.10/site-packages/optimum/neuron/trainers.py", line 20, in <module>
    from transformers import Seq2SeqTrainer, Trainer
  File "<frozen importlib._bootstrap>", line 1075, in _handle_fromlist
  File "/usr/local/lib/python3.10/site-packages/transformers/utils/import_utils.py", line 1462, in __getattr__
  File "/usr/local/lib/python3.10/site-packages/transformers/trainer_seq2seq.py", line 26, in <module>
    from .trainer import Trainer
  File "/usr/local/lib/python3.10/site-packages/transformers/trainer.py", line 180, in <module>
    import torch_xla.distributed.spmd as xs
ModuleNotFoundError: No module named 'torch_xla.distributed.spmd'
...

So I tried reinstalling:

pip install torch-neuronx optimum[neuron] transformers

and still got the same ModuleNotFoundError: No module named 'torch_xla.distributed.spmd' error
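For reference, torch_xla.distributed.spmd only exists in newer torch-xla releases (roughly the 2.x line, which is an assumption here), while this DLC is built around torch/torch-xla 1.13. A short sketch to confirm which versions the reinstall actually left behind:

pip3 show torch-xla transformers | grep -E '^(Name|Version)'
python -c "import torch_xla; print(torch_xla.__version__)"
python -c "import importlib.util; print(importlib.util.find_spec('torch_xla.distributed.spmd'))"   # None means the module is absent

If torch-xla is still on the 1.13 line, the transformers version pulled in by the reinstall is likely too new for this DLC; that is a guess from the traceback above rather than something the logs confirm.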
