Missing packages when running the "Supervised Fine-Tuning of Llama 3 8B on one AWS Trainium instance" sample #720

Open
yahavb opened this issue Oct 17, 2024 · 2 comments
Labels
bug Something isn't working

Comments

yahavb (Contributor) commented Oct 17, 2024

System Info

PyTorch 1.13.1 with NeuronX Training and HuggingFace transformers
Neuron 2.18.0
Python 3.10 (py310)
DLC 763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-training-neuronx:1.13.1-transformers4.36.2-neuronx-py310-sdk2.18.0-ubuntu20.04

Who can help?

@michaelbenayoun @JingyaHuang

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction (minimal, reproducible, runnable)

The precompilation step in https://huggingface.co/docs/optimum-neuron/en/training_tutorials/sft_lora_finetune_llm fails with many missing packages. Is there a specific DLC we can use?

Expected behavior

The tutorial runs successfully. The "Fine-tune and Test Llama-3 8B on AWS Trainium" tutorial works without issue with the same settings.

yahavb added the bug (Something isn't working) label on Oct 17, 2024
michaelbenayoun (Member) commented:

Do you have the names of the missing packages, by any chance?

yahavb (Contributor, Author) commented Oct 18, 2024

docker run -it --privileged  -v /home/ec2-user:/home/ubuntu/ 763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-training-neuronx:1.13.1-transformers4.36.2-neuronx-py310-sdk2.18.0-ubuntu20.04 bash

apt-get update 
...
pip install --upgrade pip
....
pip3 install peft trl
...
git clone https://github.com/huggingface/optimum-neuron.git
cd optimum-neuron
pip3 install .
....
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
neuronx-cc 2.13.66.0+6dfecc895 requires protobuf<3.20, but you have protobuf 3.20.3 which is incompatible.
....
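For what it's worth, the protobuf conflict reported above can usually be worked around by re-pinning protobuf below 3.20 after installing optimum-neuron. This is only a sketch of a possible workaround, and it assumes a 3.19.x protobuf still satisfies the rest of the stack in this image:

pip3 install "protobuf<3.20"
python -c "import google.protobuf; print(google.protobuf.__version__)"   # expected to print a 3.19.x version if the pin took effect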
#!/bin/bash
set -ex

export NEURON_FUSE_SOFTMAX=1
export NEURON_RT_ASYNC_EXEC_MAX_INFLIGHT_REQUESTS=3
export MALLOC_ARENA_MAX=64
export NEURON_CC_FLAGS="--model-type=transformer --distribution-strategy=llm-training --enable-saturate-infinity --cache_dir=/home/ubuntu/cache_dir_neuron/"

PROCESSES_PER_NODE=8

NUM_EPOCHS=1
TP_DEGREE=2
PP_DEGREE=1
BS=1
GRADIENT_ACCUMULATION_STEPS=8
LOGGING_STEPS=1
MODEL_NAME="meta-llama/Meta-Llama-3-8B"
OUTPUT_DIR=output-$SLURM_JOB_ID

if [ "$NEURON_EXTRACT_GRAPHS_ONLY" = "1" ]; then
    MAX_STEPS=$((LOGGING_STEPS + 5))
else
    MAX_STEPS=-1
fi


XLA_USE_BF16=1 neuron_parallel_compile torchrun --nproc_per_node $PROCESSES_PER_NODE docs/source/training_tutorials/sft_lora_finetune_llm.py \
  --model_id $MODEL_NAME \
  --num_train_epochs $NUM_EPOCHS \
  --do_train \
  --learning_rate 5e-5 \
  --warmup_ratio 0.03 \
  --max_steps $MAX_STEPS \
  --per_device_train_batch_size $BS \
  --per_device_eval_batch_size $BS \
  --gradient_accumulation_steps $GRADIENT_ACCUMULATION_STEPS \
  --gradient_checkpointing true \
  --bf16 \
  --zero_1 false \
  --tensor_parallel_size $TP_DEGREE \
  --pipeline_parallel_size $PP_DEGREE \
  --logging_steps $LOGGING_STEPS \
  --save_total_limit 1 \
  --output_dir $OUTPUT_DIR \
  --lr_scheduler_type "constant" \
  --overwrite_output_dir
....
+ export NEURON_FUSE_SOFTMAX=1
+ NEURON_FUSE_SOFTMAX=1
+ export NEURON_RT_ASYNC_EXEC_MAX_INFLIGHT_REQUESTS=3
+ NEURON_RT_ASYNC_EXEC_MAX_INFLIGHT_REQUESTS=3
+ export MALLOC_ARENA_MAX=64
+ MALLOC_ARENA_MAX=64
+ export 'NEURON_CC_FLAGS=--model-type=transformer --distribution-strategy=llm-training --enable-saturate-infinity --cache_dir=/home/ubuntu/cache_dir_neuron/'
+ NEURON_CC_FLAGS='--model-type=transformer --distribution-strategy=llm-training --enable-saturate-infinity --cache_dir=/home/ubuntu/cache_dir_neuron/'
+ PROCESSES_PER_NODE=8
+ NUM_EPOCHS=1
+ TP_DEGREE=2
+ PP_DEGREE=1
+ BS=1
+ GRADIENT_ACCUMULATION_STEPS=8
+ LOGGING_STEPS=1
+ MODEL_NAME=meta-llama/Meta-Llama-3-8B
+ OUTPUT_DIR=output-
+ '[' '' = 1 ']'
+ MAX_STEPS=-1
+ XLA_USE_BF16=1
+ neuron_parallel_compile torchrun --nproc_per_node 8 docs/source/training_tutorials/sft_lora_finetune_llm.py --model_id meta-llama/Meta-Llama-3-8B --num_train_epochs 1 --do_train --learning_rate 5e-5 --warmup_ratio 0.03 --max_steps -1 --per_device_train_batch_size 1 --per_device_eval_batch_size 1 --gradient_accumulation_steps 8 --gradient_checkpointing true --bf16 --zero_1 false --tensor_parallel_size 2 --pipeline_parallel_size 1 --logging_steps 1 --save_total_limit 1 --output_dir output- --lr_scheduler_type constant --overwrite_output_dir
Traceback (most recent call last):
  File "/usr/local/bin/neuron_parallel_compile", line 5, in <module>
    from optimum.neuron.utils.neuron_parallel_compile import main
  File "/usr/local/lib/python3.10/site-packages/optimum/neuron/utils/neuron_parallel_compile.py", line 8, in <module>
    from torch_neuronx.parallel_compile.neuron_parallel_compile import LOGGER as torch_neuronx_logger
ModuleNotFoundError: No module named 'torch_neuronx.parallel_compile'
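A quick way to narrow this down is to check whether the torch-neuronx build inside the DLC ships the parallel_compile submodule at all; this is a diagnostic sketch, assuming pip metadata in the image is intact (the module path appears to differ between the SDK version in this DLC and the one newer optimum-neuron expects):

pip3 show torch-neuronx
python - <<'EOF'
# Show where torch_neuronx is imported from and whether the
# parallel_compile submodule is discoverable in that installation.
import importlib.util
import torch_neuronx
print("torch_neuronx loaded from:", torch_neuronx.__file__)
spec = importlib.util.find_spec("torch_neuronx.parallel_compile")
print("torch_neuronx.parallel_compile found:", spec is not None)
EOF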

I tried to grab the neuron drivers:

echo 'deb https://apt.repos.neuron.amazonaws.com jammy main' > /etc/apt/sources.list.d/neuron.list
wget -qO - https://apt.repos.neuron.amazonaws.com/GPG-PUB-KEY-AMAZON-AWS-NEURON.PUB | apt-key add - && apt-get update
apt-get install -y aws-neuronx-collectives=2.* aws-neuronx-runtime-lib=2.* aws-neuronx-tools=2.*
echo "export PATH=/opt/aws/neuron/bin:\$PATH" >> /root/.bashrc
PATH="${PATH}:/opt/aws/neuron/bin"

and python -c "import torch_neuronx" now runs without errors, but that did not help.
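One detail worth noting: the DLC tag ends in ubuntu20.04 (focal), while the repo line above targets jammy (22.04). If the base image really is 20.04, the focal variant of the Neuron apt repo would look like this (a sketch; the remaining install commands stay the same):

echo 'deb https://apt.repos.neuron.amazonaws.com focal main' > /etc/apt/sources.list.d/neuron.list
wget -qO - https://apt.repos.neuron.amazonaws.com/GPG-PUB-KEY-AMAZON-AWS-NEURON.PUB | apt-key add - && apt-get update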

I then removed neuron_parallel_compile and got:
...
Traceback (most recent call last):
  File "/home/ubuntu/optimum-neuron/docs/source/training_tutorials/sft_lora_finetune_llm.py", line 11, in <module>
    from optimum.neuron import NeuronHfArgumentParser as HfArgumentParser
  File "<frozen importlib._bootstrap>", line 1027, in _find_and_load
  File "<frozen importlib._bootstrap>", line 1006, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 688, in _load_unlocked
  File "<frozen importlib._bootstrap_external>", line 883, in exec_module
  File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
  File "/usr/local/lib/python3.10/site-packages/optimum/neuron/__init__.py", line 18, in <module>
    from .trainers import Seq2SeqTrainiumTrainer, TrainiumTrainer
  File "/usr/local/lib/python3.10/site-packages/optimum/neuron/trainers.py", line 20, in <module>
    from transformers import Seq2SeqTrainer, Trainer
  File "<frozen importlib._bootstrap>", line 1075, in _handle_fromlist
  File "/usr/local/lib/python3.10/site-packages/transformers/utils/import_utils.py", line 1462, in __getattr__
  File "/usr/local/lib/python3.10/site-packages/transformers/trainer_seq2seq.py", line 26, in <module>
    from .trainer import Trainer
  File "/usr/local/lib/python3.10/site-packages/transformers/trainer.py", line 180, in <module>
    import torch_xla.distributed.spmd as xs
ModuleNotFoundError: No module named 'torch_xla.distributed.spmd'
...

So I tried reinstalling:

pip install torch-neuronx optimum[neuron] transformers

and still got the same ModuleNotFoundError: No module named 'torch_xla.distributed.spmd' error
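For reference, torch_xla.distributed.spmd only exists in newer torch-xla releases (roughly the 2.x line, which is an assumption here), while this DLC is built around torch/torch-xla 1.13. A short sketch to confirm which versions the reinstall actually left behind:

pip3 show torch-xla transformers | grep -E '^(Name|Version)'
python -c "import torch_xla; print(torch_xla.__version__)"
python -c "import importlib.util; print(importlib.util.find_spec('torch_xla.distributed.spmd'))"   # None means the module is absent

If torch-xla is still on the 1.13 line, the transformers version pulled in by the reinstall is likely too new for this DLC; that is a guess from the traceback above rather than something the logs confirm.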
