Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use AWS Neuron sdk 2.18 #547

Merged
merged 8 commits into from
Apr 8, 2024
Merged

Use AWS Neuron sdk 2.18 #547

merged 8 commits into from
Apr 8, 2024

Conversation

dacorvo
Copy link
Collaborator

@dacorvo dacorvo commented Apr 2, 2024

What does this PR do?

This pull-request bumps the AWS Neuron SDK version to 2.18.

It also bumps the TGI router version to 1.4.4 to fix build issues due to underlying rust packages updates.

To update your local host, do the following:

$ sudo apt update
$ sudo apt install -u aws-neuronx-dkms aws-neuronx-runtime-lib aws-neuronx-collectives aws-neuronx-tools
$ pip install -U neuronx-cc torch-neuronx==1.13.* transformers-neuronx neuronx-distributed

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@dacorvo
Copy link
Collaborator Author

dacorvo commented Apr 2, 2024

@JingyaHuang the neuronx SD cache test is failing.

@dacorvo
Copy link
Collaborator Author

dacorvo commented Apr 2, 2024

@michaelbenayoun some trainium tests are failing. It may be related to changes in the way neuronx-distributed loads weights (safetensors related error messages).

@dacorvo dacorvo force-pushed the aws_neuron_sdk_2.18 branch 2 times, most recently from 74d9cba to e5a72b5 Compare April 3, 2024 07:13
@dacorvo
Copy link
Collaborator Author

dacorvo commented Apr 5, 2024

@michaelbenayoun there is a newly failing distributed test with AWS 2.18, probably after your latest changes.

@michaelbenayoun
Copy link
Member

It's weird that it fails.
The error is:

2024-04-04T16:56:03.5030743Z self = <tests.distributed.test_model_parallelization.TestModelParallelization object at 0x7f8211855c40>
2024-04-04T16:56:03.5031331Z 
2024-04-04T16:56:03.5031567Z     def _terminate_xrt_server(self):
2024-04-04T16:56:03.5032077Z         xrt_server_str = "torch_neuronx.distributed._xrt_run_server"
2024-04-04T16:56:03.5032964Z         startmethod = mp.get_start_method(allow_none=True)
2024-04-04T16:56:03.5033499Z         # Rules:
2024-04-04T16:56:03.5034194Z         # - `startmethod is None`: the XRT server tracks pytest's PID.
2024-04-04T16:56:03.5035254Z         # - `startmethod="spawn"`: the parent process of the pool's processes is pytest, so the XRT server tracks
2024-04-04T16:56:03.5036140Z         # pytest's PID.
2024-04-04T16:56:03.5036736Z         # - `startmethod="fork"`: same as `startmethod="spawn"`.
2024-04-04T16:56:03.5037740Z         # - `startmethod="forkserver"`: the parent process of the pool's processes is the forkserver, so the XRT server tracks
2024-04-04T16:56:03.5038751Z         # the forkserver's PID.
2024-04-04T16:56:03.5039261Z         if startmethod == "forkserver":
2024-04-04T16:56:03.5039889Z             target_pid = multiprocessing.forkserver._forkserver._forkserver_pid
2024-04-04T16:56:03.5040480Z         else:
2024-04-04T16:56:03.5040812Z             target_pid = os.getpid()
2024-04-04T16:56:03.5041254Z     
2024-04-04T16:56:03.5041654Z         for p in psutil.process_iter():
2024-04-04T16:56:03.5042067Z             try:
2024-04-04T16:56:03.5042558Z                 if "python3" in p.name() and len(p.cmdline()) == 7:
2024-04-04T16:56:03.5043330Z                     cmdline = p.cmdline()
2024-04-04T16:56:03.5044098Z >                   if cmdline[2] == xrt_server_str and cmdline[-1] == str(target_pid):
2024-04-04T16:56:03.5044865Z E                   IndexError: list index out of range
2024-04-04T16:56:03.5045282Z 

It might be linked to torch_neuronx not relying on XRT now. But why did it fail after so many tests. Let me check with the Annapurna team.

@dacorvo dacorvo marked this pull request as ready for review April 8, 2024 07:04
@dacorvo
Copy link
Collaborator Author

dacorvo commented Apr 8, 2024

No regression found but the SDXL separate weights: let's merge this !

Copy link
Collaborator

@JingyaHuang JingyaHuang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, let's merge it! Thanks for taking care of the upgrade.

@dacorvo dacorvo merged commit 09ddd67 into main Apr 8, 2024
14 checks passed
@dacorvo dacorvo deleted the aws_neuron_sdk_2.18 branch April 8, 2024 07:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants