
[MNT] MPS backend test failures on MacOS #1596

Closed
fkiraly opened this issue Aug 22, 2024 · 8 comments · Fixed by #1599 or #1648
Labels
maintenance Continuous integration, unit testing & package distribution

Comments

@fkiraly
Collaborator

fkiraly commented Aug 22, 2024

The CI fails with MPS backend failures on a number of tests:

```
RuntimeError: MPS backend out of memory (MPS allocated: 0 bytes, other allocations: 0 bytes, max allowed: 7.93 GB). Tried to allocate 256 bytes on shared pool. Use PYTORCH_MPS_HIGH_WATERMARK_RATIO=0.0 to disable upper limit for memory allocations (may cause system failure).
[W 2024-08-22 20:58:07,168] Trial 0 failed with value None.
```
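The error message itself points at a possible workaround. A minimal sketch, assuming the variable is set before torch makes its first MPS allocation (e.g. at the very top of the test session); note the error text itself warns this may cause system failure:

```python
import os

# Disable the MPS allocator's upper memory limit, as suggested by the
# error message. Must be set before the first MPS allocation happens.
os.environ["PYTORCH_MPS_HIGH_WATERMARK_RATIO"] = "0.0"
```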
@fkiraly fkiraly added the maintenance Continuous integration, unit testing & package distribution label Aug 22, 2024
@fkiraly fkiraly changed the title [MPS] MPS backend test failures [MNT] MPS backend test failures Aug 22, 2024
@fkiraly fkiraly changed the title [MNT] MPS backend test failures [MNT] MPS backend test failures on MacOS Aug 23, 2024
@fkiraly
Collaborator Author

fkiraly commented Aug 23, 2024

Update: these failures seem to happen only on macOS.

fkiraly pushed a commit that referenced this issue Aug 25, 2024
…ests`, MacOS MPS

Fixes #1594, fixes #1595, fixes #1596

Added or moved some dependencies to the core dependency set.

Fixed some `numpy2` and `optuna-integrations` problems.

`requests` replaced by `urllib.request.urlretrieve`.
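The `requests` replacement mentioned above can be a thin wrapper around the standard library. A hypothetical sketch (the `fetch` name is illustrative, not the actual helper in the PR):

```python
from urllib.request import urlretrieve


def fetch(url: str, dest: str) -> str:
    """Download `url` to `dest` using only the standard library,
    avoiding the third-party `requests` dependency."""
    path, _headers = urlretrieve(url, dest)
    return path
```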
@fkiraly
Collaborator Author

fkiraly commented Sep 3, 2024

The failures happen on macos-latest but not on macos-13. A temporary fix that @XinyuWuu discovered is to pin the Mac runners to macos-13.

@fkiraly
Collaborator Author

fkiraly commented Sep 3, 2024

The failure on macos-latest is here: #1633

@fnhirwa
Member

fnhirwa commented Sep 4, 2024

This is an issue related to GPU device handling on macOS. I will open a PR fixing this generally.

@benHeid
Collaborator

benHeid commented Sep 4, 2024

MPS errors can only happen on macOS, since MPS (Metal Performance Shaders) is the GPU backend of the Apple M-series chips, as @fnhirwa said.

This is probably caused by large neural networks running in parallel on the GPU. A fix could be to set the device to cpu for all tests (the bottleneck then becomes ordinary RAM), or alternatively to reduce the memory footprint of the test models.

A more complicated solution could be to check whether we can control the parallel execution of tests, so that no neural network runs in parallel with another one, only with simpler models.
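The "set the device to cpu for all tests" option could look roughly like this. A hypothetical helper (`pick_accelerator` is not an existing function in the code base) that the tests would pass to the Lightning `Trainer`:

```python
import os


def pick_accelerator() -> str:
    # Force CPU in CI so no test ever touches the MPS backend;
    # locally, let Lightning choose the best available device.
    if os.environ.get("CI"):
        return "cpu"
    return "auto"
```

Usage would be e.g. `Trainer(accelerator=pick_accelerator(), ...)` in the test fixtures.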

@fnhirwa
Member

fnhirwa commented Sep 4, 2024

Given that this is a resource issue, PyTorch provides an environment variable, PYTORCH_ENABLE_MPS_FALLBACK=1, that falls back to the CPU when the MPS device cannot handle an operation. We can use the monkeypatch fixture to set this variable in the tests.

I am adding the changes to #1648 to see if it works.
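The monkeypatch idea boils down to setting the variable before any MPS op is dispatched. A stdlib-only sketch of the same effect (in the actual test suite this would live in a pytest fixture using `monkeypatch.setenv`, as described above):

```python
import os


def enable_mps_fallback() -> None:
    # Ops without an MPS kernel then fall back to the CPU instead of
    # raising; must run before torch dispatches the first MPS op.
    os.environ.setdefault("PYTORCH_ENABLE_MPS_FALLBACK", "1")


enable_mps_fallback()
```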

@XinyuWuu
Member

XinyuWuu commented Sep 5, 2024

> A more complicated solution could be to check whether we can control the parallel execution of tests, so that no neural network runs in parallel with another one, only with simpler models.

We can do it by using a filelock as a fixture. I have tried it in sktime/sktime#6774.
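A stdlib-only sketch of such a lock (the real sktime PR uses the `filelock` package; this hypothetical version spins on an atomically created lock file instead):

```python
import os
import time
from contextlib import contextmanager


@contextmanager
def heavy_model_lock(path="heavy_model.lock", poll=0.1):
    # Acquire: O_CREAT | O_EXCL creates the file atomically, so only
    # one test process at a time can hold the lock.
    while True:
        try:
            fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            break
        except FileExistsError:
            time.sleep(poll)
    try:
        yield
    finally:
        # Release: close and remove the lock file so waiters proceed.
        os.close(fd)
        os.remove(path)
```

Each large-network test would then wrap its body in `with heavy_model_lock(): ...`, typically via a fixture.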

@XinyuWuu
Member

XinyuWuu commented Sep 5, 2024

It's caused by the lack of nested-virtualization support on arm64 macOS runners:
https://docs.github.com/en/actions/using-github-hosted-runners/using-github-hosted-runners/about-github-hosted-runners#limitations-for-arm64-macos-runners
actions/runner-images#9254 (comment)
actions/runner-images#9918

My tests:
https://github.com/jdb78/pytorch-forecasting/actions/runs/10714861256/job/29709292560?pr=1654
https://github.com/jdb78/pytorch-forecasting/actions/runs/10714737635/job/29708945795?pr=1654

`torch.backends.mps.is_available()` returns `true` on macos-latest, but it should return `false`.

@fnhirwa I am afraid `PYTORCH_ENABLE_MPS_FALLBACK` won't help. It enables a CPU fallback for individual operators such as `aten::_slow_conv2d_forward`, but in our case MPS is totally unusable.

We need to find a way to make `torch.backends.mps.is_available()` return `false` on macos-latest.

Unfortunately, we do not have something like `CUDA_VISIBLE_DEVICES` for MPS.
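Making `torch.backends.mps.is_available()` report `False` could be done by patching the function in a `conftest.py` fixture. A torch-free sketch of the pattern, using a stand-in namespace for `torch.backends.mps` (the real fix would patch the actual module the same way, e.g. via pytest's `monkeypatch.setattr`):

```python
from types import SimpleNamespace

# Stand-in for torch.backends.mps; torch itself is deliberately not
# imported here so the sketch stays self-contained.
mps = SimpleNamespace(is_available=lambda: True)


def disable_mps(backend) -> None:
    # Overwrite is_available so any availability check sees no device.
    backend.is_available = lambda: False


disable_mps(mps)
```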

4 participants