Error in splitting indices in distributed_sampling.py #9744

Closed
ryoji-kubo opened this issue Oct 29, 2024 · 0 comments · Fixed by #9753

🐛 Describe the bug

The example examples/multi_gpu/distributed_sampling.py, which demonstrates distributed sampling across multiple GPUs, has a bug in how it splits the training indices across ranks.

```python
# Split indices into `world_size` many chunks:
train_idx = data.train_mask.nonzero(as_tuple=False).view(-1)
train_idx = train_idx.split(train_idx.size(0) // world_size)[rank]
```

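split takes a chunk size, not a number of chunks, and this code computes that size with floor division. Whenever train_idx.size(0) is not divisible by world_size, the split can therefore produce world_size + 1 or more chunks, and the indices in the extra chunks are never assigned to any rank. To fix this, use ceiling division instead: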

```python
import math

# Split indices into `world_size` many chunks:
train_idx = data.train_mask.nonzero(as_tuple=False).view(-1)
train_idx = train_idx.split(math.ceil(train_idx.size(0) / world_size))[rank]
```

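This yields at most world_size chunks, and exactly world_size chunks in the typical case where the number of training indices is much larger than world_size.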
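
The difference between the two chunk sizes can be checked without a GPU. The sketch below uses plain Python and small, made-up numbers; `chunk_sizes` and `even_chunk_sizes` are hypothetical helpers written for this illustration (they are not part of PyTorch or PyG). `chunk_sizes` is assumed to mirror how `Tensor.split` behaves with an integer argument: full pieces of the given size, followed by one shorter piece if there is a remainder.

```python
import math
from typing import List


def chunk_sizes(n: int, split_size: int) -> List[int]:
    """Chunk lengths when n items are cut into pieces of split_size.

    Hypothetical helper mirroring Tensor.split with an integer argument:
    full pieces of split_size, then one shorter piece for any remainder.
    """
    if split_size <= 0:
        raise ValueError("split_size must be positive")
    full, rest = divmod(n, split_size)
    return [split_size] * full + ([rest] if rest else [])


def even_chunk_sizes(n: int, parts: int) -> List[int]:
    """Exactly `parts` chunk lengths that differ by at most one."""
    base, extra = divmod(n, parts)
    return [base + 1] * extra + [base] * (parts - extra)


world_size = 4

# Floor division: 10 // 4 == 2, giving five chunks, so the last
# two indices are never assigned to any rank.
floor_chunks = chunk_sizes(10, 10 // world_size)
print(floor_chunks)  # [2, 2, 2, 2, 2]

# Ceiling division: ceil(10 / 4) == 3, giving exactly four chunks.
ceil_chunks = chunk_sizes(10, math.ceil(10 / world_size))
print(ceil_chunks)  # [3, 3, 3, 1]

# Caveat: with very few indices, ceiling division can give fewer than
# world_size chunks, so indexing with the highest rank would fail.
small_chunks = chunk_sizes(5, math.ceil(5 / world_size))
print(small_chunks)  # [2, 2, 1]

# An even partition always yields exactly world_size chunks.
even_chunks = even_chunk_sizes(5, world_size)
print(even_chunks)  # [2, 1, 1, 1]
```

For realistic training sets the ceiling-division fix gives one chunk per rank. If the edge case with very few indices matters, `torch.tensor_split(train_idx, world_size)[rank]` partitions the tensor evenly into exactly `world_size` pieces; that is an alternative sketch, not the change proposed in this issue.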

Versions

Versions of relevant libraries:
[pip3] numpy==1.26.4
[pip3] torch==2.1.0
[pip3] torch_geometric==2.4.0
[pip3] torch-scatter==2.1.2+pt21cu118
[pip3] torch-sparse==0.6.18+pt21cu118
[pip3] torchaudio==2.1.0
[pip3] torchvision==0.16.0
[pip3] triton==2.1.0
[conda] blas 1.0 mkl
[conda] cuda-cudart 11.8.89 0 nvidia
[conda] cuda-cupti 11.8.87 0 nvidia
[conda] cuda-libraries 11.8.0 0 nvidia
[conda] cuda-nvrtc 11.8.89 0 nvidia
[conda] cuda-nvtx 11.8.86 0 nvidia
[conda] cuda-runtime 11.8.0 0 nvidia
[conda] cudatoolkit 11.8.0 h6a678d5_0
[conda] ffmpeg 4.3 hf484d3e_0 pytorch
[conda] libcublas 11.11.3.6 0 nvidia
[conda] libcufft 10.9.0.58 0 nvidia
[conda] libcurand 10.3.5.147 0 nvidia
[conda] libcusolver 11.4.1.48 0 nvidia
[conda] libcusparse 11.7.5.86 0 nvidia
[conda] libjpeg-turbo 2.0.0 h9bf148f_0 pytorch
[conda] mkl 2023.1.0 h213fc3f_46344
[conda] mkl-service 2.4.0 py310h5eee18b_1
[conda] mkl_fft 1.3.10 py310h5eee18b_0
[conda] mkl_random 1.2.7 py310h1128e8f_0
[conda] numpy 1.26.4 pypi_0 pypi
[conda] pytorch 2.1.0 py3.10_cuda11.8_cudnn8.7.0_0 pytorch
[conda] pytorch-cuda 11.8 h7e8668a_6 pytorch
[conda] pytorch-mutex 1.0 cuda pytorch
[conda] torch-geometric 2.4.0 pypi_0 pypi
[conda] torch-scatter 2.1.2+pt21cu118 pypi_0 pypi
[conda] torch-sparse 0.6.18+pt21cu118 pypi_0 pypi
[conda] torchaudio 2.1.0 py310_cu118 pytorch
[conda] torchtriton 2.1.0 py310 pytorch
[conda] torchvision 0.16.0 py310_cu118 pytorch
