
AMD GPU Support via HIP #4

Closed
pierotofy opened this issue Feb 16, 2024 · 13 comments
Labels
enhancement New feature or request

Comments

@pierotofy
Owner

It should be possible to port the CUDA code to HIP and run this on AMD GPUs as well.

@pierotofy pierotofy added the enhancement New feature or request label Feb 16, 2024
@pfxuan
Collaborator

pfxuan commented Mar 1, 2024

@pierotofy Do you have a high-level suggestion for how to port the CUDA code to AMD HIP? I'm planning to give it a quick try with ZLUDA. Any thoughts?

@pierotofy
Owner Author

I personally would start from https://rocmdocs.amd.com/projects/HIPIFY/en/latest/ and https://rocm.docs.amd.com/projects/HIP/en/latest/

ZLUDA seems interesting too, but I haven't tried it.

@pierotofy
Owner Author

Quick update: HIP is missing some functions that are used in gsplat's CUDA code. This will require some manual changes as an automatic port does not seem feasible.

@pfxuan
Collaborator

pfxuan commented Mar 6, 2024

hipify-clang seems more robust than hipify-perl for porting CUDA code. I was able to run through these 3 CUDA code translations after carefully tweaking the CUDA, ROCm, and clang/LLVM versions:

os: [ubuntu-22.04]
torch-version: [2.2.1]
cuda-version: [11.8.0]
rocm-version: [5.6.1]
llvm-version: [16]

[HIPIFY] info: file './vendor/gsplat/backward.cu' statistics:
  CONVERTED refs count: 24
  UNCONVERTED refs count: 0
  CONVERSION %: 100.0
  REPLACED bytes: 22
  TOTAL bytes: 18934
  CHANGED lines of code: 2
  TOTAL lines of code: 502
  CODE CHANGED (in bytes) %: 0.1
  CODE CHANGED (in lines) %: 0.4
  TIME ELAPSED s: 1.63
[HIPIFY] info: CONVERTED refs by type:
  device_function: 23
  include: 1
[HIPIFY] info: CONVERTED refs by API:
  CUDA RT API: 24
[HIPIFY] info: CONVERTED refs by names:
  __expf: 2
  atomicAdd: 16
  cooperative_groups.h: 1
  max: 1
  min: 4

[HIPIFY] info: file './vendor/gsplat/bindings.cu' statistics:
  CONVERTED refs count: 4
  UNCONVERTED refs count: 0
  CONVERSION %: 100.0
  REPLACED bytes: 75
  TOTAL bytes: 20985
  CHANGED lines of code: 4
  TOTAL lines of code: 624
  CODE CHANGED (in bytes) %: 0.4
  CODE CHANGED (in lines) %: 0.6
  TIME ELAPSED s: 14.87
[HIPIFY] info: CONVERTED refs by type:
  include: 2
  include_cuda_main_header: 2
[HIPIFY] info: CONVERTED refs by API:
  CUDA Driver API: 1
  CUDA RT API: 3
[HIPIFY] info: CONVERTED refs by names:
  cooperative_groups.h: 1
  cuda.h: 1
  cuda_runtime.h: 1
  cuda_runtime_api.h: 1
  
[HIPIFY] info: file './vendor/gsplat/forward.cu' statistics:
  CONVERTED refs count: 7
  UNCONVERTED refs count: 0
  CONVERSION %: 100.0
  REPLACED bytes: 22
  TOTAL bytes: 16216
  CHANGED lines of code: 2
  TOTAL lines of code: 463
  CODE CHANGED (in bytes) %: 0.1
  CODE CHANGED (in lines) %: 0.4
  TIME ELAPSED s: 0.90
[HIPIFY] info: CONVERTED refs by type:
  device_function: 6
  include: 1
[HIPIFY] info: CONVERTED refs by API:
  CUDA RT API: 7
[HIPIFY] info: CONVERTED refs by names:
  __expf: 2
  __syncthreads_count: 1
  cooperative_groups.h: 1
  min: 3
[HIPIFY] info: file 'GLOBAL' statistics:
  CONVERTED refs count: 35
  UNCONVERTED refs count: 0
  CONVERSION %: 100.0
  REPLACED bytes: 119
  TOTAL bytes: 56135
  CHANGED lines of code: 8
  TOTAL lines of code: 1589
  CODE CHANGED (in bytes) %: 0.2
  CODE CHANGED (in lines) %: 0.5
  TIME ELAPSED s: 17.39
[HIPIFY] info: CONVERTED refs by type:
  device_function: 29
  include: 4
  include_cuda_main_header: 2
[HIPIFY] info: CONVERTED refs by API:
  CUDA Driver API: 1
  CUDA RT API: 34
[HIPIFY] info: CONVERTED refs by names:
  __expf: 4
  __syncthreads_count: 1
  atomicAdd: 16
  cooperative_groups.h: 3
  cuda.h: 1
  cuda_runtime.h: 1
  cuda_runtime_api.h: 1
  max: 1
  min: 7
[HIPIFY] info: TOTAL statistics:
  CONVERTED files: 3
  PROCESSED files: 3

Let's see if we can compile them into an executable. To further test compatibility and performance, I'll probably need to find a way to hook a 2nd GPU (AMD Radeon RX 6700 XT) into my mini desktop.

@pierotofy
Owner Author

pierotofy commented Mar 6, 2024

I had trouble with the functions in cooperative_groups/reduce.h, which don't seem to have an equivalent in HIP (though I used hipify-perl). I managed to compile it after making some manual changes, but hit some crashes that I have yet to investigate.

Super experimental, proof of concept branch https://github.com/pierotofy/OpenSplat/tree/hippoc

@pfxuan
Collaborator

pfxuan commented Mar 7, 2024

Unfortunately, hipify-clang generates very similar code to hipify-perl. It seems most functions can be directly mapped to HIP except cg::reduce() and cg::this_thread_block(). There should be a way to map them manually:

auto block = hipThreadIdx_x + hipBlockDim_x * hipBlockIdx_x;
int32_t tile_id =
    hipBlockIdx_y * tile_bounds.x + hipBlockIdx_x;
unsigned i =
    hipBlockIdx_y * hipBlockDim_y + hipThreadIdx_y;
unsigned j =
    hipBlockIdx_x * hipBlockDim_x + hipThreadIdx_x;

inline __device__ void warpSum(float& val, const int& tile){
    for (int offset = tile / 2; offset > 0; offset /= 2) {
        // HIP has no __shfl_down_sync; use the unmasked __shfl_down instead
        val += __shfl_down(val, offset);
    }
}

@pfxuan
Collaborator

pfxuan commented Mar 7, 2024

BTW, ROCm HIP build pipeline is ready. What's the best way to integrate it with your working branch (hippoc)?

@pfxuan
Collaborator

pfxuan commented Mar 10, 2024

@pierotofy Based on your current work (hippoc), I created a new PR#36 to consolidate our updates. It would be great to run an actual test. On my end, it might take some time to find a way to hook an extra AMD GPU into my desktop: it already has an Nvidia 3080 Ti up and running, and chassis space is very tight right now.

BTW, if you can find some AMD/Nvidia GPU resources, building a self-hosted GPU runner could be a good option for daily build & benchmark tasks.

@pierotofy
Owner Author

I don't have an AMD GPU either, but I was thinking of using https://github.com/ROCm/HIP-CPU to test.

@pfxuan
Collaborator

pfxuan commented Mar 10, 2024

It's a pretty interesting project and could become useful for the Mac port. Thanks for pointing it out.
If we want to benchmark OpenSplat performance on datacenter GPUs, the AMD Accelerator Cloud (AAC) might be able to provide the required resources: https://github.com/amddcgpuce/AMDAcceleratorCloudGuides

@pierotofy
Owner Author

pierotofy commented Mar 11, 2024

Found that HIP-CPU does not ship the cooperative_groups.h header, so the entire cg namespace is missing. That's problematic, since the gsplat implementation references it heavily.

For CPU support I would either have to refactor gsplat to not use cooperative groups, or rewrite the entire thing for the CPU (or wait and hope for HIP-CPU support: ROCm/HIP-CPU#60)

@pfxuan
Collaborator

pfxuan commented Mar 16, 2024

Finally swapped the Nvidia card for the AMD GPU, almost ready for the battle test 😅
[Photos attached: IMG_6239, IMG_6240, IMG_6243]

@pfxuan
Collaborator

pfxuan commented Mar 17, 2024

AMD GPU is now working 🚀
