
Add MPS support with fused kernels #76

Merged: 20 commits into pierotofy:main on Apr 15, 2024

Conversation

andrewkchan
Contributor

This PR adds support for GPU acceleration via the MPS backend on macOS, per #60.

  • I ported the custom gsplat PyTorch ops with fused kernels for Gaussian projection, rasterization, etc. to Metal Performance Shaders (see the sketch after this list).
  • I tested the app compiled with MPS support on the banana example. It works!
  • I also tested using the unit tests in the gsplat repo (I will probably upstream these changes at some point). The backward-pass Jacobians for some ops are sometimes slightly off from the pure PyTorch Jacobians, but it doesn't seem to make a difference when testing with real examples.
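
For anyone curious about the wrapping pattern, here is a minimal sketch of how a fused op plugs into autograd in C++ (the op name, placeholder math, and helper functions below are illustrative, not taken from the PR):

#include <torch/torch.h>

// Hypothetical stand-ins for the Metal-backed kernels; the real ops launch
// compute pipelines built from the ported .metal sources.
torch::Tensor project_gaussians_forward_mps(const torch::Tensor &means) {
    return means * 2;  // placeholder math only
}
torch::Tensor project_gaussians_backward_mps(const torch::Tensor &grad_out) {
    return grad_out * 2;  // gradient of the placeholder math
}

// Custom autograd function: forward and backward each call one fused kernel.
struct ProjectGaussians : public torch::autograd::Function<ProjectGaussians> {
    static torch::Tensor forward(torch::autograd::AutogradContext *ctx,
                                 const torch::Tensor &means) {
        // save inputs needed by the backward kernel (unused by the placeholder math)
        ctx->save_for_backward({means});
        return project_gaussians_forward_mps(means);
    }
    static torch::autograd::tensor_list backward(torch::autograd::AutogradContext *ctx,
                                                 torch::autograd::tensor_list grad_outputs) {
        return {project_gaussians_backward_mps(grad_outputs[0])};
    }
};

// Usage: ProjectGaussians::apply(means) participates in the autograd graph
// like any built-in op.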

Here's the speedup on my M3 Pro with macOS Sonoma 14.3. Wall-clock time goes from over 5 minutes to about 6 seconds!

GPU

time ./opensplat ~/Downloads/banana/ -n 100
./opensplat ~/Downloads/banana/ -n 100  3.84s user 1.54s system 90% cpu 5.920 total

CPU

time ./opensplat ~/Downloads/banana/ -n 100 --cpu
./opensplat ~/Downloads/banana/ -n 100 --cpu  647.85s user 93.12s system 227% cpu 5:26.11 total

Some implementation notes:

  • I added some logs that I found useful while developing, indicating when resources are loaded and printing errors. I couldn't find any established conventions for different log levels (debug, info, warn, error, etc.), so right now everything is visible to the user.
  • Global resources like compute pipeline state objects and command queues are allocated lazily and then intentionally leaked for the lifetime of the program (see the sketch below). I think this is OK because we're only allocating a very small, fixed number of these, and they'll be cleaned up when the application exits.
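
A minimal sketch of that lazy-initialization pattern (the struct contents and the get_global_context accessor are simplified illustrations; in the PR the actual setup happens in init_gsplat_metal_context):

// Simplified stand-in for the real context object:
struct GSplatMetalContext {
    // in the real code: the Metal device, command queue, compiled library,
    // and one compute pipeline state per kernel
};

GSplatMetalContext *init_gsplat_metal_context() {
    // the real init loads the Metal library and builds the pipeline states
    return new GSplatMetalContext();
}

GSplatMetalContext &get_global_context() {
    // Allocated lazily on first use and intentionally never freed: only a small,
    // fixed number of these objects exist, and the OS reclaims them at process exit.
    static GSplatMetalContext *ctx = init_gsplat_metal_context();
    return *ctx;
}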

For future reference, it was very useful to generate an Xcode project from the CMakeLists.txt while debugging this, since Xcode provides some nice GPU tools.
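
For example, an Xcode project can be generated with an invocation along these lines (paths are illustrative):

cmake -G Xcode -DCMAKE_PREFIX_PATH=/path/to/libtorch -DGPU_RUNTIME=MPS -DCMAKE_BUILD_TYPE=Release ..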

@andrewkchan mentioned this pull request on Apr 14, 2024
@pfxuan
Collaborator

pfxuan commented Apr 14, 2024

@andrewkchan Thank you for your fantastic work in making the MPS backend finally work! I did a quick test of this PR and found it works almost flawlessly on my laptop. My test setup is based on a customized M2 Max (30 of the 38 GPU cores).

CMake and Xcode build:

cmake -G Xcode -DCMAKE_PREFIX_PATH=../../libtorch -DGPU_RUNTIME=MPS -DCMAKE_BUILD_TYPE=Release -DOPENSPLAT_BUILD_SIMPLE_TRAINER=ON  ../

./opensplat ~/Data/banana -n 100  3.76s user 1.75s system 100% cpu 5.465 total

CMake and make build:

cmake -DCMAKE_PREFIX_PATH=../../libtorch -DGPU_RUNTIME=MPS -DCMAKE_BUILD_TYPE=Release -DOPENSPLAT_BUILD_SIMPLE_TRAINER=ON  ../

./opensplat ~/Data/banana -n 100  4.05s user 1.78s system 101% cpu 5.741 total

This PR is perhaps the first to make it possible to natively run a 3DGS workload on Apple silicon. I'm super excited about this new feature and expect it will have a real impact in this area soon 🚀

@josephldobson

Great work! Very excited to start using this :)

@pierotofy
Owner

pierotofy commented Apr 14, 2024

Agree! Very exciting @andrewkchan, thanks for the PR. I can't wait to run this on my M1 🙏

I will test/review this sometime today or tomorrow.

Quick question: I noticed you made some changes in RasterizeGaussiansCPU and changed size_t pixIdx = (i * width + j); did you find any issues with the CPU implementation while working on this?

@pierotofy
Owner

Currently getting this error while trying to run:

...
init_gsplat_metal_context: load function rasterize_backward_kernel with label: (null)
init_gsplat_metal_context: error: load pipeline error: Error Domain=AGXMetal13_3 Code=3 "Compiler encountered an internal error" UserInfo={NSLocalizedDescription=Compiler encountered an internal error}

Also noticed that the definition for rasterize_forward_kernel_cpso might be missing?

@pfxuan
Collaborator

pfxuan commented Apr 14, 2024

The error might be related to your macOS version (13.3 vs. 14.4). I was able to get through the Metal compile with the latest 14.4.1. Maybe we should consider tweaking CMake like ggerganov/llama.cpp#6370 did.

Also, it seems like there is a memory leak on Metal, and I'm trying to resolve it now.

@pierotofy
Owner

Yep, I'm on 13.2. I'll try a few things, see if I can get it to compile.

Also, saw that we use nd_rasterize_forward_kernel_cpso for the rasterize forward pass, so that explains why rasterize_forward_kernel_cpso is not included. Makes sense.

@andrewkchan
Contributor Author

andrewkchan commented Apr 14, 2024

Quick question: I noticed you made some changes in RasterizeGaussiansCPU and changed size_t pixIdx = (i * width + j); did you find any issues with the CPU implementation while working on this?

Hmm, I don't remember changing this code. It looks like it came from the experimental commit that I based my changes on, 472a45a.

Also, saw that we use nd_rasterize_forward_kernel_cpso for the rasterize forward pass, so that explains why rasterize_forward_kernel_cpso is not included. Makes sense.

Yeah, I accidentally ported over the ND rasterize function instead of the plain rasterize, but then decided to just use that. It's possible this was causing the slight numerical differences in the unit tests. Happy to port over the rasterize_forward_kernel if needed.

Also, it seems like there is a memory leaking problem on metal and I'm trying to resolve it now.

Curious what problem you are running into, since as noted in the OP I'm intentionally leaking some resources forever.

@pfxuan
Collaborator

pfxuan commented Apr 14, 2024

The memory consumption keeps growing as the training iteration count increases, e.g. 2000 iters accumulate 12+ GB of GPU memory.

I added emptyCache() for the MPS device, but it seems like there is no visible improvement.

// include added alongside the existing HIP/CUDA headers:
#elif defined(USE_MPS)
#include <torch/mps.h>

        // Release cached allocator memory on whichever GPU backend is active:
        if (device != torch::kCPU){
            #ifdef USE_HIP
                    c10::hip::HIPCachingAllocator::emptyCache();
            #elif defined(USE_CUDA)
                    c10::cuda::CUDACachingAllocator::emptyCache();
            #elif defined(USE_MPS)
                    at::detail::getMPSHooks().emptyCache();
            #endif
        }

Also, adding or enforcing @autoreleasepool across all methods in gsplat_metal.mm hasn't helped so far. The root cause is probably linked to the deep copy of the MPS tensor:
[screenshot]

@pierotofy
Owner

Happy to port over the rasterize_forward_kernel if needed.

No need, but could certainly be done as part of another PR.

I'm still trying to compile this on 13.2; I've isolated the problem to a call to simd_shuffle_and_fill_down, but it's strange since my machine should have Metal 3 and that function has been available since 2.4. Investigating.

@andrewkchan
Contributor Author

What is the expected memory usage for something like the banana example? It's not great if there is a leak, but I'm not able to find anything using the Xcode "Leaks" tool except for two allocations of 128-byte objects. And I thought that memory usage is generally expected to increase over training because the number of Gaussians will increase with scheduled splits.

@pfxuan
Collaborator

pfxuan commented Apr 14, 2024

We can use this table as a baseline: #3 (comment)

@pfxuan
Collaborator

pfxuan commented Apr 14, 2024

For 2000 iters, the combined memory consumption of the CUDA version is around 5.5 GB (CPU: 4.1 GB, GPU: 1.4 GB):
[screenshot]

@andrewkchan
Contributor Author

andrewkchan commented Apr 14, 2024

Here's what I'm seeing around 1900 iters on the banana (8.7GB):
[screenshot]

I wasn't aware of the at::detail::getMPSHooks() you suggested. Curious where you found that! Interestingly, I tried printing at::detail::getMPSHooks().getCurrentAllocatedMemory() and at::detail::getMPSHooks().getDriverAllocatedMemory() and the numbers are way off. I wonder if that interface is not managing what we want.
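
For reference, a minimal sketch of printing those counters (the header path and the assumption that values are reported in bytes are mine, not from this PR):

#include <ATen/detail/MPSHooksInterface.h>
#include <iostream>

// Print the two MPS allocator counters mentioned above (presumably in bytes).
void print_mps_memory_stats() {
    const auto &hooks = at::detail::getMPSHooks();
    std::cout << "current allocated: " << hooks.getCurrentAllocatedMemory() << "\n"
              << "driver allocated:  " << hooks.getDriverAllocatedMemory() << std::endl;
}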

@pierotofy
Owner

pierotofy commented Apr 14, 2024

I ended up upgrading to 14.4 and it now runs 🥳

I think there might be something off with the rasterize forward pass, however. This is the result of the Metal renderer after 100 iters on the banana dataset (you can even do just 10 iters):

[MPS render at 100 iters]

./opensplat -n 100 --val-render render ./banana
Step 100: 0.326634

Compared to the CPU run:

[CPU render at 100 iters]

./opensplat -n 100 --val-render render ./banana --cpu
Step 100: 0.188744

Looks like a width/height mismatch. I had these issues when writing the CPU rasterizer; I recommend testing with the simple_trainer app using different --width and --height parameters (by default they are equal), e.g. ./simple_trainer --width 256 --height 128 --render render


@andrewkchan
Contributor Author

Nice catch! You are totally right. Fixed, and the loss is much lower after 100 iters now: 0.15966 vs. 0.286877 for me.

@pfxuan
Collaborator

pfxuan commented Apr 14, 2024

Here's what I'm seeing around 1900 iters on the banana (8.7GB): [screenshot]

I wasn't aware of the at::detail::getMPSHooks() you suggested. Curious where you found that! Interestingly, I tried printing at::detail::getMPSHooks().getCurrentAllocatedMemory() and at::detail::getMPSHooks().getDriverAllocatedMemory() and the numbers are way off. I wonder if that interface is not managing what we want.

I think the combined memory size we're seeing from the MPS backend is likely within the correct range. With emptyCache(), I was able to achieve a slightly lower memory footprint (7.8 GB vs. 8.7 GB). The latest PyTorch doesn't provide a corresponding C++ API (such as c10::mps::MPSCachingAllocator::emptyCache()) to explicitly release the MPS cache. The current workaround (at::detail::getMPSHooks().emptyCache()) was discovered from this Python API, and I'm still uncertain whether it's truly effective.

[screenshot]

@pfxuan
Collaborator

pfxuan commented Apr 14, 2024

Never mind, the memory footprint bumped to 9.1 GB (at 2k iters) after changing the iter num (3k to 30k):

./opensplat ~/Data/banana -n 30000  228.13s user 12.47s system 81% cpu 4:53.68 total
[screenshot]

@pierotofy
Owner

pierotofy commented Apr 15, 2024

Nice! This is looking pretty amazing and can be merged. Thanks for everyone's help. 👍

Memory improvements, as well as a possible port of the rasterize_forward_kernel_cpso function, can be done as part of a separate contribution. About the latter, I noticed that the CPU rasterizer tends to converge a bit faster on the banana dataset (with downscale 2, loss ~0.15 after 100 iters vs. 0.17); not sure if it's just random, but it might be worth investigating.

@pierotofy merged commit 373b337 into pierotofy:main on Apr 15, 2024
29 checks passed
@pfxuan
Collaborator

pfxuan commented Apr 15, 2024

Great! I'll enable the MPS backend CI build via another PR. From my latest test, the combined memory footprint reaches 30.6 GB at step 8800:

./opensplat ~/Data/banana -n 30000  2923.88s user 104.37s system 73% cpu 1:08:21.08 total
[screenshot]
