Docker build on A100 GPU libtorch CUDA error #90

Closed
embercult opened this issue May 4, 2024 · 4 comments

embercult commented May 4, 2024

root@f629ddabf1f2:/code/build# ./opensplat /data/hulk/
Using CUDA
Reading 12149 points
Loading /data/hulk/images/0039.jpg
Loading /data/hulk/images/0009.jpg
Loading /data/hulk/images/0035.jpg
Loading /data/hulk/images/0001.jpg
Loading /data/hulk/images/0007.jpg
Loading /data/hulk/images/0005.jpg
Loading /data/hulk/images/0021.jpg
Loading /data/hulk/images/0025.jpg
Loading /data/hulk/images/0033.jpg
Loading /data/hulk/images/0019.jpg
Loading /data/hulk/images/0029.jpg
Loading /data/hulk/images/0015.jpg
Loading /data/hulk/images/0003.jpg
Loading /data/hulk/images/0037.jpg
Loading /data/hulk/images/0017.jpg
Loading /data/hulk/images/0023.jpg
Loading /data/hulk/images/0031.jpg
Loading /data/hulk/images/0011.jpg
Loading /data/hulk/images/0013.jpg
Loading /data/hulk/images/0027.jpg
Loading /data/hulk/images/0038.jpg
Loading /data/hulk/images/0022.jpg
Loading /data/hulk/images/0018.jpg
Loading /data/hulk/images/0002.jpg
Loading /data/hulk/images/0024.jpg
Loading /data/hulk/images/0008.jpg
Loading /data/hulk/images/0006.jpg
Loading /data/hulk/images/0014.jpg
Loading /data/hulk/images/0016.jpg
Loading /data/hulk/images/0036.jpg
Loading /data/hulk/images/0010.jpg
Loading /data/hulk/images/0030.jpg
Loading /data/hulk/images/0026.jpg
Loading /data/hulk/images/0032.jpg
Loading /data/hulk/images/0004.jpg
Loading /data/hulk/images/0020.jpg
Loading /data/hulk/images/0028.jpg
Loading /data/hulk/images/0034.jpg
Loading /data/hulk/images/0012.jpg
CUDA error: no kernel image is available for execution on the device
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

Exception raised from c10_cuda_check_implementation at ../c10/cuda/CUDAException.cpp:44 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0x6c (0x7fbae7f03a0c in /code/libtorch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xfa (0x7fbae7ead8bc in /code/libtorch/lib/libc10.so)
frame #2: c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) + 0x3cc (0x7fbae7b1001c in /code/libtorch/lib/libc10_cuda.so)
frame #3: void at::native::gpu_kernel_impl<__nv_hdl_wrapper_t<false, true, false, __nv_dl_tag<void (*)(at::TensorIteratorBase&), &at::native::direct_copy_kernel_cuda, 12u>, long (long)> >(at::TensorIteratorBase&, __nv_hdl_wrapper_t<false, true, false, __nv_dl_tag<void (*)(at::TensorIteratorBase&), &at::native::direct_copy_kernel_cuda, 12u>, long (long)> const&) + 0x4bf (0x7fba7db8834f in /code/libtorch/lib/libtorch_cuda.so)
frame #4: void at::native::gpu_kernel<__nv_hdl_wrapper_t<false, true, false, __nv_dl_tag<void (*)(at::TensorIteratorBase&), &at::native::direct_copy_kernel_cuda, 12u>, long (long)> >(at::TensorIteratorBase&, __nv_hdl_wrapper_t<false, true, false, __nv_dl_tag<void (*)(at::TensorIteratorBase&), &at::native::direct_copy_kernel_cuda, 12u>, long (long)> const&) + 0x34b (0x7fba7db888eb in /code/libtorch/lib/libtorch_cuda.so)
frame #5: at::native::direct_copy_kernel_cuda(at::TensorIteratorBase&) + 0x39c (0x7fba7db6fd7c in /code/libtorch/lib/libtorch_cuda.so)
frame #6: at::native::copy_device_to_device(at::TensorIterator&, bool, bool) + 0xcbd (0x7fba7db70bcd in /code/libtorch/lib/libtorch_cuda.so)
frame #7: <unknown function> + 0x1ae7bb2 (0x7fba7db72bb2 in /code/libtorch/lib/libtorch_cuda.so)
frame #8: <unknown function> + 0x1cf4596 (0x7fbad14e4596 in /code/libtorch/lib/libtorch_cpu.so)
frame #9: at::native::copy_(at::Tensor&, at::Tensor const&, bool) + 0x7a (0x7fbad14e5e3a in /code/libtorch/lib/libtorch_cpu.so)
frame #10: at::_ops::copy_::call(at::Tensor&, at::Tensor const&, bool) + 0x16f (0x7fbad231571f in /code/libtorch/lib/libtorch_cpu.so)
frame #11: at::native::_to_copy(at::Tensor const&, std::optional<c10::ScalarType>, std::optional<c10::Layout>, std::optional<c10::Device>, std::optional<bool>, bool, std::optional<c10::MemoryFormat>) + 0x1b23 (0x7fbad1806303 in /code/libtorch/lib/libtorch_cpu.so)
frame #12: <unknown function> + 0x2f5545f (0x7fbad274545f in /code/libtorch/lib/libtorch_cpu.so)
frame #13: at::_ops::_to_copy::redispatch(c10::DispatchKeySet, at::Tensor const&, std::optional<c10::ScalarType>, std::optional<c10::Layout>, std::optional<c10::Device>, std::optional<bool>, bool, std::optional<c10::MemoryFormat>) + 0x109 (0x7fbad1d674b9 in /code/libtorch/lib/libtorch_cpu.so)
frame #14: <unknown function> + 0x2d271fa (0x7fbad25171fa in /code/libtorch/lib/libtorch_cpu.so)
frame #15: at::_ops::_to_copy::call(at::Tensor const&, std::optional<c10::ScalarType>, std::optional<c10::Layout>, std::optional<c10::Device>, std::optional<bool>, bool, std::optional<c10::MemoryFormat>) + 0x1fe (0x7fbad1e0565e in /code/libtorch/lib/libtorch_cpu.so)
frame #16: at::native::to(at::Tensor const&, c10::ScalarType, bool, bool, std::optional<c10::MemoryFormat>) + 0xc2 (0x7fbad17fdbd2 in /code/libtorch/lib/libtorch_cpu.so)
frame #17: <unknown function> + 0x31927b8 (0x7fbad29827b8 in /code/libtorch/lib/libtorch_cpu.so)
frame #18: at::_ops::to_dtype::call(at::Tensor const&, c10::ScalarType, bool, bool, std::optional<c10::MemoryFormat>) + 0x18b (0x7fbad1fc8d2b in /code/libtorch/lib/libtorch_cpu.so)
frame #19: <unknown function> + 0x1f2e610 (0x7fbad171e610 in /code/libtorch/lib/libtorch_cpu.so)
frame #20: <unknown function> + 0x1f2e70d (0x7fbad171e70d in /code/libtorch/lib/libtorch_cpu.so)
frame #21: at::native::structured_sum_out::impl(at::Tensor const&, c10::OptionalArrayRef<long>, bool, std::optional<c10::ScalarType>, at::Tensor const&) + 0x64 (0x7fbad171e814 in /code/libtorch/lib/libtorch_cpu.so)
frame #22: <unknown function> + 0x3573697 (0x7fba7f5fe697 in /code/libtorch/lib/libtorch_cuda.so)
frame #23: <unknown function> + 0x357375d (0x7fba7f5fe75d in /code/libtorch/lib/libtorch_cuda.so)
frame #24: at::_ops::sum_dim_IntList::call(at::Tensor const&, c10::OptionalArrayRef<long>, bool, std::optional<c10::ScalarType>) + 0x1d8 (0x7fbad218cf28 in /code/libtorch/lib/libtorch_cpu.so)
frame #25: at::native::sum(at::Tensor const&, std::optional<c10::ScalarType>) + 0x3e (0x7fbad17146ee in /code/libtorch/lib/libtorch_cpu.so)
frame #26: <unknown function> + 0x2f537e8 (0x7fbad27437e8 in /code/libtorch/lib/libtorch_cpu.so)
frame #27: at::_ops::sum::redispatch(c10::DispatchKeySet, at::Tensor const&, std::optional<c10::ScalarType>) + 0x8b (0x7fbad20dbacb in /code/libtorch/lib/libtorch_cpu.so)
frame #28: <unknown function> + 0x492b508 (0x7fbad411b508 in /code/libtorch/lib/libtorch_cpu.so)
frame #29: <unknown function> + 0x492baab (0x7fbad411baab in /code/libtorch/lib/libtorch_cpu.so)
frame #30: at::_ops::sum::call(at::Tensor const&, std::optional<c10::ScalarType>) + 0x14d (0x7fbad218c96d in /code/libtorch/lib/libtorch_cpu.so)
frame #31: <unknown function> + 0x92ee9 (0x55eb1efb1ee9 in ./opensplat)
frame #32: <unknown function> + 0x33031 (0x55eb1ef52031 in ./opensplat)
frame #33: __libc_start_main + 0xf3 (0x7fba7afb0083 in /usr/lib/x86_64-linux-gnu/libc.so.6)
frame #34: <unknown function> + 0x3526e (0x55eb1ef5426e in ./opensplat)
pfxuan (Collaborator) commented May 4, 2024

It seems like the Docker image was built with a mismatched CUDA compute capability. To support the A100 GPU (compute capability 8.0), you can add TORCH_CUDA_ARCH_LIST="7.0;7.5;8.0":

docker build \
  -t opensplat:ubuntu-22.04-cuda-12.1.1-torch-2.2.1 \
  --build-arg UBUNTU_VERSION=22.04 \
  --build-arg CUDA_VERSION=12.1.1 \
  --build-arg TORCH_VERSION=2.2.1 \
  --build-arg TORCH_CUDA_ARCH_LIST="7.0;7.5;8.0" \
  --build-arg CMAKE_BUILD_TYPE=Release .
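
As a sanity check before rebuilding (not part of the original build steps, and assuming the CUDA toolkit and a reasonably recent driver are available in the container), you can compare what the GPU reports against what was compiled into the binaries; the A100 should report 8.0, and an sm_80 image must be present:

# Print the GPU's compute capability (compute_cap query needs a recent nvidia-smi)
nvidia-smi --query-gpu=compute_cap --format=csv,noheader
# List the SM architectures embedded in the built binary and in libtorch's CUDA library
cuobjdump --list-elf ./opensplat | grep sm_
cuobjdump --list-elf /code/libtorch/lib/libtorch_cuda.so | grep sm_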

embercult (Author) commented:
I had already added that, but I still get this issue.

pfxuan (Collaborator) commented May 6, 2024

I've created PR #91 in hopes that it can resolve your build issue. With the new update, you can test this build method:

Replace

--build-arg TORCH_CUDA_ARCH_LIST="7.0;7.5;8.0"

With

--build-arg CMAKE_CUDA_ARCHITECTURES="70;75;80"
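
For reference, assuming the same Ubuntu/CUDA/Torch versions as in the earlier command and a checkout that includes the PR #91 changes, the full command would look like:

docker build \
  -t opensplat:ubuntu-22.04-cuda-12.1.1-torch-2.2.1 \
  --build-arg UBUNTU_VERSION=22.04 \
  --build-arg CUDA_VERSION=12.1.1 \
  --build-arg TORCH_VERSION=2.2.1 \
  --build-arg CMAKE_CUDA_ARCHITECTURES="70;75;80" \
  --build-arg CMAKE_BUILD_TYPE=Release .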

pfxuan (Collaborator) commented May 16, 2024

Closed via #91

pfxuan closed this as completed May 16, 2024