
Is Tiledarray ready for SYCL conversion? #427

Open
victor-anisimov opened this issue Oct 19, 2023 · 7 comments

@victor-anisimov
Collaborator

I'm trying to compile the latest commit d0c0ded of TiledArray for the CUDA platform (V100) and I get a compiler error:

[ 95%] Building CXX object src/CMakeFiles/tiledarray.dir/TiledArray/sparse_shape.cpp.o
cd /home/vanisimov/tiledarray/master/build.V100.gcc10.mpich/src && /soft/packaging/spack-builds/linux-opensuse_leap15-x86_64/gcc-10.2.0/gcc-10.2.0-yudlyezca7twgd5o3wkkraur7wdbngdn/bin/g++ -DBOOST_ALL_NO_LIB -DBOOST_SERIALIZATION_DYN_LINK -DBTAS_ASSERT_THROWS=1 -DBTAS_DEFAULT_TARGET_MAX_INDEX_RANK=6 -DBTAS_HAS_BLAS_LAPACK=1 -DBTAS_HAS_BOOST_CONTAINER=1 -DBTAS_HAS_BOOST_ITERATOR=1 -DBTAS_HAS_BOOST_SERIALIZATION=1 -DBTAS_HAS_CBLAS=1 -DBTAS_HAS_INTEL_MKL=1 -DBTAS_HAS_LAPACKE=1 -DLAPACK_COMPLEX_CPP=1 -DLIBRETT_USES_CUDA=1 -DMADNESS_DISABLE_WORLD_GET_DEFAULT=1 -DMADNESS_LINALG_USE_LAPACKE=1 -DMADNESS_MPI_HEADER="/soft/restricted/CNDA/mpich/drop51.2/mpich-ofi-sockets-icc-default-gpu-drop51/include/mpi.h" -DMPICH_SKIP_MPICXX -DOMPI_SKIP_MPICXX -D_MPICC_H -I/home/vanisimov/tiledarray/master/src -I/home/vanisimov/tiledarray/master/build.V100.gcc10.mpich/src -I/soft/compilers/cuda/cuda-12.0.0/targets/x86_64-linux/include -I/soft/restricted/CNDA/updates/2023.05.15.001/oneapi/mkl/2023u2_20230512/include -I/home/vanisimov/tiledarray/master/build.V100.gcc10.mpich/_deps/madness-src/src -I/home/vanisimov/tiledarray/master/build.V100.gcc10.mpich/_deps/madness-build/src -I/home/vanisimov/tiledarray/master/build.V100.gcc10.mpich/_deps/madness-build/src/madness/world -I/home/vanisimov/tiledarray/master/build.V100.gcc10.mpich/_deps/madness-src/src/madness/world -I/soft/restricted/CNDA/mpich/drop51.2/mpich-ofi-sockets-icc-default-gpu-drop51/include -I/home/vanisimov/tiledarray/master/build.V100.gcc10.mpich/_deps/eigen-src -I/home/vanisimov/tiledarray/master/build.V100.gcc10.mpich/_deps/btas-src -I/home/vanisimov/tiledarray/master/build.V100.gcc10.mpich/_deps/blaspp-build/include -I/home/vanisimov/tiledarray/master/build.V100.gcc10.mpich/_deps/blaspp-src/include -I/home/vanisimov/tiledarray/master/build.V100.gcc10.mpich/_deps/lapackpp-build/include -I/home/vanisimov/tiledarray/master/build.V100.gcc10.mpich/_deps/lapackpp-src/include -I/home/vanisimov/tiledarray/master/build.V100.gcc10.mpich/_deps/umpire-src/src -I/home/vanisimov/tiledarray/master/build.V100.gcc10.mpich/_deps/umpire-src/src/tpl -I/home/vanisimov/tiledarray/master/build.V100.gcc10.mpich/_deps/umpire-src/src/tpl/umpire/camp/include -I/home/vanisimov/tiledarray/master/build.V100.gcc10.mpich/_deps/umpire-build/src/tpl/umpire/camp/include -I/home/vanisimov/tiledarray/master/build.V100.gcc10.mpich/_deps/umpire-build/include -I/home/vanisimov/tiledarray/master/build.V100.gcc10.mpich/_deps/librett-src/src -isystem /soft/compilers/cuda/cuda-12.0.0/include -I/home/vanisimov/tiledarray/install/include -std=c++17 -D__TBB_LEGACY_MODE=1 -DTBB_SUPPRESS_DEPRECATED_MESSAGES=1 -I/home/vanisimov/tiledarray/install/include -O0 -g -Wall -ftemplate-backtrace-limit=0 -std=c++17 -fPIC -I/home/vanisimov/tiledarray/install/include -std=c++17 -D__TBB_LEGACY_MODE=1 -DTBB_SUPPRESS_DEPRECATED_MESSAGES=1 -fopenmp -MD -MT src/CMakeFiles/tiledarray.dir/TiledArray/sparse_shape.cpp.o -MF CMakeFiles/tiledarray.dir/TiledArray/sparse_shape.cpp.o.d -o CMakeFiles/tiledarray.dir/TiledArray/sparse_shape.cpp.o -c /home/vanisimov/tiledarray/master/src/TiledArray/sparse_shape.cpp
In file included from /home/vanisimov/tiledarray/master/src/TiledArray/tensor/utility.h:33,
from /home/vanisimov/tiledarray/master/src/TiledArray/tensor/kernels.h:30,
from /home/vanisimov/tiledarray/master/src/TiledArray/tensor/tensor.h:30,
from /home/vanisimov/tiledarray/master/src/TiledArray/tensor.h:31,
from /home/vanisimov/tiledarray/master/src/TiledArray/sparse_shape.h:31,
from /home/vanisimov/tiledarray/master/src/TiledArray/sparse_shape.cpp:26:
/home/vanisimov/tiledarray/master/src/TiledArray/tiled_range1.h: In static member function 'static TiledArray::TiledRange1 TiledArray::TiledRange1::make_uniform(std::size_t, std::size_t)':
/home/vanisimov/tiledarray/master/src/TiledArray/tiled_range1.h:268:41: warning: comparison of integer expressions of different signedness: 'long int' and 'std::size_t' {aka 'long unsigned int'} [-Wsign-compare]
268 | for (auto i = num_avg_plus_one; i < ntiles;
| ~~^~~~~~~~
In file included from /home/vanisimov/tiledarray/master/src/TiledArray/tensor.h:31,
from /home/vanisimov/tiledarray/master/src/TiledArray/sparse_shape.h:31,
from /home/vanisimov/tiledarray/master/src/TiledArray/sparse_shape.cpp:26:
/home/vanisimov/tiledarray/master/src/TiledArray/tensor/tensor.h: In instantiation of 'TiledArray::Tensor<T, A>::Tensor(const T1&, const Perm&) [with T1 = TiledArray::Tensor<float>; Perm = TiledArray::Permutation; typename std::enable_if<(is_nested_tensor_v<T1> && is_permutation_v<Perm>)>::type* <anonymous> = 0; T = float; Allocator = Eigen::aligned_allocator<float>]':
/home/vanisimov/tiledarray/master/src/TiledArray/tensor/tensor.h:1279:16: required from 'TiledArray::Tensor<T, A> TiledArray::Tensor<T, A>::permute(const Perm&) const [with Perm = TiledArray::Permutation; <template-parameter-2-2> = void; T = float; Allocator = Eigen::aligned_allocator<float>]'
/home/vanisimov/tiledarray/master/src/TiledArray/sparse_shape.h:1201:44: required from 'TiledArray::SparseShape<T>::SparseShape_ TiledArray::SparseShape<T>::perm(const TiledArray::Permutation&) const [with T = float; TiledArray::SparseShape<T>::SparseShape_ = TiledArray::SparseShape<float>]'
/home/vanisimov/tiledarray/master/src/TiledArray/sparse_shape.cpp:30:16: required from here
/home/vanisimov/tiledarray/master/src/TiledArray/tensor/tensor.h:404:24: error: no matching function for call to 'tensor_init(, TiledArray::Permutation, TiledArray::Tensor&, const TiledArray::Tensor&)'
404 | detail::tensor_init(value_converter, outer(perm),
| ~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
405 | *this, other);
| ~~~~~~~~~~~~~
In file included from /home/vanisimov/tiledarray/master/src/TiledArray/tensor/tensor.h:30,
from /home/vanisimov/tiledarray/master/src/TiledArray/tensor.h:31,
from /home/vanisimov/tiledarray/master/src/TiledArray/sparse_shape.h:31,
from /home/vanisimov/tiledarray/master/src/TiledArray/sparse_shape.cpp:26:
/home/vanisimov/tiledarray/master/src/TiledArray/tensor/kernels.h:449:13: note: candidate: 'template<class Op, class TR, class ... Ts, typename std::enable_if<(TiledArray::detail::is_tensor<T2, Ts ...>::value && TiledArray::detail::is_contiguous_tensor<T2, Ts ...>::value)>::type> void TiledArray::detail::tensor_init(Op&&, TR&, const Ts& ...)'
449 | inline void tensor_init(Op&& op, TR& result, const Ts&... tensors) {
| ^~~~~~~~~~~
/home/vanisimov/tiledarray/master/src/TiledArray/tensor/kernels.h:449:13: note: template argument deduction/substitution failed:
In file included from /home/vanisimov/tiledarray/master/src/TiledArray/tensor.h:31,
from /home/vanisimov/tiledarray/master/src/TiledArray/sparse_shape.h:31,
from /home/vanisimov/tiledarray/master/src/TiledArray/sparse_shape.cpp:26:
/home/vanisimov/tiledarray/master/src/TiledArray/tensor/tensor.h:404:24: note: couldn't deduce template parameter 'Op'
404 | detail::tensor_init(value_converter, outer(perm),
| ~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
405 | *this, other);
| ~~~~~~~~~~~~~

@evaleev
Member

evaleev commented Oct 19, 2023

@victor-anisimov I think this is due to a regression in gcc 10.2 ... do you have 10.3 available? That's the main compiler we use on our cluster; it should work.

@victor-anisimov
Collaborator Author

gcc/11.1.0 worked for me. Thank you very much for the suggestion! However, I see many tests failing.

  Start  1: madness/test/world/build

1/26 Test #1: madness/test/world/build ................... Passed 53.98 sec
Start 2: madness/test/world/test_prof/run
2/26 Test #2: madness/test/world/test_prof/run ...........***Failed 1.19 sec
Start 3: madness/test/world/test_ar/run
3/26 Test #3: madness/test/world/test_ar/run .............***Failed 0.64 sec
Start 4: madness/test/world/test_hashdc/run
4/26 Test #4: madness/test/world/test_hashdc/run .........***Failed 0.64 sec
Start 5: madness/test/world/test_hello/run
5/26 Test #5: madness/test/world/test_hello/run ..........***Failed 0.64 sec
Start 6: madness/test/world/test_atomicint/run
6/26 Test #6: madness/test/world/test_atomicint/run ......***Failed 0.64 sec
Start 7: madness/test/world/test_future/run
7/26 Test #7: madness/test/world/test_future/run .........***Failed 0.64 sec
Start 8: madness/test/world/test_future2/run
8/26 Test #8: madness/test/world/test_future2/run ........***Failed 0.64 sec
Start 9: madness/test/world/test_future3/run
9/26 Test #9: madness/test/world/test_future3/run ........***Failed 0.64 sec
Start 10: madness/test/world/test_dc/run
10/26 Test #10: madness/test/world/test_dc/run .............***Failed 0.64 sec
Start 11: madness/test/world/test_hashthreaded/run
11/26 Test #11: madness/test/world/test_hashthreaded/run ...***Failed 0.64 sec
Start 12: madness/test/world/test_queue/run
12/26 Test #12: madness/test/world/test_queue/run ..........***Failed 0.63 sec
Start 13: madness/test/world/test_world/run
13/26 Test #13: madness/test/world/test_world/run ..........***Failed 0.66 sec
Start 14: madness/test/world/test_worldprofile/run
14/26 Test #14: madness/test/world/test_worldprofile/run ...***Failed 0.68 sec
Start 15: madness/test/world/test_binsorter/run
15/26 Test #15: madness/test/world/test_binsorter/run ......***Failed 0.64 sec
Start 16: madness/test/world/test_vector/run
16/26 Test #16: madness/test/world/test_vector/run ......... Passed 0.35 sec
Start 17: madness/test/world/test_worldptr/run
17/26 Test #17: madness/test/world/test_worldptr/run .......***Failed 0.92 sec
Start 18: madness/test/world/test_worldref/run
18/26 Test #18: madness/test/world/test_worldref/run .......***Failed 0.87 sec
Start 19: madness/test/world/test_stack/run
19/26 Test #19: madness/test/world/test_stack/run .......... Passed 0.12 sec
Start 20: madness/test/world/test_googletest/run
20/26 Test #20: madness/test/world/test_googletest/run ..... Passed 0.66 sec
Start 21: madness/test/world/test_tree/run
21/26 Test #21: madness/test/world/test_tree/run ...........***Failed 0.64 sec
Start 22: btas/unit/build
22/26 Test #22: btas/unit/build ............................ Passed 54.24 sec
Start 23: btas/unit/run
23/26 Test #23: btas/unit/run .............................. Passed 215.30 sec
Start 24: tiledarray/unit/build
24/26 Test #24: tiledarray/unit/build ...................... Passed 1082.45 sec
Start 25: tiledarray/unit/run-np-1
25/26 Test #25: tiledarray/unit/run-np-1 ...................***Failed 1.43 sec
Start 26: tiledarray/unit/run-np-2
26/26 Test #26: tiledarray/unit/run-np-2 ...................***Failed 0.75 sec

27% tests passed, 19 tests failed out of 26

@evaleev
Member

evaleev commented Oct 19, 2023 via email

@victor-anisimov
Collaborator Author

Yes, you are right! My MPI environment is broken; it crashes even on a hello-world program. I'll rerun the tests on Polaris at some later time.

I have one more question, though. I saw the change in directory structure from cpu/cuda to host/device. Also, a number of cuda suffixes in file names have been changed to the device suffix, which is great. However, I can still see cpu_cuda_vector in src/TiledArray/device. Are you going to rename it?

For HIP/SYCL, you would also need to rename the .cu files to .cpp and modify the CMake scripts accordingly:
./src/TiledArray/device/cpu_cuda_vector.cu
./src/TiledArray/device/kernel/thrust/mult_kernel.cu
./src/TiledArray/device/kernel/thrust/reduce_kernel.cu
./src/TiledArray/device/um_storage.cu

Do you plan to make that change or keep the file extensions as they are? Thanks!

@evaleev
Member

evaleev commented Oct 19, 2023

I have one more question, though. I saw the change in directory structure from cpu/cuda to host/device. Also, a number of cuda suffixes in file names have been changed to the device suffix, which is great. However, I can still see cpu_cuda_vector in src/TiledArray/device. Are you going to rename it?

The problem with cpu_cuda_vector is that it requires Thrust. Initially I was not sure whether Thrust would be a viable option. It seems that ROCm now provides it; I'm not so sure what the SYCL story is on this front (it is probably just standard C++ there?).

In any case, since cpu_cuda_vector is not actually used, I do not plan to maintain it (I will revive it if needed). It should probably be marked deprecated.

For HIP/SYCL, you would also need to rename the .cu files to .cpp and modify the CMake scripts accordingly: ./src/TiledArray/device/cpu_cuda_vector.cu ./src/TiledArray/device/kernel/thrust/mult_kernel.cu ./src/TiledArray/device/kernel/thrust/reduce_kernel.cu ./src/TiledArray/device/um_storage.cu

See above re: cpu_cuda_vector.cu. The same story applies to um_storage.cu, actually: unused, can be deprecated.

The rest are CUDA-specific implementation files ... they are actually portable, but CMake requires a separate file for each device-specific implementation, so notice that I just #include mult_kernel.cu into mult_kernel.hip (using an implementation header would probably be better, but that is a level of pedantry I don't have time for).

For SYCL you would need to implement analogs of thrust/{mult,reduce}_kernel.{cu,hip} somewhere, though not in the thrust subdirectory if they are not implemented using Thrust.
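
For concreteness, a minimal sketch of what such a SYCL analog could look like (the function name mult_to_sycl and the sycl::queue-based signature are illustrative assumptions, not existing TiledArray API):

#include <sycl/sycl.hpp>
#include <cstddef>

// hypothetical SYCL analog of the Thrust mult kernel: result[i] *= arg[i];
// assumes arg and result point to USM device (or shared) allocations
template <typename T>
void mult_to_sycl(T* result, const T* arg, std::size_t n, sycl::queue& q) {
  q.parallel_for(sycl::range<1>(n),
                 [=](sycl::id<1> i) { result[i] *= arg[i]; })
      .wait();
}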

@victor-anisimov
Collaborator Author

It is up to you whether you prefer keeping a separate file for each device-specific implementation instead of keeping the analogous kernels next to each other in one file, separated by preprocessor directives, as is done in LibreTT. Seeing all three platform-specific kernels in one place makes it easier to propagate a change applied to one kernel to the other two, which simplifies maintenance. The thrust directory basically holds some vector operations. Intel provides functionality similar to Thrust in the dpl (oneDPL) library. I have a quite similar-looking SYCL implementation for these kernels.

Compare
thrust::multiplies<T> mul_op;
thrust::transform(
    thrust::cuda::par.on(stream), thrust::device_pointer_cast(arg),
    thrust::device_pointer_cast(arg) + n, thrust::device_pointer_cast(result),
    thrust::device_pointer_cast(result), mul_op);

to
std::multiplies<T> mul_op;
std::transform(
    dpl::execution::make_device_policy(*stream), dpct::get_device_pointer(arg),
    dpct::get_device_pointer(arg) + n, dpct::get_device_pointer(result),
    dpct::get_device_pointer(result), mul_op);

If this is not an incentive to make the kernel API more general (more vendor-independent), then so be it.
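
For illustration, a combined single-source version of the mult kernel in the LibreTT style might look like the following (the TILEDARRAY_HAS_* macros and the kernel_stream_t alias are assumptions made for this sketch, not existing TiledArray definitions):

#if defined(TILEDARRAY_HAS_CUDA)
#include <cuda_runtime.h>
#include <thrust/device_ptr.h>
#include <thrust/functional.h>
#include <thrust/system/cuda/execution_policy.h>
#include <thrust/transform.h>
using kernel_stream_t = cudaStream_t;
#elif defined(TILEDARRAY_HAS_SYCL)
#include <oneapi/dpl/algorithm>
#include <oneapi/dpl/execution>
#include <sycl/sycl.hpp>
using kernel_stream_t = sycl::queue*;
#endif
#include <cstddef>
#include <functional>

// in-place elementwise multiply, result[i] *= arg[i]; all platform-specific
// branches sit next to each other in one file, as in LibreTT
template <typename T>
void mult_to_kernel(T* result, const T* arg, std::size_t n,
                    kernel_stream_t stream) {
#if defined(TILEDARRAY_HAS_CUDA)
  thrust::transform(thrust::cuda::par.on(stream),
                    thrust::device_pointer_cast(arg),
                    thrust::device_pointer_cast(arg) + n,
                    thrust::device_pointer_cast(result),
                    thrust::device_pointer_cast(result),
                    thrust::multiplies<T>());
#elif defined(TILEDARRAY_HAS_SYCL)
  std::transform(oneapi::dpl::execution::make_device_policy(*stream),
                 arg, arg + n, result, result, std::multiplies<T>());
#endif
}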

@evaleev
Member

evaleev commented Feb 17, 2024

status quo:

  • all Thrust-based device code is localized in kernel/thrust/{mult,reduce}_kernel.h
  • these headers are then included in the .{cu,hip} source files, which provide the host-only functions (e.g. mult_kernel) invoked by TA tasks.
  • the .{cu,hip} source files are identical, hence the .hip file #includes the .cu file directly.

Because both CUDA and HIP provide the same Thrust API, there are no platform-dependent code blocks there. I agree with your arguments, and I'm fine with adding DPC++-specific extensions to these headers. I believe that the "kernel" functions will be platform independent (e.g. mult_kernel calls mult_kernel_thrust, so all platform dependence is localized in mult_kernel_thrust). We will need separate {mult,reduce}_kernel.cpp files for the DPC++ code, but these can again #include a common implementation source file (currently {mult,reduce}_kernel.cu; probably better to move to {mult,reduce}_kernel.ipp files).
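
Concretely, the layering could look something like this (a sketch only; the exact signatures are illustrative):

// mult_kernel.ipp -- common implementation source, #included by each
// platform-specific translation unit (mult_kernel.cu, mult_kernel.hip,
// and eventually mult_kernel.cpp for DPC++)
#include <cstddef>

#include "kernel/thrust/mult_kernel.h"  // provides mult_kernel_thrust

// host-only entry point invoked by TA tasks; all platform dependence is
// localized in mult_kernel_thrust (or its DPC++ analog)
template <typename T, typename Stream>
void mult_kernel(T* result, const T* arg, std::size_t n, Stream& stream) {
  mult_kernel_thrust(result, arg, n, stream);
}

// each platform-specific source file then reduces to a single line:
// #include "mult_kernel.ipp"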
