
Is Tiledarray ready for SYCL conversion? #427

Open
victor-anisimov opened this issue Oct 19, 2023 · 7 comments

@victor-anisimov
Collaborator

I'm trying to compile the latest commit d0c0ded of TiledArray for the CUDA platform (V100) and I get a compiler error:

[ 95%] Building CXX object src/CMakeFiles/tiledarray.dir/TiledArray/sparse_shape.cpp.o
cd /home/vanisimov/tiledarray/master/build.V100.gcc10.mpich/src && /soft/packaging/spack-builds/linux-opensuse_leap15-x86_64/gcc-10.2.0/gcc-10.2.0-yudlyezca7twgd5o3wkkraur7wdbngdn/bin/g++ -DBOOST_ALL_NO_LIB -DBOOST_SERIALIZATION_DYN_LINK -DBTAS_ASSERT_THROWS=1 -DBTAS_DEFAULT_TARGET_MAX_INDEX_RANK=6 -DBTAS_HAS_BLAS_LAPACK=1 -DBTAS_HAS_BOOST_CONTAINER=1 -DBTAS_HAS_BOOST_ITERATOR=1 -DBTAS_HAS_BOOST_SERIALIZATION=1 -DBTAS_HAS_CBLAS=1 -DBTAS_HAS_INTEL_MKL=1 -DBTAS_HAS_LAPACKE=1 -DLAPACK_COMPLEX_CPP=1 -DLIBRETT_USES_CUDA=1 -DMADNESS_DISABLE_WORLD_GET_DEFAULT=1 -DMADNESS_LINALG_USE_LAPACKE=1 -DMADNESS_MPI_HEADER="/soft/restricted/CNDA/mpich/drop51.2/mpich-ofi-sockets-icc-default-gpu-drop51/include/mpi.h" -DMPICH_SKIP_MPICXX -DOMPI_SKIP_MPICXX -D_MPICC_H -I/home/vanisimov/tiledarray/master/src -I/home/vanisimov/tiledarray/master/build.V100.gcc10.mpich/src -I/soft/compilers/cuda/cuda-12.0.0/targets/x86_64-linux/include -I/soft/restricted/CNDA/updates/2023.05.15.001/oneapi/mkl/2023u2_20230512/include -I/home/vanisimov/tiledarray/master/build.V100.gcc10.mpich/_deps/madness-src/src -I/home/vanisimov/tiledarray/master/build.V100.gcc10.mpich/_deps/madness-build/src -I/home/vanisimov/tiledarray/master/build.V100.gcc10.mpich/_deps/madness-build/src/madness/world -I/home/vanisimov/tiledarray/master/build.V100.gcc10.mpich/_deps/madness-src/src/madness/world -I/soft/restricted/CNDA/mpich/drop51.2/mpich-ofi-sockets-icc-default-gpu-drop51/include -I/home/vanisimov/tiledarray/master/build.V100.gcc10.mpich/_deps/eigen-src -I/home/vanisimov/tiledarray/master/build.V100.gcc10.mpich/_deps/btas-src -I/home/vanisimov/tiledarray/master/build.V100.gcc10.mpich/_deps/blaspp-build/include -I/home/vanisimov/tiledarray/master/build.V100.gcc10.mpich/_deps/blaspp-src/include -I/home/vanisimov/tiledarray/master/build.V100.gcc10.mpich/_deps/lapackpp-build/include -I/home/vanisimov/tiledarray/master/build.V100.gcc10.mpich/_deps/lapackpp-src/include -I/home/vanisimov/tiledarray/master/build.V100.gcc10.mpich/_deps/umpire-src/src -I/home/vanisimov/tiledarray/master/build.V100.gcc10.mpich/_deps/umpire-src/src/tpl -I/home/vanisimov/tiledarray/master/build.V100.gcc10.mpich/_deps/umpire-src/src/tpl/umpire/camp/include -I/home/vanisimov/tiledarray/master/build.V100.gcc10.mpich/_deps/umpire-build/src/tpl/umpire/camp/include -I/home/vanisimov/tiledarray/master/build.V100.gcc10.mpich/_deps/umpire-build/include -I/home/vanisimov/tiledarray/master/build.V100.gcc10.mpich/_deps/librett-src/src -isystem /soft/compilers/cuda/cuda-12.0.0/include -I/home/vanisimov/tiledarray/install/include -std=c++17 -D__TBB_LEGACY_MODE=1 -DTBB_SUPPRESS_DEPRECATED_MESSAGES=1 -I/home/vanisimov/tiledarray/install/include -O0 -g -Wall -ftemplate-backtrace-limit=0 -std=c++17 -fPIC -I/home/vanisimov/tiledarray/install/include -std=c++17 -D__TBB_LEGACY_MODE=1 -DTBB_SUPPRESS_DEPRECATED_MESSAGES=1 -fopenmp -MD -MT src/CMakeFiles/tiledarray.dir/TiledArray/sparse_shape.cpp.o -MF CMakeFiles/tiledarray.dir/TiledArray/sparse_shape.cpp.o.d -o CMakeFiles/tiledarray.dir/TiledArray/sparse_shape.cpp.o -c /home/vanisimov/tiledarray/master/src/TiledArray/sparse_shape.cpp
In file included from /home/vanisimov/tiledarray/master/src/TiledArray/tensor/utility.h:33,
from /home/vanisimov/tiledarray/master/src/TiledArray/tensor/kernels.h:30,
from /home/vanisimov/tiledarray/master/src/TiledArray/tensor/tensor.h:30,
from /home/vanisimov/tiledarray/master/src/TiledArray/tensor.h:31,
from /home/vanisimov/tiledarray/master/src/TiledArray/sparse_shape.h:31,
from /home/vanisimov/tiledarray/master/src/TiledArray/sparse_shape.cpp:26:
/home/vanisimov/tiledarray/master/src/TiledArray/tiled_range1.h: In static member function 'static TiledArray::TiledRange1 TiledArray::TiledRange1::make_uniform(std::size_t, std::size_t)':
/home/vanisimov/tiledarray/master/src/TiledArray/tiled_range1.h:268:41: warning: comparison of integer expressions of different signedness: 'long int' and 'std::size_t' {aka 'long unsigned int'} [-Wsign-compare]
268 | for (auto i = num_avg_plus_one; i < ntiles;
| ~~^~~~~~~~
In file included from /home/vanisimov/tiledarray/master/src/TiledArray/tensor.h:31,
from /home/vanisimov/tiledarray/master/src/TiledArray/sparse_shape.h:31,
from /home/vanisimov/tiledarray/master/src/TiledArray/sparse_shape.cpp:26:
/home/vanisimov/tiledarray/master/src/TiledArray/tensor/tensor.h: In instantiation of 'TiledArray::Tensor<T, A>::Tensor(const T1&, const Perm&) [with T1 = TiledArray::Tensor<float>; Perm = TiledArray::Permutation; typename std::enable_if<(is_nested_tensor_v<T1> && is_permutation_v<Perm>)>::type* <anonymous> = 0; T = float; Allocator = Eigen::aligned_allocator<float>]':
/home/vanisimov/tiledarray/master/src/TiledArray/tensor/tensor.h:1279:16: required from 'TiledArray::Tensor<T, A> TiledArray::Tensor<T, A>::permute(const Perm&) const [with Perm = TiledArray::Permutation; <template-parameter-2-2> = void; T = float; Allocator = Eigen::aligned_allocator<float>]'
/home/vanisimov/tiledarray/master/src/TiledArray/sparse_shape.h:1201:44: required from 'TiledArray::SparseShape<T>::SparseShape_ TiledArray::SparseShape<T>::perm(const TiledArray::Permutation&) const [with T = float; TiledArray::SparseShape<T>::SparseShape_ = TiledArray::SparseShape<float>]'
/home/vanisimov/tiledarray/master/src/TiledArray/sparse_shape.cpp:30:16: required from here
/home/vanisimov/tiledarray/master/src/TiledArray/tensor/tensor.h:404:24: error: no matching function for call to 'tensor_init(, TiledArray::Permutation, TiledArray::Tensor&, const TiledArray::Tensor&)'
404 | detail::tensor_init(value_converter, outer(perm),
| ~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
405 | *this, other);
| ~~~~~~~~~~~~~
In file included from /home/vanisimov/tiledarray/master/src/TiledArray/tensor/tensor.h:30,
from /home/vanisimov/tiledarray/master/src/TiledArray/tensor.h:31,
from /home/vanisimov/tiledarray/master/src/TiledArray/sparse_shape.h:31,
from /home/vanisimov/tiledarray/master/src/TiledArray/sparse_shape.cpp:26:
/home/vanisimov/tiledarray/master/src/TiledArray/tensor/kernels.h:449:13: note: candidate: 'template<class Op, class TR, class ... Ts, typename std::enable_if<(TiledArray::detail::is_tensor<T2, Ts ...>::value && TiledArray::detail::is_contiguous_tensor<T2, Ts ...>::value)>::type> void TiledArray::detail::tensor_init(Op&&, TR&, const Ts& ...)'
449 | inline void tensor_init(Op&& op, TR& result, const Ts&... tensors) {
| ^~~~~~~~~~~
/home/vanisimov/tiledarray/master/src/TiledArray/tensor/kernels.h:449:13: note: template argument deduction/substitution failed:
In file included from /home/vanisimov/tiledarray/master/src/TiledArray/tensor.h:31,
from /home/vanisimov/tiledarray/master/src/TiledArray/sparse_shape.h:31,
from /home/vanisimov/tiledarray/master/src/TiledArray/sparse_shape.cpp:26:
/home/vanisimov/tiledarray/master/src/TiledArray/tensor/tensor.h:404:24: note: couldn't deduce template parameter 'Op'
404 | detail::tensor_init(value_converter, outer(perm),
| ~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
405 | *this, other);
| ~~~~~~~~~~~~~

@evaleev
Member

evaleev commented Oct 19, 2023

@victor-anisimov I think this is due to a regression in gcc 10.2 ... do you have 10.3 available? That's the main compiler we use on our cluster; it should work.

@victor-anisimov
Collaborator Author

gcc/11.1.0 worked for me. Thank you very much for the suggestion! However, I see many tests failing.

  Start  1: madness/test/world/build

1/26 Test #1: madness/test/world/build ................... Passed 53.98 sec
Start 2: madness/test/world/test_prof/run
2/26 Test #2: madness/test/world/test_prof/run ...........***Failed 1.19 sec
Start 3: madness/test/world/test_ar/run
3/26 Test #3: madness/test/world/test_ar/run .............***Failed 0.64 sec
Start 4: madness/test/world/test_hashdc/run
4/26 Test #4: madness/test/world/test_hashdc/run .........***Failed 0.64 sec
Start 5: madness/test/world/test_hello/run
5/26 Test #5: madness/test/world/test_hello/run ..........***Failed 0.64 sec
Start 6: madness/test/world/test_atomicint/run
6/26 Test #6: madness/test/world/test_atomicint/run ......***Failed 0.64 sec
Start 7: madness/test/world/test_future/run
7/26 Test #7: madness/test/world/test_future/run .........***Failed 0.64 sec
Start 8: madness/test/world/test_future2/run
8/26 Test #8: madness/test/world/test_future2/run ........***Failed 0.64 sec
Start 9: madness/test/world/test_future3/run
9/26 Test #9: madness/test/world/test_future3/run ........***Failed 0.64 sec
Start 10: madness/test/world/test_dc/run
10/26 Test #10: madness/test/world/test_dc/run .............***Failed 0.64 sec
Start 11: madness/test/world/test_hashthreaded/run
11/26 Test #11: madness/test/world/test_hashthreaded/run ...***Failed 0.64 sec
Start 12: madness/test/world/test_queue/run
12/26 Test #12: madness/test/world/test_queue/run ..........***Failed 0.63 sec
Start 13: madness/test/world/test_world/run
13/26 Test #13: madness/test/world/test_world/run ..........***Failed 0.66 sec
Start 14: madness/test/world/test_worldprofile/run
14/26 Test #14: madness/test/world/test_worldprofile/run ...***Failed 0.68 sec
Start 15: madness/test/world/test_binsorter/run
15/26 Test #15: madness/test/world/test_binsorter/run ......***Failed 0.64 sec
Start 16: madness/test/world/test_vector/run
16/26 Test #16: madness/test/world/test_vector/run ......... Passed 0.35 sec
Start 17: madness/test/world/test_worldptr/run
17/26 Test #17: madness/test/world/test_worldptr/run .......***Failed 0.92 sec
Start 18: madness/test/world/test_worldref/run
18/26 Test #18: madness/test/world/test_worldref/run .......***Failed 0.87 sec
Start 19: madness/test/world/test_stack/run
19/26 Test #19: madness/test/world/test_stack/run .......... Passed 0.12 sec
Start 20: madness/test/world/test_googletest/run
20/26 Test #20: madness/test/world/test_googletest/run ..... Passed 0.66 sec
Start 21: madness/test/world/test_tree/run
21/26 Test #21: madness/test/world/test_tree/run ...........***Failed 0.64 sec
Start 22: btas/unit/build
22/26 Test #22: btas/unit/build ............................ Passed 54.24 sec
Start 23: btas/unit/run
23/26 Test #23: btas/unit/run .............................. Passed 215.30 sec
Start 24: tiledarray/unit/build
24/26 Test #24: tiledarray/unit/build ...................... Passed 1082.45 sec
Start 25: tiledarray/unit/run-np-1
25/26 Test #25: tiledarray/unit/run-np-1 ...................***Failed 1.43 sec
Start 26: tiledarray/unit/run-np-2
26/26 Test #26: tiledarray/unit/run-np-2 ...................***Failed 0.75 sec

27% tests passed, 19 tests failed out of 26

@evaleev
Member

evaleev commented Oct 19, 2023 via email

@victor-anisimov
Collaborator Author

Yes, you are right! My MPI environment is broken; it crashes even on a hello-world program. I'll rerun the tests on Polaris at some later time.

I have one more question, though. I saw the change in directory structure from cpu/cuda to host/device. Also, a number of cuda suffixes in file names have been changed to the device suffix, which is great. However, I can still see cpu_cuda_vector in src/TiledArray/device. Are you going to rename it?

For HIP/SYCL, you would also need to rename the .cu files to .cpp and modify the CMake scripts accordingly:
./src/TiledArray/device/cpu_cuda_vector.cu
./src/TiledArray/device/kernel/thrust/mult_kernel.cu
./src/TiledArray/device/kernel/thrust/reduce_kernel.cu
./src/TiledArray/device/um_storage.cu

Do you plan to make that change or keep the file extensions as they are? Thanks!

@evaleev
Member

evaleev commented Oct 19, 2023

I have one more question, though. I saw the change in directory structure from cpu/cuda to host/device. Also, a number of cuda suffixes in file names have been changed to the device suffix, which is great. However, I can still see cpu_cuda_vector in src/TiledArray/device. Are you going to rename it?

The problem with cpu_cuda_vector is that it requires Thrust. Initially I was not sure whether Thrust would be a viable option. It seems that ROCm now provides it; I'm not so sure what the SYCL story is on this front (it is probably just standard C++ there?).

In any case, since cpu_cuda_vector is not actually used, I do not plan to maintain it (I will revive it if needed). It should probably be marked deprecated.

For HIP/SYCL, you would also need to rename the .cu files to .cpp and modify the CMake scripts accordingly: ./src/TiledArray/device/cpu_cuda_vector.cu ./src/TiledArray/device/kernel/thrust/mult_kernel.cu ./src/TiledArray/device/kernel/thrust/reduce_kernel.cu ./src/TiledArray/device/um_storage.cu

See above re: cpu_cuda_vector.cu. The same story applies to um_storage.cu, actually: unused, can be deprecated.

The rest are CUDA-specific implementation files ... they are actually portable, but CMake requires a separate file for each device-specific implementation, so notice that I just #include mult_kernel.cu into mult_kernel.hip (using an implementation header would probably be better, but that is a level of pedantry I don't have time for).

For SYCL you would need to implement analogs of thrust/{mult,reduce}_kernel.{cu,hip} somewhere, though not in the thrust subdirectory if they are not implemented using Thrust.
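
For concreteness, a minimal sketch of what such a SYCL analog could look like (the function name mult_to_sycl and the sycl::queue-based signature are illustrative assumptions, not existing TiledArray API):

#include <sycl/sycl.hpp>
#include <cstddef>

// hypothetical SYCL analog of the Thrust mult kernel: result[i] *= arg[i];
// assumes arg and result point to USM device (or shared) allocations
template <typename T>
void mult_to_sycl(T* result, const T* arg, std::size_t n, sycl::queue& q) {
  q.parallel_for(sycl::range<1>(n),
                 [=](sycl::id<1> i) { result[i] *= arg[i]; })
      .wait();
}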

@victor-anisimov
Collaborator Author

It is up to you whether you prefer keeping a separate file for each device-specific implementation instead of keeping the analogous kernels next to each other in one file, separated by preprocessor directives, as is done in LibreTT. Seeing all three platform-specific kernels in one place makes it easier to propagate a change applied to one kernel to the other two, which simplifies maintenance. The thrust directory basically holds some vector operations. Intel provides functionality similar to Thrust in the dpl (oneDPL) library. I have a quite similar-looking SYCL implementation for these kernels.

Compare
thrust::multiplies<T> mul_op;
thrust::transform(
    thrust::cuda::par.on(stream), thrust::device_pointer_cast(arg),
    thrust::device_pointer_cast(arg) + n, thrust::device_pointer_cast(result),
    thrust::device_pointer_cast(result), mul_op);

to
std::multiplies<T> mul_op;
std::transform(
    dpl::execution::make_device_policy(*stream), dpct::get_device_pointer(arg),
    dpct::get_device_pointer(arg) + n, dpct::get_device_pointer(result),
    dpct::get_device_pointer(result), mul_op);

If this is not an incentive to make the kernel API more general (more vendor-independent), then so be it.
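
For illustration, a combined single-source version of the mult kernel in the LibreTT style might look like the following (the TILEDARRAY_HAS_* macros and the kernel_stream_t alias are assumptions made for this sketch, not existing TiledArray definitions):

#if defined(TILEDARRAY_HAS_CUDA)
#include <cuda_runtime.h>
#include <thrust/device_ptr.h>
#include <thrust/functional.h>
#include <thrust/system/cuda/execution_policy.h>
#include <thrust/transform.h>
using kernel_stream_t = cudaStream_t;
#elif defined(TILEDARRAY_HAS_SYCL)
#include <oneapi/dpl/algorithm>
#include <oneapi/dpl/execution>
#include <sycl/sycl.hpp>
using kernel_stream_t = sycl::queue*;
#endif
#include <cstddef>
#include <functional>

// in-place elementwise multiply, result[i] *= arg[i]; all platform-specific
// branches sit next to each other in one file, as in LibreTT
template <typename T>
void mult_to_kernel(T* result, const T* arg, std::size_t n,
                    kernel_stream_t stream) {
#if defined(TILEDARRAY_HAS_CUDA)
  thrust::transform(thrust::cuda::par.on(stream),
                    thrust::device_pointer_cast(arg),
                    thrust::device_pointer_cast(arg) + n,
                    thrust::device_pointer_cast(result),
                    thrust::device_pointer_cast(result),
                    thrust::multiplies<T>());
#elif defined(TILEDARRAY_HAS_SYCL)
  std::transform(oneapi::dpl::execution::make_device_policy(*stream),
                 arg, arg + n, result, result, std::multiplies<T>());
#endif
}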

@evaleev
Member

evaleev commented Feb 17, 2024

status quo:

  • all Thrust-based device code is localized in kernel/thrust/{mult,reduce}_kernel.h
  • these headers are then included in the .{cu,hip} source files, which provide the host-only functions (e.g. mult_kernel) invoked by TA tasks.
  • the .{cu,hip} source files are identical, hence the .hip file #includes the .cu file directly.

Because both CUDA and HIP provide the same Thrust API, there are no platform-dependent code blocks there. I agree with your arguments, and I'm fine with adding DPC++-specific extensions to these headers. I believe that the "kernel" functions will be platform independent (e.g. mult_kernel calls mult_kernel_thrust, so all platform dependence is localized in mult_kernel_thrust). We will need separate {mult,reduce}_kernel.cpp files for the DPC++ code, but these can again #include a common implementation source file (currently {mult,reduce}_kernel.cu; probably better to move to {mult,reduce}_kernel.ipp files).
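
Concretely, the layering could look something like this (a sketch only; the exact signatures are illustrative):

// mult_kernel.ipp -- common implementation source, #included by each
// platform-specific translation unit (mult_kernel.cu, mult_kernel.hip,
// and eventually mult_kernel.cpp for DPC++)
#include <cstddef>

#include "kernel/thrust/mult_kernel.h"  // provides mult_kernel_thrust

// host-only entry point invoked by TA tasks; all platform dependence is
// localized in mult_kernel_thrust (or its DPC++ analog)
template <typename T, typename Stream>
void mult_kernel(T* result, const T* arg, std::size_t n, Stream& stream) {
  mult_kernel_thrust(result, arg, n, stream);
}

// each platform-specific source file then reduces to a single line:
// #include "mult_kernel.ipp"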
