since 0.8.2
The API would probably take the form of a 1-modifier that decomposes a function and executes it on the GPU. Example use case:
1 + •_CUDAForAll ↕10
⟨ 1 2 3 4 5 6 7 8 9 10 ⟩
This would take the function and apply it to each element of the array 𝕩.
Using a combination of •Decompose and •PrimInd, a given function could be broken down into primitives and data values, which could all be copied over to the GPU device.
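As a rough sketch of that decomposition step, the recursion below flattens a compound function into primitive indices and data values. Decompose and PrimInd stand in for however the interpreter exposes •Decompose and •PrimInd internally, and PrimOrData is made up for this example; none of these names are actual CXBQN API.
// Sketch only: Decompose, PrimInd, and PrimOrData are assumed names, not
// actual CXBQN API. The codes follow the •Decompose specification.
struct PrimOrData {
  bool is_prim;      // true: a runtime primitive, false: a plain data value
  size_t prim_index; // index into the runtime (and gpu_runtime) tables
  O<Value> data;     // the literal value, when is_prim is false
};
void Flatten(O<Value> f, std::vector<PrimOrData>& out) {
  auto parts = dynamic_pointer_cast<Array>(Decompose(f)); // ⟨code, component...⟩
  auto code = (int)dynamic_pointer_cast<Number>(parts->values[0])->v;
  if (code == 0)        // a primitive: record its index into the runtime
    out.push_back({true, PrimInd(f), nullptr});
  else if (code == -1)  // plain data: copy it to the device as-is
    out.push_back({false, 0, f});
  else if (code == 1)   // a compiled block: opaque, cannot run on the GPU
    throw std::runtime_error("cannot decompose a block");
  else                  // a train or modifier application: recurse
    for (int i = 1; i < parts->N(); i++)
      Flatten(parts->values[i], out);
}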
There would also need to be a second version of the runtime compiled for the GPU.
An example Thrust implementation of •_CUDAForAll that only supports primitive operands might look like this:
O<Value> CUDAForAll::call(u8 nargs, std::vector<O<Value>> args) {
// 𝕩 must be an array
auto x = dynamic_pointer_cast<Array>(args[1]);
// 𝕗 sits at index 4 of args under the interpreter's calling convention
auto f = args[4];
// Unwrap the BQN numbers into a host vector, then copy it to the device
auto h_vals = thrust::host_vector<f64>(x->N());
for (int i=0; i<x->N(); i++)
h_vals[i] = dynamic_pointer_cast<Number>(x->values[i])->v;
auto d_vals = thrust::device_vector<f64>(h_vals);
// 𝕗 must be a primitive, since we're using device implementations of runtime
// primitives on the GPU
auto it = std::find(runtime.begin(), runtime.end(), f);
if (it == runtime.end())
throw std::runtime_error("•_CUDAForAll: 𝕗 must be a primitive");
// Find the corresponding GPU-enabled version of the given runtime primitive 𝕗
auto kernel = gpu_runtime[std::distance(runtime.begin(), it)];
// Execute kernel in-place on device
thrust::transform(d_vals.begin(), d_vals.end(), d_vals.begin(), kernel);
// Copy data back to host
h_vals = d_vals;
// Wrap the return values into a BQN value
auto ret = make_shared<Array>(x->N());
for (int i=0; i<x->N(); i++)
ret->values[i] = make_shared<Number>(h_vals[i]);
return ret;
}
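For a concrete picture of what one entry in the gpu_runtime table could hold, the functor below implements a single primitive (monadic ⋆, i.e. e⋆𝕩) with a device-compiled call operator. The struct name and the one-functor-per-primitive layout are assumptions of this sketch, not CXBQN's actual design.
// Hypothetical gpu_runtime entry: a Thrust-compatible functor whose call
// operator is compiled for the device, implementing monadic ⋆ (e to the 𝕩)
struct ExpKernel {
  __host__ __device__
  f64 operator()(f64 x) const { return exp(x); }
};
// Used exactly like `kernel` in the transform call above:
//   thrust::transform(d_vals.begin(), d_vals.end(), d_vals.begin(), ExpKernel{});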
A similar approach could be taken to enable OpenMP or otherwise threaded execution on the CPU.
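Under the same assumptions, a minimal sketch of that path skips the device copies entirely and runs an OpenMP loop over the unwrapped doubles; op stands in for the host-side implementation of the chosen primitive.
// Sketch: apply a host-side unary primitive to every element in parallel.
// `op` is a stand-in for the CPU implementation of the primitive; build
// with -fopenmp (or the CMake equivalent) for the pragma to take effect.
std::vector<f64> OmpForAll(const std::vector<f64>& xs, f64 (*op)(f64)) {
  std::vector<f64> out(xs.size());
  #pragma omp parallel for
  for (long i = 0; i < (long)xs.size(); i++)
    out[i] = op(xs[i]);
  return out;
}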
For this example, I build with GCC 9.3.1, CUDA 11.1.105, and CMake 3.22.0. I also build with readline support to get a nicer REPL.
CC=gcc-9 CXX=g++-9 cmake .. -DCXBQN_CUDA=ON -DCXBQN_READLINE=ON
make -j 12
./BQN -r
3 < •_CUDAForAll ↕10
⟨ 0 0 0 0 1 1 1 1 1 1 ⟩
(↕10) ⋆ •_CUDAForAll ⌽↕10
⟨ 0 1 128 729 1024 625 216 49 8 1 ⟩
(↕10) ⌊ •_CUDAForAll ⌽↕10
⟨ 0 1 2 3 4 4 3 2 1 0 ⟩
As you can see, only very basic arithmetic functions are supported in CUDA at this time.