
OV GPU integration #207

Closed
kurapov-peter opened this issue Aug 1, 2024 · 8 comments · Fixed by #200

@kurapov-peter
Contributor

No description provided.

@dchigarev
Contributor

dchigarev commented Aug 27, 2024

Openvino branch with the current state: https://github.com/dchigarev/openvino/tree/gc-gpu (semi-working)

Current status:
There are two pipelines in OpenVINO that can use our GC-GPU pipeline:

  1. OpenVINO CPU:

    • the data is originally on CPU
    • the mlir module accepts CPU pointers
    • our module copies the data to GPU and the result back to CPU
    • our mlir module is responsible for creating cl::Buffers and cl::CommandQueue

    The GC-GPU integration with the OpenVINO CPU pipeline works: a simple mlir module can be executed and the output result is correct.

  2. OpenVINO GPU (points marked with * are not yet implemented):

    • the data is originally on GPU (a cl::Buffer allocated by OV)
    • *the mlir module accepts pointers to cl_mem/cl::Buffer
    • *the mlir module accepts a pointer to a cl::CommandQueue (which carries the device information)
    • *the mlir module, using our OpenCLRuntimeWrapper, submits our kernel to the received queue

    This part doesn't work yet; it is in progress. The open questions are where in OpenVINO we can obtain the cl::Buffers/cl::CommandQueue and where we should propagate them.

An example of a simple MLIR module that GC receives from OpenVINO:
module @fragment_name {
  func.func @entry(%arg0: memref<64x128xf32>, %arg1: memref<128x128xf32>, %arg2: memref<64x128xf32>) {
    %0 = bufferization.to_tensor %arg0 restrict : memref<64x128xf32>
    %1 = bufferization.to_tensor %arg1 restrict : memref<128x128xf32>
    %2 = tensor.empty() : tensor<64x128xf32>
    %cst = arith.constant 0.000000e+00 : f32
    %3 = linalg.fill ins(%cst : f32) outs(%2 : tensor<64x128xf32>) -> tensor<64x128xf32>
    %4 = linalg.matmul ins(%0, %1 : tensor<64x128xf32>, tensor<128x128xf32>) outs(%3 : tensor<64x128xf32>) -> tensor<64x128xf32>
    %5 = tensor.empty() : tensor<64x128xf32>
    %6 = linalg.add ins(%4, %0 : tensor<64x128xf32>, tensor<64x128xf32>) outs(%5 : tensor<64x128xf32>) -> tensor<64x128xf32>
    bufferization.materialize_in_destination %6 in restrict writable %arg2 : (tensor<64x128xf32>, memref<64x128xf32>) -> ()
    return
  }
}

@kurapov-peter
Contributor Author

I think we can skip the queue passing (#230) for the time being and create a new one each time we submit a kernel. This should still be functional; queue passing can be treated as a separate problem.

@dchigarev
Contributor

dchigarev commented Sep 3, 2024

Current status:

  1. Simple OpenVINO subgraphs are successfully compiled and executed using the GPU pipeline from GC.
  2. GPU buffers allocated on the OpenVINO side are successfully propagated to and consumed* by our openclRuntime (*with hacks that distinguish between USM and cl::Buffer pointers in openclRuntime; see 'Things left to do', item 1, for details).
  3. Working on a set of sanity tests in OpenVINO for the GC integration.

Things left to do:

  1. We need to distinguish between USM and cl::Buffer pointers in our OpenCLRuntimeWrapper::launchKernel, since they must be handled differently when set as kernel arguments (clSetKernelArg for a cl::Buffer, clSetKernelArgMemPointerINTEL for a USM pointer).

    There are two options for how to do this:

    • a. Use clGetMemAllocInfoINTEL to determine whether a given pointer is a USM allocation (requires OpenVINO's cl::CommandQueue and context)
    • b. Define a structure that describes which argument is which and pass it from OpenVINO to our runtime

    For now we're waiting for Support external queue for kernel submission #230 to be completed; then we'll try option 'a'.

  2. Figure out a mechanism for passing OpenVINO's cl::CommandQueue to our openclRuntime (Support external queue for kernel submission #230). The queue can be extracted as dynamic_cast<cldnn::ocl::ocl_stream&>(stream).get_cl_queue() and then propagated to the MLIR module.

  3. Support more linalg operations in the linalg-to-xegpu pass. OpenVINO tends to produce linalg.matmul_transpose_* operations that are not supported by our xegpu lowering. The unsupported operations still execute correctly, but without using xegpu ops.

    For example, for linalg.matmul_transpose_b a plain arith.mulf + arith.addf loop is used instead of the xegpu.dpas intrinsic.

  4. Align our IMEX and GC forks that were created for this integration with upstream.

cc @AndreyPavlenko

@AndreyPavlenko
Contributor

AndreyPavlenko commented Sep 3, 2024

The clGetMemAllocInfoINTEL approach seems unreliable. We have agreed to add extra parameters to the MLIR module and pass the buffer types in those params.

@dchigarev
Contributor

PR to OV with the integration: slyalin/openvino#169

@dchigarev
Contributor

dchigarev commented Sep 10, 2024

Current status:

  1. Submitted a PR to OV with the first version of the GC-GPU integration: slyalin/openvino#169. The PR also includes simple sanity tests for the integration.
  2. @AndreyPavlenko is working on propagating the OpenCL queue and other meta-info from OV to GC (Support external queue for kernel submission #230). A first draft is already present in the integration PR in OV. The second draft is Convert a subset of GPU dialect ops to the OpenCL GPU runtime calls #333: a new pass that converts a subset of the gpu dialect to OpenCL GPU runtime calls.
  3. Submitted the first PRs to GC and IMEX that port the changes required for the integration from our forks to upstream.

Things left to do:

  1. Decide how to propagate to the insert-gpu-allocs pass the information that the input memrefs are already on the GPU:

    1. Approach A - the proper one: assign #gpu.address_space to the input memrefs on the OV side and add logic to the insert-gpu-allocs pass that honors it.

      The problem with this approach is that certain passes of the GPU pipeline fail if the #gpu.address_space attribute is attached to memrefs. A proper solution is to patch the MemrefToSpirv pass in LLVM; a hacky one is to forcibly remove the #gpu.address_space attribute from all memrefs right after the insert-gpu-allocs pass.

    2. Approach B - hacky but simple: add a parameter to the insert-gpu-allocs pass indicating that all input memrefs are already on the GPU, so the pass leaves them alone.

    3. Approach C - get rid of the insert-gpu-allocs pass: add a simple pass that converts all mallocs to GPU USM memory allocations.

    We're leaning toward Approach A for now. Although it looks more difficult, all parts of the approach are already implemented.

    If we choose Approach A, we only need to decide how to fix the GPU pipeline failures caused by #gpu.address_space. Should this be a patch to LLVM, or are we okay with simply removing the address spaces once we're done with them? @AndreyPavlenko @kurapov-peter thoughts?

  2. Decide what we consider a "completed integration". @kurapov-peter, do we want something specific to work via our OpenVINO -> GC pipeline (particular models, and how complex should they be)? Do we want to establish benchmarks in OV?

  3. The GPU buffers on the OV side are often allocated as USM host memory, which may hurt performance. We need to figure out why this happens and what the real impact is.

@kurapov-peter
Contributor Author

Closing as completed
