
OV GPU integration #207

Closed
kurapov-peter opened this issue Aug 1, 2024 · 8 comments · Fixed by #200

@kurapov-peter
Contributor

No description provided.

@dchigarev
Contributor

dchigarev commented Aug 27, 2024

Openvino branch with the current state: https://github.com/dchigarev/openvino/tree/gc-gpu (semi-working)

Current status:
There are two pipelines in OpenVINO that can use our GC-GPU pipeline:

  1. OpenVINO CPU:

    • the data is originally on CPU
    • the mlir module accepts CPU pointers
    • our module copies the data to GPU and the result back to CPU
    • our mlir module is responsible for creating cl::Buffers and cl::CommandQueue

    The GC-GPU integration with the OpenVINO CPU pipeline works: a simple mlir module can be executed and the output result is correct.

  2. OpenVINO GPU (points marked with * are not yet implemented):

    • the data is originally on GPU (a cl::Buffer allocated by OV)
    • *the mlir module accepts pointers to cl_mem/cl::Buffer
    • *the mlir module accepts a pointer to a cl::CommandQueue (which carries the device information)
    • *the mlir module, using our OpenCLRuntimeWrapper, submits our kernel to the received queue

    This part doesn't work yet; it is in progress. The open questions are where in OpenVINO we can obtain the cl::Buffers/cl::CommandQueue and where we should propagate them.

An example of a simple MLIR module that GC receives from OpenVINO:
module @fragment_name {
  func.func @entry(%arg0: memref<64x128xf32>, %arg1: memref<128x128xf32>, %arg2: memref<64x128xf32>) {
    %0 = bufferization.to_tensor %arg0 restrict : memref<64x128xf32>
    %1 = bufferization.to_tensor %arg1 restrict : memref<128x128xf32>
    %2 = tensor.empty() : tensor<64x128xf32>
    %cst = arith.constant 0.000000e+00 : f32
    %3 = linalg.fill ins(%cst : f32) outs(%2 : tensor<64x128xf32>) -> tensor<64x128xf32>
    %4 = linalg.matmul ins(%0, %1 : tensor<64x128xf32>, tensor<128x128xf32>) outs(%3 : tensor<64x128xf32>) -> tensor<64x128xf32>
    %5 = tensor.empty() : tensor<64x128xf32>
    %6 = linalg.add ins(%4, %0 : tensor<64x128xf32>, tensor<64x128xf32>) outs(%5 : tensor<64x128xf32>) -> tensor<64x128xf32>
    bufferization.materialize_in_destination %6 in restrict writable %arg2 : (tensor<64x128xf32>, memref<64x128xf32>) -> ()
    return
  }
}

@kurapov-peter
Contributor Author

I think we can skip the queue passing (#230) for the time being and create a new one each time we submit a kernel. This should still be functional; queue passing can be treated as a separate problem.

@dchigarev
Contributor

dchigarev commented Sep 3, 2024

Current status:

  1. Simple OpenVINO subgraphs are successfully compiled and executed using the GPU pipeline from GC.
  2. GPU buffers allocated on the OpenVINO side are successfully propagated to and consumed* by our openclRuntime (*with hacks that distinguish between USM and cl::Buffer pointers in openclRuntime; see 'Things left to do', item 1, for details).
  3. Working on a set of sanity tests in OpenVINO for the GC integration.

Things left to do:

  1. We need to distinguish between USM and cl::Buffer pointers in our OpenCLRuntimeWrapper::launchKernel, since they must be handled differently when set as kernel arguments (clSetKernelArg for a cl::Buffer, clSetKernelArgMemPointerINTEL for a USM pointer).

    There are two options for how to do this:

    • a. Use clGetMemAllocInfoINTEL to determine whether a given pointer is a USM allocation (requires OpenVINO's cl::CommandQueue and context)
    • b. Define a structure that describes which argument is which and pass it from OpenVINO to our runtime

    For now we're waiting for Support external queue for kernel submission #230 to be completed; then we'll try option 'a'.

  2. Figure out a mechanism for passing OpenVINO's cl::CommandQueue to our openclRuntime (Support external queue for kernel submission #230). The queue can be extracted as dynamic_cast<cldnn::ocl::ocl_stream&>(stream).get_cl_queue() and then propagated to the MLIR module.

  3. Support more linalg operations in the linalg-to-xegpu pass. OpenVINO tends to produce linalg.matmul_transpose_* operations that are not supported by our xegpu lowering. The unsupported operations still execute correctly, but without using xegpu ops.

    For example, for linalg.matmul_transpose_b a plain arith.mulf + arith.addf loop is used instead of the xegpu.dpas intrinsic.

  4. Align our IMEX and GC forks that were created for this integration with upstream.

cc @AndreyPavlenko

@AndreyPavlenko
Contributor

AndreyPavlenko commented Sep 3, 2024

The clGetMemAllocInfoINTEL approach seems unreliable. We have agreed to add extra parameters to the MLIR module and pass the buffer types in those params.

@dchigarev
Contributor

PR to OV with the integration: slyalin/openvino#169

@dchigarev
Contributor

dchigarev commented Sep 10, 2024

Current status:

  1. Submitted a PR to OV with the first version of the GC-GPU integration: slyalin/openvino#169. The PR also includes simple sanity tests for the integration.
  2. @AndreyPavlenko is working on propagating the OpenCL queue and other meta-info from OV to GC (Support external queue for kernel submission #230). A first draft is already present in the integration PR in OV. The second draft is Convert a subset of GPU dialect ops to the OpenCL GPU runtime calls #333: a new pass that converts a subset of the gpu dialect to OpenCL GPU runtime calls.
  3. Submitted the first PRs to GC and IMEX that port the changes required for the integration from our forks to upstream.

Things left to do:

  1. Decide how to propagate to the insert-gpu-allocs pass the information that the input memrefs are already on the GPU:

    1. Approach A - the proper one: assign #gpu.address_space to the input memrefs on the OV side and add logic to the insert-gpu-allocs pass that honors it.

      The problem with this approach is that certain passes of the GPU pipeline fail if the #gpu.address_space attribute is attached to memrefs. A proper solution is to patch the MemrefToSpirv pass in LLVM; a hacky one is to forcibly remove the #gpu.address_space attribute from all memrefs right after the insert-gpu-allocs pass.

    2. Approach B - hacky but simple: add a parameter to the insert-gpu-allocs pass indicating that all input memrefs are already on the GPU, so the pass leaves them alone.

    3. Approach C - get rid of the insert-gpu-allocs pass: add a simple pass that converts all mallocs to GPU USM memory allocations.

    We're leaning toward Approach A for now. Although it looks more difficult, all parts of the approach are already implemented.

    If we choose Approach A, we only need to decide how to fix the GPU pipeline failures caused by #gpu.address_space. Should this be a patch to LLVM, or are we okay with simply removing the address spaces once we're done with them? @AndreyPavlenko @kurapov-peter thoughts?

  2. Decide what we consider a "completed integration". @kurapov-peter, do we want something specific to work via our OpenVINO -> GC pipeline (particular models, and how complex should they be)? Do we want to establish benchmarks in OV?

  3. The GPU buffers on the OV side are often allocated as USM host memory, which may hurt performance. We need to figure out why this happens and what the real impact is.

@kurapov-peter
Contributor Author

Closing as completed
