Definition and shape inference for ONNXParallelOp and ONNXForkOp #2810

Draft · wants to merge 5 commits into base: main
Conversation

@imaihal (Collaborator) commented May 1, 2024

This PR introduces new operations for operator-level parallelization. The overall plan for implementing operator-level parallelization is described in issue #2811; this PR is the first one.

  • ONNXParallelOp
    • Specifies a parallel region. ONNXForkOps should be included in its body region. The threads created by ONNXForkOp are synchronized with the main thread at the end of this operation. Results to be used by subsequent operations must be returned as results of this operation via ONNXYieldOp.
  • ONNXForkOp
    • Creates a thread that runs the operations included in its body region.
%Y0, %Y1 = onnx.Parallel {
   %Y0 = onnx.Fork () -> tensor<> {
      // ONNX operations
      onnx.yield %Y
   }
   %Y1 = onnx.Fork () -> tensor<> {
      // ONNX operations
      onnx.yield %Y
   }
   onnx.yield %Y0, %Y1
}

@AlexandreEichenberger (Collaborator) left a comment
@imaihal very clean interface. A high-level question before I go into the details of the PR: how are the outputs of ONNX.Parallel defined? Is the first output the output of the first fork, and the second output the output of the second fork?

Because in MLIR a parallel op already exists in the scf dialect as the "equivalent" of omp parallel for, maybe we should name this construct ONNX.ParallelForks? Just to avoid future conflicts if we were to introduce a parallel loop.

In terms of OMP, the functionality provided here is closest to omp parallel sections with omp section inside. So that could be an alternative naming scheme (namely ONNX.ParallelSections and ONNX.Section).

@imaihal (Collaborator, Author) commented May 2, 2024

@AlexandreEichenberger Thanks for your comments!

how are the outputs of ONNX.Parallel defined? Is the first output the output of the first fork, and the second output the output of the second fork?

I didn't consider it. The output order of ONNX.Parallel is the same as the order of the inputs to ONNX.Yield. That is, the output of the second fork could be placed as the first output of ONNX.Parallel. Do you think we need to define the order?

Regarding naming, the current ONNX.Parallel is too generic, so I agree with your suggestion to change it. I prefer ONNX.ParallelSections because it clearly expresses that there is a parallel region, but I would like to hear others' opinions.

@AlexandreEichenberger (Collaborator)
I didn't consider it. The output order of ONNX.Parallel is the same as the order of the inputs to ONNX.Yield. That is, the output of the second fork could be placed as the first output of ONNX.Parallel. Do you think we need to define the order?

The way the values of each "yield" get reflected into the outputs of the "parallel" needs to be defined in the text associated with the operations. If the current approach is that the "first" fork/yield produces the "first" output value of the "parallel", and the "second" fork/yield produces the "second" output value, that is perfectly fine. We just need to make sure the forks are not reordered (I am sure they are not), and clearly explain this ordering in the definition/example associated with the new operations.
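To illustrate, a minimal sketch of that positional convention, in the same illustrative pseudocode as the PR description (value names are placeholders):

%R0, %R1 = onnx.Parallel {
   %F0 = onnx.Fork () -> tensor<f32> {
      // first fork: ONNX operations producing %A
      onnx.yield %A
   }
   %F1 = onnx.Fork () -> tensor<f32> {
      // second fork: ONNX operations producing %B
      onnx.yield %B
   }
   // the operand order of the final yield fixes the result order:
   // %R0 is the first fork's value, %R1 is the second fork's value
   onnx.yield %F0, %F1
}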

@tungld (Collaborator) commented May 7, 2024

Thanks for proposing these operations. I have some high-level comments.

%Y0, %Y1 = onnx.Parallel {
   %Y0 = onnx.Fork () -> tensor<> {
      // ONNX operations
      onnx.yield %Y
   }
   %Y1 = onnx.Fork () -> tensor<> {
      // ONNX operations
      onnx.yield %Y
   }
   onnx.yield %Y0, %Y1
}

onnx.Parallel and onnx.Fork do not have inputs, so it looks to me like you assumed global variables in ONNX so that operations inside onnx.Fork can use them directly, rather than via function arguments. In general, it's more difficult to reason about a program with global variables. Is this acceptable in the ONNX specification? E.g., are there any ONNX ops you know of that use global variables?

There is a subtle difference between yield and return. return typically stops the function and exits, while yield is like a generator: it just pauses execution, and the next execution (e.g., the next iteration in a loop) can take the yielded value as input. onnx.Return seems more suitable here.

It looks to me like adding onnx.Parallel and onnx.Fork will change the ONNX model a lot. By introducing regions, we need to move the original ONNX operations into the regions of onnx.Parallel and onnx.Fork, which potentially makes optimizations (e.g., removing a Stick/Unstick pair) difficult. I am also not sure whether the lowering to the krnl dialect is easy or not. Have you considered other approaches, e.g., annotating operations like the ones the ONNX Multi-Device group is working on?

@chentong319 (Collaborator)
The other structural ONNX ops, such as ONNX.If and ONNX.Loop, can read values from outside, but all the values generated inside can be used outside only through yield. Do the new ops have the same behavior?

@chentong319 (Collaborator)
To me, the parallel op is a scheduling decision and should be added in a later phase. It should not be in the onnx dialect if there is another choice. By the way, can we use omp.sections for this purpose?

@imaihal (Collaborator, Author) commented May 10, 2024

@AlexandreEichenberger

The way the values of each "yield" get reflected into the outputs of the "parallel" needs to be defined in the text associated with the operations.

I updated the description of the operations. Thanks.
However, we may need to consider the case where an onnx.Fork op has multiple results. In that case, the i-th result of onnx.Parallel is not necessarily the result of the i-th onnx.Fork; see the sketch below.
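For example (illustrative pseudocode, placeholder names), if the first fork yields two values:

%R0, %R1, %R2 = onnx.Parallel {
   %F0:2 = onnx.Fork () -> (tensor<f32>, tensor<f32>) {
      // fork 0 yields two values
      onnx.yield %A, %B
   }
   %F1 = onnx.Fork () -> tensor<f32> {
      // fork 1 yields one value
      onnx.yield %C
   }
   // %R2 is produced by fork 1: result index no longer equals fork index
   onnx.yield %F0#0, %F0#1, %F1
}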

@imaihal (Collaborator, Author) commented May 10, 2024

@chentong319

The other structural ONNX ops, such as ONNX.If and ONNX.Loop, can read values from outside, but all the values generated inside can be used outside only through yield. Do the new ops have the same behavior?

Yes. ONNXParallelOp and ONNXForkOp have the same behavior.
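A minimal sketch of that behavior (illustrative pseudocode; %X and the Relu op are placeholders):

// %X is defined outside and read directly inside the fork's region
%X = ... : tensor<f32>
%R = onnx.Parallel {
   %F = onnx.Fork () -> tensor<f32> {
      %A = "onnx.Relu"(%X) : (tensor<f32>) -> tensor<f32>
      onnx.yield %A
   }
   onnx.yield %F
}
// values created inside the regions are visible outside only via %R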

To me, the parallel op is a scheduling decision and should be added in a later phase. It should not be in the onnx dialect if there is another choice. By the way, can we use omp.sections for this purpose?

We assume these operations are inserted just before the ONNXToKrnl pass, after optimizations in ONNX IR have been applied. (Currently we are using a rewriting tool for testing: we generate ONNX IR using -EmitONNXIR, then insert ONNXParallelOp and ONNXForkOp with the tool.)

I think omp.section is a memref-level operation, while we wanted tensor-level operations so that we can parallelize at the tensor level. Currently, in PR #2756, we used krnl.parallel at the memref level, reusing the existing OpenMP parallelization, but we should be able to lower these operations to omp.section.
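For reference, a rough sketch of what the corresponding structure in the MLIR OpenMP dialect could look like after bufferization (a hypothetical lowering target, not part of this PR):

omp.sections {
   omp.section {
      // memref-level operations from the first fork
      omp.terminator
   }
   omp.section {
      // memref-level operations from the second fork
      omp.terminator
   }
   omp.terminator
}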

imaihal marked this pull request as draft October 9, 2024 01:21