[Operator] Add vstack op [MooreThreads] #175
dddc647 to 5abf8de
eb04e1d to 5ad2b7c
LGTM
grid = lambda META: (
    triton.cdiv(max_tile_elems, META["BLOCK_SIZE"]),
    scheduled_num_tensors,
)
When the 4 tensors concatenated in one iteration have widely varying numbers of rows, this grid may launch many CTAs that do nothing, since its first dimension is sized by max_tile_elems. Have you tested the performance in this case? Maybe we could sort the tensors by their number of rows.
Also, maybe this strategy only pays off when the number of tensors to vstack is large enough? Still, taking 4 tensors at a time is a good idea compared to a naive one-by-one strategy.
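To make the concern above concrete, here is a rough back-of-the-envelope sketch in plain Python (no Triton required). It assumes the grid shape quoted above, i.e. every tensor in a group gets as many CTAs as the largest one needs, and that each tensor carries its own output offset so the scheduling order can differ from the stacking order. The helper name `launched_and_useful`, the block size, and the example sizes are hypothetical and not taken from the PR.

```python
def cdiv(a, b):
    """Ceiling division, same result as triton.cdiv for positive ints."""
    return (a + b - 1) // b


def launched_and_useful(elem_counts, block_size=1024, group=4):
    """Count launched CTAs vs. CTAs that actually have elements to copy.

    Hypothetical model: tensors are processed `group` at a time and each
    launch uses grid = (cdiv(max_elems_in_group, block_size), group_len).
    """
    launched = 0
    useful = 0
    for start in range(0, len(elem_counts), group):
        chunk = elem_counts[start:start + group]
        # Every tensor in the chunk gets as many CTAs as the largest one.
        launched += cdiv(max(chunk), block_size) * len(chunk)
        # Only this many CTAs have real work to do.
        useful += sum(cdiv(n, block_size) for n in chunk)
    return launched, useful


# Alternating large and small tensors: the worst case for this grid shape.
sizes = [1 << 20, 1024] * 4

unsorted_counts = launched_and_useful(sizes)
sorted_counts = launched_and_useful(sorted(sizes, reverse=True))

print("unsorted (launched, useful):", unsorted_counts)
print("sorted   (launched, useful):", sorted_counts)
```

In this toy case, grouping tensors of similar size removes the idle CTAs entirely. Whether that translates into a measurable speedup on real hardware would still need the benchmark the comment asks for, since idle CTAs that exit immediately are cheap.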
Add vstack operator.
Performance of some cases on NVIDIA A100: