index_select: Optimizing the kernel with reducing for-loops in TensorInfo OffsetCalculator #924

Merged: 9 commits, Sep 26, 2024

@majing921201 majing921201 commented Sep 19, 2024

There are two reasons for the slow performance of index_select:

1. The offset calculation used static loops with a fixed trip count of 12.
2. We used int64_t for the offset index. PVC has no instructions for the long datatype, so a single offset calculation takes about 30us.

This PR makes the following optimizations:

1. Align with CUDA by using a dynamic loop boundary.
2. Optimize the offset calculator.

#816

We got a 2x performance improvement in index_select.


                                               Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg      Self XPU    Self XPU %     XPU total  XPU time avg    # of Calls

                                 aten::index_select        17.34%       2.161ms        41.05%       5.115ms      85.257us      12.734ms       100.00%      12.734ms     212.237us            60
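
To illustrate the two changes described above, here is a minimal sketch of a linear-index-to-offset calculation, first with a fixed 12-iteration loop and 64-bit arithmetic, then with a loop bounded by the actual number of dimensions. It is not the code from this PR: `SimpleTensorInfo`, `make_info`, `offset_static`, `offset_dynamic`, and `kMaxDims` are hypothetical names, and the use of a 32-bit index in the second version is an assumption about what "optimized offset calculator" could mean when the tensor is small enough to index with 32 bits.

```cpp
#include <cstdint>

// Hypothetical stand-in for the maximum rank a TensorInfo-like struct supports.
constexpr int kMaxDims = 12;

// Simplified, hypothetical tensor descriptor (not the repository's TensorInfo).
template <typename IndexT>
struct SimpleTensorInfo {
  IndexT sizes[kMaxDims];
  IndexT strides[kMaxDims];
  int dims;  // number of dimensions actually in use
};

// Fill the first `dims` slots; pad the rest with size 1 and stride 0 so that
// a full-length loop over all kMaxDims slots still yields the right offset.
template <typename IndexT>
SimpleTensorInfo<IndexT> make_info(const IndexT* sizes, const IndexT* strides, int dims) {
  SimpleTensorInfo<IndexT> info{};
  info.dims = dims;
  for (int d = 0; d < kMaxDims; ++d) {
    info.sizes[d] = d < dims ? sizes[d] : 1;
    info.strides[d] = d < dims ? strides[d] : 0;
  }
  return info;
}

// "Before" sketch: always 12 iterations, each with a 64-bit divide and modulo.
inline int64_t offset_static(const SimpleTensorInfo<int64_t>& info, int64_t linear) {
  int64_t offset = 0;
  for (int d = kMaxDims - 1; d >= 0; --d) {
    offset += (linear % info.sizes[d]) * info.strides[d];
    linear /= info.sizes[d];
  }
  return offset;
}

// "After" sketch: iterate only over the dimensions in use, with 32-bit
// arithmetic (assumes the tensor is small enough for 32-bit indexing).
inline uint32_t offset_dynamic(const SimpleTensorInfo<uint32_t>& info, uint32_t linear) {
  uint32_t offset = 0;
  for (int d = info.dims - 1; d >= 0; --d) {
    offset += (linear % info.sizes[d]) * info.strides[d];
    linear /= info.sizes[d];
  }
  return offset;
}
```

For a 2-D tensor the second version performs 2 divide/modulo pairs instead of 12, and on hardware without native 64-bit integer instructions each pair is also cheaper. Both versions return the same offset whenever the index fits in 32 bits.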

@fengyuan14 fengyuan14 changed the title from "Optimized indexing related ops performance" to "index_select: Optimizing the kernel with reducing for-loops in TensorInfo OffsetCalculator" Sep 23, 2024
@fengyuan14 fengyuan14 added this pull request to the merge queue Sep 26, 2024
Merged via the queue into main with commit d9ae62d Sep 26, 2024
3 checks passed
@fengyuan14 fengyuan14 deleted the majing/index branch September 26, 2024 05:19
2 participants