index_select: Optimizing the kernel with reducing for-loops in TensorInfo OffsetCalculator #924

majing921201 · 2024-09-19T08:17:25Z

There are two reasons for the slow index_select performance:

1. The offset calculation used a loop with a static bound of 12 iterations.
2. The offset index used int64_t. PVC has no native instructions for the long datatype, so a single offset calculation takes about 30us.

This PR therefore makes the following optimizations:
1. Align with CUDA by using a dynamic loop boundary.
2. Optimize the offset calculator.
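
The two changes can be illustrated with a minimal sketch. This is not the code from this PR: `TensorInfoSketch`, `kMaxDims`, `offset_static` and `offset_dynamic` are hypothetical names, and the use of 32-bit index arithmetic in the second function is an assumption that only holds when the tensor is small enough for 32-bit offsets.

```cpp
#include <cstdint>

// Hypothetical stand-in for a TensorInfo-like struct (not the torch-xpu-ops type).
constexpr int kMaxDims = 12;  // assumed static upper bound on tensor rank

struct TensorInfoSketch {
  int dims;                    // actual rank of the tensor
  uint32_t sizes[kMaxDims];    // extent of each dimension
  uint32_t strides[kMaxDims];  // stride of each dimension, in elements
};

// "Before" sketch: the loop always runs kMaxDims iterations and uses
// 64-bit division and modulo, even for low-rank tensors.
int64_t offset_static(const TensorInfoSketch& info, int64_t linear) {
  int64_t offset = 0;
  for (int i = kMaxDims - 1; i >= 0; --i) {
    if (i < info.dims) {
      const int64_t size = info.sizes[i];
      offset += (linear % size) * static_cast<int64_t>(info.strides[i]);
      linear /= size;
    }
  }
  return offset;
}

// "After" sketch: the loop bound is the actual rank, and the arithmetic
// is 32-bit (assumption: all offsets fit in 32 bits).
uint32_t offset_dynamic(const TensorInfoSketch& info, uint32_t linear) {
  uint32_t offset = 0;
  for (int i = info.dims - 1; i >= 0; --i) {
    offset += (linear % info.sizes[i]) * info.strides[i];
    linear /= info.sizes[i];
  }
  return offset;
}
```

Both functions convert a linear element index into a strided memory offset and return the same value for any index that fits in 32 bits. The second one skips the unused iterations and avoids 64-bit division.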

Related issue: #816

This gives a 2x performance improvement in index_select:

                                               Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg      Self XPU    Self XPU %     XPU total  XPU time avg    # of Calls

                                 aten::index_select        17.34%       2.161ms        41.05%       5.115ms      85.257us      12.734ms       100.00%      12.734ms     212.237us            60

Signed-off-by: majing <[email protected]>

majing921201 added 2 commits September 19, 2024 08:15

Optimized indexing related ops performance

3558073

Signed-off-by: majing <[email protected]>

Remove code

95b0798

Signed-off-by: majing <[email protected]>

majing921201 mentioned this pull request Sep 19, 2024

[BF16]For LayoutLMForSequenceClassification model on stock pytorch, index_select cost time on pvc-1100 worse than A100 * ratio #816

Open

majing921201 and others added 4 commits September 19, 2024 08:50

Revert "Remove code"

682a330

This reverts commit 95b0798.

Revert "Optimized indexing related ops performance"

da7692e

This reverts commit 3558073.

Optimized offset calculator in indexing ops

94937b0

Signed-off-by: majing <[email protected]>

Merge branch 'main' into majing/index

5f59473

fengyuan14 changed the title ~~Optimized indexing related ops performance~~ index_select: Optimizing the kernel with reducing for-loops in TensorInfo OffsetCalculator Sep 23, 2024

majing921201 added 3 commits September 24, 2024 05:46

Skip failed case dataloader

7ae3668

Signed-off-by: majing <[email protected]>

Merge branch 'main' of https://github.com/intel/torch-xpu-ops into majing/index

ef51a8b

Add comments

c80b6c0

Signed-off-by: majing <[email protected]>

fengyuan14 approved these changes Sep 26, 2024

fengyuan14 added this pull request to the merge queue Sep 26, 2024

Merged via the queue into main with commit d9ae62d Sep 26, 2024
3 checks passed

fengyuan14 deleted the majing/index branch September 26, 2024 05:19
