bf16 matmul's corresponding tensor.pack not properly optimized #320

Open
yifeizh2 opened this issue Sep 5, 2024 · 4 comments
Labels: performance Speedup expected

yifeizh2 (Contributor) commented Sep 5, 2024

Currently, the following two single-layer MLP configurations show worse performance compared with GC v1.

| dtype | batch size | hidden list | GC v1 | 8c55a05 (remove brgemm read lock) |
| --- | --- | --- | --- | --- |
| bf16 | 128 | 1024x1024 | 0.0286 | 0.0828 |
| bf16 | 128 | 1024x512 | 0.0204 | 0.0670 |

We performed a detailed breakdown as follows:

| 128x1024x1024 | GC v1 | 8c55a05 |
| --- | --- | --- |
| matmul only | 0.01766 | 0.01989 |
| tiled pack (or reorder) | 0.02634 | 0.04632 |
| total | 0.04418 | 0.077969 |

and

| 128x1024x512 | GC v1 | 8c55a05 |
| --- | --- | --- |
| matmul only | 0.01587 | 0.01591 |
| tiled pack (or reorder) | 0.01278 | 0.0398 |
| total | 0.02881 | 0.06917 |

Are there any further optimization opportunities for the VNNI pack?
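For context, the pack in question is the pre-packing of the bf16 weight into a blocked VNNI layout ahead of the matmul. Below is a minimal sketch of what such a pack looks like with upstream MLIR's tensor.pack, assuming a 1024x1024 bf16 weight, a 32x32 block size, and a VNNI factor of 2; the SSA names, init tensors, and tile sizes are illustrative, and the tiles the compiler actually chooses may differ.

```mlir
// Step 1 (illustrative tile sizes): block the KxN weight into 32x32 tiles,
// N-major outer order, giving [N/32, K/32, 32, 32].
%blocked = tensor.pack %weight outer_dims_perm = [1, 0]
    inner_dims_pos = [0, 1] inner_tiles = [32, 32] into %blocked_init
    : tensor<1024x1024xbf16> -> tensor<32x32x32x32xbf16>

// Step 2: pack the inner K dimension by the VNNI factor of 2,
// giving [N/32, K/32, 16, 32, 2], the layout a bf16 brgemm consumes.
%vnni = tensor.pack %blocked inner_dims_pos = [2] inner_tiles = [2]
    into %vnni_init
    : tensor<32x32x32x32xbf16> -> tensor<32x32x16x32x2xbf16>
```

The breakdown above suggests the regression is dominated by this data movement rather than by the matmul itself.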

@yifeizh2 yifeizh2 added the bug Something isn't working label Sep 5, 2024
BRUCE11111 (Contributor) commented Sep 5, 2024

VNNI reorder is on my to-do list. However, the current priority is to merge the physical-register pass and the corresponding vector-based op fusion under static shapes into master as soon as possible (within two weeks), then to support dynamic shapes for the sake of another issue, and only then to optimize specific ops such as the VNNI reorder at the instruction level. I can switch priorities if there is a more urgent need.

ZhennanQin (Contributor) commented:
I guess those VNNI reorders can be folded away if we have constant weight cache support? @niuxiaog Can you try to enable the weight cache for both bench-gc and the OV integration?
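Since the MLP weight is constant at inference time, the pack has no runtime inputs, so a constant weight cache could materialize the packed weight once and drop the pack from the per-call hot path. A hypothetical before/after sketch of the first blocking step (placeholder constant values and illustrative names; the same reasoning applies to the VNNI step):

```mlir
// Before: the pack of the constant weight re-runs on every call.
%w      = arith.constant dense<1.0> : tensor<1024x1024xbf16>   // placeholder value
%w_init = tensor.empty() : tensor<32x32x32x32xbf16>
%w_blk  = tensor.pack %w outer_dims_perm = [1, 0] inner_dims_pos = [0, 1]
    inner_tiles = [32, 32] into %w_init
    : tensor<1024x1024xbf16> -> tensor<32x32x32x32xbf16>

// After folding / weight caching: the packed weight is itself a cached
// constant, so no tensor.pack is left at runtime.
%w_packed = arith.constant dense<1.0> : tensor<32x32x32x32xbf16>  // placeholder value
```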

niuxiaog (Contributor) commented Sep 5, 2024

I'm working on enabling it with OV and may finish this week. For bench-gc, maybe next week.

@lmontigny lmontigny added this to the 0.1 CPU - Performance tuning milestone Sep 5, 2024
@yifeizh2 yifeizh2 added performance Speedup expected and removed bug Something isn't working labels Sep 10, 2024
lmontigny commented:
waiting for dynamic shape
