Recalibrate _AXPY #1887
Conversation
Justification and input.
I have not yet looked at this in detail, but I noticed that (a) the proposed default for everything including "consumer" hardware exceeds the current limit for an IBM mainframe, and (b) your thresholds look as if you are counting bytes rather than vector elements - obviously the two are related, but I find it surprising that a vector of 512k floats would be processed faster in a single thread.
N is in elements; for the axpy benchmark the threshold works out to 2MB (the cache I have) for S, D and Z, and 1MB for C. The gain from threading at all is not that big, while the damage from using all 4 threads is quite big: a roughly 20% slowdown from not threading is nothing compared to the cases where threading eventually becomes a 10x problem. Wikipedia says the z13 also has 2MB of per-CPU cache, so it is pretty close to my laptop CPU; maybe the threshold can/should be unified by acknowledging the memory effect? 512k*4B fits perfectly in cache (just once, not 3x).
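As a rough illustration of the kind of guard being discussed, here is a minimal sketch of deriving the single-thread decision from the element count and a per-core cache budget. `CACHE_BYTE_BUDGET`, `daxpy_kernel` and `daxpy_dispatch` are hypothetical names for illustration only, not the actual OpenBLAS interface/axpy.c code:

```c
#include <stddef.h>

/* Hypothetical per-core cache budget in bytes (the 2MB figure discussed above). */
#define CACHE_BYTE_BUDGET (2 * 1024 * 1024)

/* Trivial stand-in for the single-threaded DAXPY kernel: y := alpha * x + y. */
static void daxpy_kernel(size_t n, double alpha, const double *x, double *y)
{
    for (size_t i = 0; i < n; i++)
        y[i] += alpha * x[i];
}

/* Hypothetical dispatcher: decide on threading from the element count n.
 * A DAXPY touches two vectors of n doubles, so its working set is about
 * 2 * n * sizeof(double) bytes; stay single-threaded while that still
 * fits in one core's cache, since threading only pays off once the data
 * no longer fits and memory bandwidth dominates anyway. */
void daxpy_dispatch(size_t n, double alpha, const double *x, double *y, int nthreads)
{
    size_t working_set = 2 * n * sizeof(double);

    if (nthreads <= 1 || working_set <= CACHE_BYTE_BUDGET) {
        daxpy_kernel(n, alpha, x, y);          /* small vectors: one thread */
    } else {
        /* Split [0, n) into roughly equal chunks, one per thread.
         * A real implementation would run these chunks concurrently;
         * here they run sequentially to keep the sketch short. */
        size_t chunk = (n + (size_t)nthreads - 1) / (size_t)nthreads;
        for (size_t start = 0; start < n; start += chunk) {
            size_t len = (n - start < chunk) ? (n - start) : chunk;
            daxpy_kernel(len, alpha, x + start, y + start);
        }
    }
}
```

The point of the sketch is only the guard condition: the threshold is expressed in bytes (cache size) but applied to a working set computed from the element count, which is the distinction raised in the comments above.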
Remove all threading guards in interface/axpy.c
Something weird: the Atom's hyperthreads improve speed, when in practice they should make it worse.
I could not come up with anything much better:
Let's hold this back for the upcoming release.
#1886 #1883