
Recalibrate _AXPY #1887

Closed
wants to merge 2 commits into from

Conversation

brada4 (Contributor) commented Nov 26, 2018


Justification and input data:
I basically rigged the benchmark's size loop from `+= step` into `<< 1` (doubling).
The threshold is approximate, and set a magnitude higher to bypass the low-performance area for the 3rd thread (it just looks more impressive with 4 threads; with 3 it is still less than half the speed of the first thread).

_axpy.zip

martin-frbg (Collaborator) commented

I have not yet looked at this in detail, but I noticed that (a) the proposed default for everything, including "consumer" hardware, exceeds the current limit for an IBM mainframe, and (b) your thresholds look as if you are counting bytes rather than vector elements. Obviously the two are related, but I find it surprising that a vector of 512k floats would be processed faster in a single thread.

brada4 (Contributor, author) commented Nov 27, 2018

N is in elements for the axpy benchmark; the threshold is 2 MB (the cache size I have) for S, D and Z, and 1 MB for C.

The gain from threading at all is not so big, while the damage from using all 4 threads is quite big: a roughly 20% slowdown from not threading is nothing in a place where it otherwise becomes an eventual 10x problem.

Wikipedia says the z13 also has 2 MB of per-CPU cache, so it is pretty close to my laptop CPU; maybe the thresholds can/should be unified, acknowledging the memory effect?

512k elements * 4 bytes fits perfectly in cache (just once, not 3x).

brada4 (Contributor, author) commented Nov 27, 2018

Remove all threading guards in interface/axpy.c
Rig the benchmark increment loop into a left shift in benchmark/axpy.c
Measurement says better safe (1 thread above 2.1 MB) than sorry (4 threads below 1 MB)

#!/bin/sh
for a in s d c z ; do
        for b in 1 2 4 ; do
                OPENBLAS_LOOPS=97 OPENBLAS_NUM_THREADS=${b} ./${a}axpy.goto 1 35000000 1 2>&1 | tee ${a}axpy${b}.out
        done
done

brada4 (Contributor, author) commented Nov 28, 2018

Something weird: on Atom, hyperthreads improve speed, when in practice they should make it worse.
Could it be that the l1_thread work splitting should align the base address for subsequent threads? I will experiment with that idea; do not merge this yet.

brada4 (Contributor, author) commented Nov 29, 2018

I could not invent anything much better:
4 bytes goes 16 times into today's maximum practical D-cache line, so I round the element count up to a multiple of 16, which yields a minimum per-CPU chunk of 64 bytes for S, 128 bytes for D and C, and 256 bytes for Z (and something not so round for X).
It is better than nothing at first sight, but it presumes the arguments are aligned. In the future it can be used to fix alignment and use exact cache lines (at present those are not known at that point in the code, so it is not so easy).
I will spend some time doing this properly.

brada4 (Contributor, author) commented Nov 30, 2018

Let's hold this back until after the upcoming release.

brada4 closed this Nov 30, 2018