
Recalibrate _AXPY #1887

Closed
wants to merge 2 commits into from

Conversation

brada4 (Contributor) commented Nov 26, 2018


Justification and input data:
I basically rigged the benchmark's size loop from `+= step` into `<< 1` (doubling).
The threshold is approximate, and set a magnitude higher to bypass the low-performance area for the 3rd thread (it just looks more impressive with 4 threads; with 3 it is still less than half the speed of the first thread).

_axpy.zip

martin-frbg (Collaborator) commented

I have not yet looked at this in detail, but I noticed that (a) the proposed default for everything, including "consumer" hardware, exceeds the current limit for an IBM mainframe, and (b) your thresholds look as if you are counting bytes rather than vector elements. Obviously the two are related, but I find it surprising that a vector of 512k floats would be processed faster in a single thread.

brada4 (Contributor, author) commented Nov 27, 2018

N is in elements for the axpy benchmark; the threshold is 2 MB (the cache size I have) for S, D and Z, and 1 MB for C.

The gain from threading at all is not so big, while the damage from using all 4 threads is quite big: a roughly 20% slowdown from not threading is nothing in a place where it otherwise becomes an eventual 10x problem.

Wikipedia says the z13 also has 2 MB of per-CPU cache, so it is pretty close to my laptop CPU; maybe the thresholds can/should be unified, acknowledging the memory effect?

512k elements * 4 bytes fits perfectly in cache (just once, not 3x).

brada4 (Contributor, author) commented Nov 27, 2018

Remove all threading guards in interface/axpy.c
Rig the benchmark increment loop into a left shift in benchmark/axpy.c
Measurement says better safe (1 thread above 2.1 MB) than sorry (4 threads below 1 MB)

#!/bin/sh
for a in s d c z ; do
        for b in 1 2 4 ; do
                OPENBLAS_LOOPS=97 OPENBLAS_NUM_THREADS=${b} ./${a}axpy.goto 1 35000000 1 2>&1 | tee ${a}axpy${b}.out
        done
done

brada4 (Contributor, author) commented Nov 28, 2018

Something weird: on Atom, hyperthreads improve speed, when in practice they should make it worse.
Could it be that the l1_thread work splitting should align the base address for subsequent threads? I will experiment with that idea; do not merge this yet.

brada4 (Contributor, author) commented Nov 29, 2018

I could not invent anything much better:
4 bytes goes 16 times into today's maximum practical D-cache line, so I round the element count up to a multiple of 16, which yields a minimum per-CPU chunk of 64 bytes for S, 128 bytes for D and C, and 256 bytes for Z (and something not so round for X).
It is better than nothing at first sight, but it presumes the arguments are aligned. In the future it can be used to fix alignment and use exact cache lines (at present those are not known at that point in the code, so it is not so easy).
I will spend some time doing this properly.

brada4 (Contributor, author) commented Nov 30, 2018

Let's hold this back until after the upcoming release.

brada4 closed this Nov 30, 2018