BLIS evaluation
- weekly update meetings on Thursdays at 13:15 UTC (via Zoom: https://tiny.cc/eb_conf_call)
- collect info & scripts in https://github.com/easybuilders/blis-eval
- just push to the `main` branch, no PRs needed
- BLIS + libFLAME (LAPACK)
- `gobff` vs `foss`
- `iibff` vs `intel`
- also FFTW?
(meeting cancelled)
- Kenneth
- looked at FlexiBLAS a bit, half-working easyblock + easyconfig for it
- we could use FlexiBLAS as a toolchain component?
- collapses
- pick BLAS/LAPACK to use at runtime, but this is a global setting (via `$FLEXIBLAS`) - see the sketch below
- also complicates testing: for which BLAS backends do we run the numpy tests, for example?
- Bart: single-threaded performance is probably more important than multi-threaded for BLAS?
- also supports profiling which BLAS/LAPACK functions are used
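A minimal sketch of the runtime-selection idea, assuming numpy is linked against FlexiBLAS and that an "OPENBLAS" backend is registered in the FlexiBLAS configuration (the backend name is illustrative; only the `$FLEXIBLAS` variable itself comes from the notes above):

```python
import os

# must be set before the BLAS library is loaded, i.e. before importing numpy
os.environ['FLEXIBLAS'] = 'OPENBLAS'

import numpy as np

a = np.random.rand(2000, 2000)
b = np.random.rand(2000, 2000)
print((a @ b).shape)  # this dgemm is routed through the selected backend
```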
- Jure: BLAS testing, all BLAS functions available in `blas-tester` evaluated - see https://github.com/easybuilders/blis-eval/blob/main/jpecar
- we should come up with a way to compare/plot the results
- Sam: CP2K benchmarking
- H2O-128 benchmark
- 30% performance hit with `goblf` vs `foss` (on Intel Skylake)...
- some profiling done, seems to be mostly `dgemm`?
- ~10% difference with direct `dgemm` benchmark (matrix size 1k-8k) - mostly agrees with Jure's results
- ~20% with larger matrices (matrix size 10k)
- so 30% perf hit mostly due to dgemm?
- could look into FlexiBLAS profiling support to figure out dgemm matrix sizes used by CP2K
- should try to reproduce this via `numpy` (see the sketch below), and also check on AMD Rome...
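A hedged sketch of what the numpy reproduction could look like; the 1k-8k sizes follow the range mentioned above, but the simple wall-clock timing is an assumption, not the benchmark actually used:

```python
import time
import numpy as np

for n in (1000, 2000, 4000, 8000):
    a = np.random.rand(n, n)
    b = np.random.rand(n, n)
    a @ b  # warm-up run
    start = time.perf_counter()
    a @ b
    elapsed = time.perf_counter() - start
    # dgemm does ~2*n^3 floating-point operations
    print(f"n={n}: {elapsed:.3f} s, {2.0 * n**3 / elapsed / 1e9:.1f} GFLOPS")
```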
- next meeting: Thu April 8th at 2.15pm CEST
- Åke
- BLAS testing
- IEEE signalling issue reported to BLIS: https://github.com/flame/blis/issues/486
- Only happens on Broadwell (BLIS compiles for haswell)
- Skylake and AMD EPYC (zen2) are ok
- LAPACK testing (using BLIS only)
- see https://github.com/easybuilders/blis-eval/blob/main/ake/blas-correctness-test/Test_results.md
- More results from Skylake and AMD EPYC (zen2)
- Skylake has more errors than Broadwell
- AMD (zen2) has the same amount as Broadwell
- LAPACK testing with libFlame (and refblas)
- libFLAME doesn't contain all the functions needed, so we have to link with the reference LAPACK library too.
- first test `xlintsts < stest.in` causes "Segmentation fault - invalid memory reference."
- Kenneth
- numpy benchmarking: https://github.com/easybuilders/blis-eval/tree/main/apps/python
- OpenBLAS better than BLIS on low core counts (except 1)
- MKL is very jumpy on AMD (core pinning?)
- need to re-check pinning for MKL, and also check without `$OMP_*` for pinning (see the pinning check sketch below)
- Maxim: `export KMP_AFFINITY=granularity=fine,scatter`
- toolchain with AMD forks of BLIS/libFLAME/ScaLAPACK/FFTW: `gobff/2021.03-amd`
- numpy tests keep failing
- TODO:
- non-x86_64 (Arm, POWER)
- complete results
- better pinning for MKL
- other functions?
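A possible pinning check (threadpoolctl is an extra assumption here, not something used in the repo): inspect which BLAS numpy actually loaded, how many threads it uses, and which cores the process is allowed to run on:

```python
import os
import numpy as np
from threadpoolctl import threadpool_info

np.ones((2, 2)) @ np.ones((2, 2))  # make sure the BLAS library is actually loaded

# report the detected BLAS/OpenMP thread pools (MKL, OpenBLAS, BLIS, ...)
for pool in threadpool_info():
    print(pool.get('internal_api'), pool.get('filepath'), 'threads:', pool.get('num_threads'))

print('KMP_AFFINITY:', os.environ.get('KMP_AFFINITY'))
print('allowed CPUs:', sorted(os.sched_getaffinity(0)))  # Linux only
```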
- Maxim
- very similar results for `foss/2020b` and `gobff/2020b` on Broadwell/Haswell
- not sure if these runs are reliable, too short
- Sebastian
- testing with AMD forks (on top of JSC toolchains)
- libFLAME issue doesn't seem to be fixed
- problem reported with eigensolver in libFLAME, not fixed in AMD's fork
- naming of upstream BLIS vs AMD BLIS
- standard BLIS: `libblis.a` is multi-threaded when MT is enabled (or serial when MT is not enabled)
- AMD's fork uses a suffix for the MT build: `libblis-mt.a`...
- should we rename multi-threaded standard BLIS?
- assess perf. diff of serial vs multi-threaded BLIS
- HMNS would have to be changed too, to allow multiple different BLAS libraries in the same "branch"
- clash of module names for `gomkl` vs `foss` installations - two possible solutions:
- add another level in the hierarchy for BLAS+LAPACK+FFTW
- copy `HierarchicalMNS` to a new MNS that adds an extra level
- use `versionsuffix` to discriminate between the default BLAS library (e.g. BLIS) and others (e.g. `-mkl`)
- "fork" `HierarchicalMNS` to customize the module name (add `-mkl` or `-blis`) - see the sketch below
- "fork"
- add another level in the hierarchy for BLAS+LAPACK+FFTW
- clash of module names for
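A hedged sketch of the "fork `HierarchicalMNS`" option: a custom module naming scheme that appends a suffix for MKL-based toolchains so their modules no longer clash with the (default) BLIS-based ones. The class and method names follow EasyBuild's module naming scheme API, but this is untested and only illustrative:

```python
from easybuild.tools.module_naming_scheme.hierarchical_mns import HierarchicalMNS

# assumption: which toolchains count as "MKL-based" for naming purposes
MKL_TOOLCHAINS = ['gomkl', 'iomkl', 'intel']


class BLASAwareHierarchicalMNS(HierarchicalMNS):
    """HierarchicalMNS variant that tags modules built with an MKL-based toolchain."""

    def det_short_module_name(self, ec):
        modname = super(BLASAwareHierarchicalMNS, self).det_short_module_name(ec)
        if ec['toolchain']['name'] in MKL_TOOLCHAINS:
            modname += '-mkl'
        return modname
```

A scheme like this could presumably be activated via `--include-module-naming-schemes` plus `--module-naming-scheme`; the "extra hierarchy level" alternative would instead override `det_modpath_extensions`.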
- Sam
- CP2K with goblf (BLIS, LAPACK, no libFLAME):
- fixes all extra failed tests (summary now looks exactly the same as with foss)
- performance tests underway - need to use `$BLIS_NUM_THREADS`?
- 30% slower on Skylake (4 cores) at first sight?
- CP2K `popt` with `$OMP_NUM_THREADS`
- Kenneth
- failing numpy tests
- building BLIS differently doesn't help, same tests fail
- run `make test` (rather than `make check`, which is only a minimal test suite)
- `toolchainopts = {'optarch': False, 'vectorize': False, 'lowopt': True, 'strict': True}`
- `buildopts = 'ENABLE_VERBOSE=yes'` (verbose build output)
- SciPy-bundle changes
- `toolchainopts = {'vectorize': False}`
- should also try with:
- `'noopt': True` => `-O0`
- `'debug': True` (see the combined snippet after this list)
- cause of these numerical problems is unclear...
- could use GCC's address sanitizer feature (ASan) to find uninitialized variables
- FlexiBLAS: took a brief look at this, have something working on top of OpenBLAS+BLIS
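A combined easyconfig snippet for the SciPy-bundle debugging builds; all keys are standard EasyBuild toolchain options, but the exact combination to use is still open:

```python
# stricter, unoptimized build settings to try ('noopt' => -O0)
toolchainopts = {'vectorize': False, 'noopt': True, 'debug': True, 'strict': True}
```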
- Sam
- CP2K
- see https://github.com/easybuilders/blis-eval/blob/main/apps/cp2k/debug.md for GDB session to deep-dive into segfaulting CP2K run
- looks like it may be uninitialized value in libFLAME?
- should try to:
- also build libFLAME with stricter compilation options (incl. `noopt`)
- try using LAPACK rather than libFLAME (`goblf/2020b`)
- Åke has some patches for LAPACK that fix correctness issues
- see also https://github.com/akesandgren/lapack.git
- also run LAPACK test suite on top of libFLAME: https://github.com/easybuilders/blis-eval/tree/main/ake/blas-correctness-test
- same for numpy?
- BLAS-Tester
- we need to control the matrix sizes (`-N`) and target FLOPS (`-F`) - see https://github.com/easybuilders/blis-eval/blob/main/jpecar/run-blas-tester.sh and the sweep sketch below
- Jure is not checking for failing tests yet, will do
- current runs vary a lot, too small?
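A rough sketch of such a size sweep; the `-N`/`-F` flags come from the notes above, but the binary name and exact argument syntax are assumptions to be checked against run-blas-tester.sh and the tester's usage output:

```python
import subprocess

TESTER = './bin/xdl3blastst'   # assumed level-3 double-precision tester binary
MIN_MFLOPS = '20000'           # illustrative amount of work to force per case via -F

for n in (1000, 2000, 4000, 8000, 10000):
    # '-N start stop inc' range syntax is an assumption
    cmd = [TESTER, '-N', str(n), str(n), '1', '-F', MIN_MFLOPS]
    print('running:', ' '.join(cmd))
    subprocess.run(cmd, check=True)
```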
- Sebastian
- BLAS-3 results on Zen2 with `gobff/2020b` vs `foss/2020b` + `gomkl/2020b`
- single-threaded
- BLIS and OpenBLAS are very close
- large gap with MKL (+20% slower), but a lot better than with older MKL versions
- see https://github.com/easybuilders/blis-eval/blob/main/low-level/blas3/plots/jurecadc_zen2/l3_perf_zen2_nt1.pdf
- clearly better than https://github.com/flame/blis/blob/master/docs/graphs/large/l3_perf_zen2_nt1.pdf
- multi-threaded (single socket)
- BLIS is a lot better than OpenBLAS, and to a lesser extent better than MKL
- https://github.com/easybuilders/blis-eval/blob/main/low-level/blas3/plots/jurecadc_zen2/l3_perf_zen2_jc4ic4jr4_nt64.pdf
- what is our end goal for this?
- EB Tech Talk?
- paper?
- systems with diff. archs: x86_64 (Intel, AMD), Arm64 (Graviton2, A64FX), POWER9
- low-level benchmarks + apps like CP2K, numpy (once we figure out correctness testing)
- Kenneth (pre-meeting notes)
- BLIS test step:
- we run `make check`, which only runs a lightweight test (`checkblis-fast checkblas`, <1min)
- we should run `make test`, which runs a slightly longer test (~5min on Haswell) - see the snippet below
- see also https://github.com/flame/blis/blob/master/docs/BuildSystem.md#step-3b-testing-optional + https://github.com/flame/blis/blob/master/docs/Testsuite.md
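One possible way to switch to the longer test suite is the standard `runtest` easyconfig parameter; untested sketch, assuming the BLIS easyblock runs `make <runtest>` in its test step like ConfigureMake does:

```python
# in the BLIS easyconfig: run 'make test' instead of the minimal 'make check'
runtest = 'test'
```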
- numpy (see https://github.com/easybuilders/blis-eval/tree/main/apps/python)
- handful of tests in the `numpy` test suite fail with `gobff/2020b` and `iibff/2020b`
- not with `foss/2020b` or `intel/2020b`, so due to BLIS+libFLAME?
- same problem with `numpy` 1.19.4 (SciPy-bundle 2020.11) and latest `numpy` 1.20.1
- same problem with `gobff/2020.11` (BLIS version of `foss/2020a`)
- same problem on Intel Skylake and AMD Rome
- Åke: will test without -fno-math-errno to see if that makes a difference
- relevant links:
- numpy support for BLIS: https://github.com/numpy/numpy/issues/7372
- (closed) issues reporting similar test failures (`TestRandomDist.test_multivariate_normal[eigh]`): https://github.com/numpy/numpy/issues/15546, https://github.com/numpy/numpy/issues/16567
- TODO:
- Are others seeing the same test failures?
- Installing with `--skip-test-step`, reproduce problem outside of the `numpy` tests? (see the sketch below)
- Open issue in `numpy` and/or BLIS repo(s)?
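A hedged sketch for exercising the suspect code path outside the numpy test suite, based on the failing test name above (`TestRandomDist.test_multivariate_normal[eigh]`); the sizes and the sanity check are illustrative, not those of the actual test:

```python
import numpy as np

rng = np.random.default_rng(12345)
mean = np.zeros(3)
cov = np.array([[2.0, 0.5, 0.0],
                [0.5, 1.0, 0.3],
                [0.0, 0.3, 1.5]])

# method='eigh' routes through the symmetric eigensolver, the area suspected above
samples = rng.multivariate_normal(mean, cov, size=200000, method='eigh')
print("sample mean:", samples.mean(axis=0))
print("sample cov:\n", np.cov(samples, rowvar=False))
```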
- larger matrix sizes for dgemm tests on JUWELS Skylake by Sebastian
- see https://github.com/easybuilders/blis-eval/blob/main/low-level/dgemm/eval/juwels/dgemm-juwels.ipynb
- added results for AMD Rome 7742
- Single core performance good for all 3 implementations ~50 Gflops
- BLIS is fastest at socket/node level
- BLAS 3 BLIS tests by Sebastian
- https://github.com/flame/blis/tree/master/test/3
- running on JUWELS (Skylake), still in progress, need to make plots
- Sam: testing CP2K
- regression tests: more failures with `gobff/2020b` than with `foss/2020b` (?)
- 80 tests fail with segmentation faults, see https://github.com/easybuilders/blis-eval/tree/main/apps/cp2k
- Åke: more correctness testing - no progress this week
- BLAS-Tester tool: see notes from Åke below
- Sam ran them, correctness tests all passed, performance is a mixed bag
- Jure will run them too
- stuff to look into:
- HPL on a couple of systems => Bart: still working on this, need to figure out optimal MPI/OpenMP configuration for BLIS on Intel.
- script to collect system info => Jure: https://github.com/easybuilders/blis-eval/blob/main/jpecar/checkenv.sh - everyone should run it and see if it works; can also use `lstopo` output
- numpy tests by Kenneth, see https://github.com/easybuilders/blis-eval/tree/main/apps/python
- failing numpy tests
- 5 tests fail with `gobff/2020b`
- 8 tests fail with `iibff/2020b`
- need to double check how numpy was compiled on top of BLIS...
- also check with numpy built without optimizations (`-O0`)
- larger matrix sizes for dgemm tests on JUWELS Skylake by Sebastian
- BLAS 3 BLIS tests by Sebastian
- https://github.com/flame/blis/tree/master/test/3
- running on JUWELS (Skylake), still running (~1h)
- Sam: testing CP2K
- regression tests: same failures with `gobff/2020b` as with `foss/2020b` (?)
- Åke: more correctness testing
- see https://github.com/easybuilders/blis-eval/tree/main/ake/blas-correctness-test
- BLIS results look pretty good, better than OpenBLAS
- need to take a detailed look at how bad the failures are
- BLAS-Tester tool: see notes from Åke below
- does it make sense to compile BLIS with `-march=native` enabled?
enabled? -
take a closer look at FlexiBLAS
- https://github.com/mpimd-csc/flexiblas - https://www.mpi-magdeburg.mpg.de/projects/flexiblas
- seems like a good choice for BLAS/LAPACK lib in toolchains?
- has some cool features, like profiling
- stuff to look into:
- look into Åke's BLAS/LAPACK correctness stuff on AMD Rome => Kenneth
- check failing numpy tests => Kenneth
- BLAS 3 tests on JUWELS (Skylake, AMD Rome) => Sebastian
- study failing LAPACK tests a bit better + open issue to BLIS on this => Åke
- CP2K => Sam?
- HPL on a couple of systems => Bart
- stick to single socket to remove effects of interconnect, etc.
- HPL parameters need to be tweaked for different BLAS libraries...
- script to collect system info => Jure
- new BLIS-based toolchains
- BLIS moved to GCCcore because it doesn't like being built with Intel compilers (see https://github.com/flame/blis/pull/372)
- `gobff/2020b`, `iibff/2020b` (+ `gomkl/2020b`), to be included with EasyBuild v4.3.3
- BLAS test suite (Åke)
- tested with `gobff/2020b`
- BLAS tests suggest that BLIS isn't fully IEEE754 compliant?
- unclear whether this also happens with OpenBLAS or MKL
- Åke needs to refresh things a bit, perhaps reach out to BLIS developers
- see https://github.com/easybuilders/blis-eval/tree/main/ake/blas-correctness-test
- Sam tested https://github.com/xianyi/BLAS-Tester
- ran into linking errors when using BLIS
```
gcc -I./include -DAdd_ -DStringSunStyle -DATL_OS_Linux -DTHREADNUM=4 -DF77_INTEGER=int -fopenmp -m64 -O3 -o ./bin/xsl1blastst sl1blastst.o ATL_sf77rotg.o ATL_sf77rot.o ATL_sf77rotmg.o ATL_sf77rotm.o ATL_sf77swap.o ATL_sf77scal.o ATL_sf77copy.o ATL_sf77axpy.o ATL_sf77dot.o ATL_sdsf77dot.o ATL_dsf77dot.o ATL_sf77nrm2.o ATL_sf77asum.o ATL_sf77amax.o ATL_sf77rotgf.o ATL_sf77rotf.o ATL_sf77rotmgf.o ATL_sf77rotmf.o ATL_sf77swapf.o ATL_sf77scalf.o ATL_sf77copyf.o ATL_sf77axpyf.o ATL_sf77dotf.o ATL_sdsf77dotf.o ATL_dsf77dotf.o ATL_sf77nrm2f.o ATL_sf77asumf.o ATL_sf77amaxf.o ATL_sf77aminf.o ATL_flushcache.o ATL_sinfnrm.o ATL_rand.o ATL_svdiff.o ATL_sf77amin.o ./refblas/librefblas.a /apps/brussel/CO7/skylake/software/BLIS/0.8.0-GCCcore-10.2.0/lib/libblis.so -lm -lgfortran -lpthread
ATL_sf77amin.o: ATL_f77amin.c:function OPENBLAS_sf77amin: error: undefined reference to 'isamin_'
collect2: error: ld returned 1 exit status
make: *** [xsl1blastst] Error 1
```
- Åke may be able to help with that...
- use `NO_EXTENSION=1`
- and one can set `TEST_BLAS=-lblis` to make it simpler
- Sebastian starting with low-level benchmarks on JUWELS (Skylake partition)
- see https://github.com/easybuilders/blis-eval/tree/main/low-level/dgemm
- strange fluctuations with OpenBLAS on full node?
- Sam is looking into building CP2K with `gobff`
- already includes a regression test
- default: popt, should also look into psmp
- correctness checking
- run netlib BLAS/LAPACK tests (Åke)
- netlib BLAS tests with BLIS
- netlib LAPACK tests with BLIS+LAPACK
- netlib LAPACK tests with BLIS+libFLAME
- also https://github.com/xianyi/BLAS-Tester (Sam) - does not work with BLIS
- low-level performance testing (Sebastian)
- benchmark specific BLAS functions like dgemm
- https://github.com/flame/blis/tree/master/test/3
- compare to OpenBLAS/MKL
- Sebastian has a tool for interactive evaluation of BLAS/LAPACK functions
- requires Python 2
- see https://github.com/HPAC/ELAPS
- gearshift FFTW benchmark (ask Miguel?)
- Kenneth: see also PR for Christian with FFTW app
- Sebastian, Kenneth
- gobff/2020a + 2020b (PR is ready)
- foss with OpenBLAS replaced by BLIS+libFLAME+FFTW
- compare with foss + gomkl
- custom gobff-amd (patched BLIS+libFLAME+FFTW)
- iibff
- intel with MKL replaced by BLIS+libFLAME+FFTW
- FFTW 3.3.9 is out
- TODO: collect exact hardware info per site in blis-eval (see the collection sketch after this list)
- CPU model numbers, see `lscpu` output
- memory channels (hwloc?, `sudo dmidecode -t memory`)
- STREAM benchmark results
- see Åke's custom version (more exact timings)
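A minimal collection sketch (assumes `lscpu`, hwloc's `lstopo-no-graphics` and `dmidecode` are available on the node; adjust the commands per site):

```python
import socket
import subprocess

CMDS = {
    'lscpu': ['lscpu'],
    'lstopo': ['lstopo-no-graphics', '--of', 'console'],
    'memory (dmidecode)': ['sudo', 'dmidecode', '-t', 'memory'],
}

# write one text file per host with all collected output
with open('sysinfo-%s.txt' % socket.gethostname(), 'w') as fh:
    for label, cmd in CMDS.items():
        fh.write('### %s\n' % label)
        result = subprocess.run(cmd, capture_output=True, text=True)
        fh.write(result.stdout or '(no output from: %s)\n' % ' '.join(cmd))
```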
- AMD Rome
- HPC-UGent (doduo): Rome
- EMBL (Jure): Rome + Naples
- Compute Canada (Bart): Rome (single-node)
- JSC: Rome
- Azure (Davide): various Rome SKUs (124-core, 120 usable)
- Intel
- HPC-UGent (Kenneth): Haswell, Skylake, Cascade Lake
- VUB (Sam): Ivy Bridge, Haswell, Broadwell, Skylake
- EMBL (Jure): Skylake
- SURF: Cascade Lake
- Compute Canada: same, KNL
- Umeå (Åke): Broadwell, Skylake, (KNL)
- JSC: Skylake
- Azure (Davide): various (incl. special)
- other
- Arm (Kenneth @ AWS)
- POWER9 (Kenneth?, via UBirm.)
- Bart: 6248 vs 6248R makes a big difference...
- HPL (Bart)
- CP2K (Sam, Robert)
- Sam has some experience with this
- h2o_128 benchmark included in CP2K
- VASP
- too dependent on their shitty code
- fair amount in BLAS, most in FFTW
- Åke: may not be a good fit for this effort...
- Åke has a test suite (correctness) + benchmarks (with some scientific validation)
- specific to VASP 5.4.4
- based on https://www.nsc.liu.se/~pla/vasptest/ by Peter Larsson (Åke has changes on top)
- numpy/scipy test suites (Kenneth)
- QuantumESPRESSO (Robert, Sebastian)
- standard benchmarks
- previous experiments by Bart
Some HPL results (could be improved upon):

| LAPACK params | N | NB | P | Q | seconds | GFLOPS | CPU, BLAS lib |
| --- | --- | --- | --- | --- | --- | --- | --- |
| WR11C2R4 | 128000 | 384 | 8 | 8 | 678.88 | 2.059e+03 | 7452, MKL 2020.1 |
| WR12R2R4 | 177000 | 192 | 8 | 8 | 1528.47 | 2.419e+03 | 7452, MKL 2020.0, MKL_DEBUG_CPU_TYPE=5 |
| WR12R2R4 | 168960 | 232 | 4 | 4 | 1370.64 | 2.3461e+03 | 7452, AMD BLIS |
| WR12R2R4 | 177000 | 232 | 4 | 4 | 1629.23 | 2.2691e+03 | 7452, OpenBLAS |
- newer MKL versions have custom kernels for AMD Rome
- $MKL_DEBUG_CPU_TYPE no longer works with MKL 2020.1 (and is generally unsafe on AMD Rome)