Functions are not guarded against early threading #1886

Closed
brada4 opened this issue Nov 26, 2018 · 30 comments

Comments

@brada4
Contributor

brada4 commented Nov 26, 2018

Not an exhaustive list, just those hit in practice - this sort of measurement takes some time.

Discovered in #1882 - dgetrf zherk dsyrk
From #1883 - axpy
From #1897 (mxnet) cblas_?syrk _potri_
From JuliaRobotics/RigidBodyDynamics.jl#500 trsm

@brada4
Contributor Author

brada4 commented Nov 26, 2018

Probably worth turning this into a task list, sorted by absent guard vs. suspected wrong guard.

@martin-frbg
Collaborator

Based on preliminary benchmarks it may make sense to double the multithreading thresholds in interface/trsm.c for the two COMPLEX cases and raise it to four or five times the current limit (of 2*GEMM_MULTITHREAD_THRESHOLD) for float and double.
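For readers unfamiliar with these guards, a rough sketch of the kind of size check being discussed is shown below; the size expression, the multiplier and the helper name are illustrative placeholders, not the actual contents of interface/trsm.c:

```c
/* Hypothetical sketch of a size-based multithreading guard of the kind used
 * in OpenBLAS interface wrappers.  GEMM_MULTITHREAD_THRESHOLD is a real
 * OpenBLAS build-time constant; the fallback value, the size expression and
 * the function name below are illustrative placeholders only. */
#ifndef GEMM_MULTITHREAD_THRESHOLD
#define GEMM_MULTITHREAD_THRESHOLD 4    /* placeholder default */
#endif

static int trsm_thread_count(long m, long n, int max_threads)
{
    double work = (double)m * (double)n;

    /* Stay single-threaded below the threshold; the factor in front of
     * GEMM_MULTITHREAD_THRESHOLD is what this comment proposes to raise. */
    if (work < 2.0 * (double)GEMM_MULTITHREAD_THRESHOLD)
        return 1;
    return max_threads;
}
```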

@brada4
Contributor Author

brada4 commented Feb 5, 2019

#1989 (comment) has a test case against dsymv via LAPACK; there is no performance regression, but no thread threshold either.

@martin-frbg
Collaborator

#1115 is relevant for dsyrk (and suggests that level3 unrolling is an important factor besides the multithreading thresholds or lack thereof in the interface)

@drhpc
Contributor

drhpc commented May 29, 2021

Any progress on this? I am trying to get a grip on DrTimothyAldenDavis/SuiteSparse#1 to decide if it is worth it packaging Octave with sparse matrix computations via SuiteSparse and OpenBLAS in pkgsrc. It's quite a bummer to have the prominent notice not to use OpenBLAS for SuiteSparse.

For now I do not even know how real the threat of abysmal performance is (factor 1000?!). Uncertainty, doubt. It's an unfortunate situation that doesn't seem to see progress.

@martin-frbg
Collaborator

I'm just now experimenting with a few things (#3252, which is a bit broken on arm64 right now).

@brada4
Contributor Author

brada4 commented May 29, 2021

@drhpc the initial idea is fairly simple: draw a line in the sand where computation switches from one CPU to all CPUs. It may be slightly dated now that CPUs come with more and more cores.

cholmod manages threads via (G|I)OMP, and calls these:

    liblapack.so.3 => zpotrf_,dpotrf_
    libopenblas_pthreads.so.0 => dtrsm_,ztrsm_,dgemm_,dtrsv_,ztrsv_,zgemv_,zherk_,dgemv_,dsyrk_,zgemm_

@drhpc
Contributor

drhpc commented May 29, 2021

So you actually need some formula or table to use 1, 2, 4, … $many cores depending on data size? Lots of heuristics have to go into that :-/ Right now we still have 16 cores per node in the cluster, but soon that will probably be 128. Some steps in between seem sensible.

@drhpc
Contributor

drhpc commented May 29, 2021

My issue right now is that I'd like to see an actual little test that checks how badly we are really affected by this. I am no user of cholmod myself. Those (vague) reports about deadlock-like situations with a factor-1000 slowdown are frightening.

And frankly, I got a colleague who has deep trouble understanding why you would call an optimized/parallelized dense BLAS on bits of sparse matrices anyway. But apparently people do that. We both are outside our realms of experience there.

@brada4
Contributor Author

brada4 commented May 29, 2021

I have moved on to a different area of computing. If you have time, at least give it a try:

  • In interface/, pick a function that does not have a thread threshold, so you cannot make it worse than it already is...
  • Run the corresponding benchmark/ test single-threaded and all-threaded, overlay the graphs, and find roughly at what problem size all-threading starts to pay off well (see the timing sketch after this comment).
  • Note that Level 1 BLAS has one dimension and is easy to do, Level 2 has two, and for Level 3 you cannot even draw a 4D picture - you need to model the optimum and then track back to the input parameters.
  • Most helpful is Linux perf record ; perf report - excess threading shows up as something other than the BLAS function using most of the CPU time.
  • Another area is the work splitters right after the threading decision: should they split into cache-line, memory-page or NUMA-stride chunks (that would limit threading for very small inputs by a different constraint).
  • Another: if L2 caches communicate with each other more slowly than at their own speed (mostly they do), it might be worth adding cores one full cache at a time. The worst case is pumping a page over the system bus between cores, eating bandwidth away from RAM and PCI - this happens visibly on low-power SoCs.

IMO SuiteSparse used the pthreads version, so each OpenMP thread actually became nproc more threads, and most of the time was consumed task-switching and pulling data between RAM and cache, but I might be completely wrong.
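As a concrete starting point for the benchmarking step above, here is a minimal timing sketch, assuming an OpenBLAS build whose cblas.h exposes the openblas_set_num_threads()/openblas_get_num_procs() extensions; the matrix setup and sizes are illustrative only, not a statement about where the real crossover lies:

```c
/* Minimal sketch: time DTRSM single-threaded vs. all-threaded over a range of
 * sizes to see roughly where multithreading starts to pay off.  Repetitions,
 * warm-up and error handling are omitted for brevity. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <cblas.h>   /* OpenBLAS cblas.h also declares the openblas_* extensions */

static double wall(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

static double time_dtrsm(int n, int threads)
{
    double *a = malloc((size_t)n * n * sizeof(double));
    double *b = malloc((size_t)n * n * sizeof(double));
    for (int i = 0; i < n * n; i++) { a[i] = 0.001 * (i % 97); b[i] = 1.0; }
    for (int i = 0; i < n; i++) a[i * n + i] = (double)n;  /* safe diagonal */

    openblas_set_num_threads(threads);
    double t0 = wall();
    cblas_dtrsm(CblasColMajor, CblasLeft, CblasLower, CblasNoTrans,
                CblasNonUnit, n, n, 1.0, a, n, b, n);
    double t1 = wall();
    free(a); free(b);
    return t1 - t0;
}

int main(void)
{
    int all = openblas_get_num_procs();
    for (int n = 64; n <= 2048; n *= 2)
        printf("n=%5d   1 thread: %9.6fs   %d threads: %9.6fs\n",
               n, time_dtrsm(n, 1), all, time_dtrsm(n, all));
    return 0;
}
```

Compile with something like `cc -O2 bench_trsm.c -lopenblas` (the file name is made up) and overlay the two columns against n; the same harness can be pointed at any of the other functions listed at the top of this issue.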

@martin-frbg
Collaborator

I am not sure that heuristics for ramping up the number of threads are what is needed; experiments so far seem to indicate that the added overhead, and the need to get it right across a wide range of platforms, make it not worthwhile.
In addition to the general overhead of premature multithreading for small cases, I have recently identified a historical oversight where a memory buffer is allocated although it is never used - this is what the PR is trying to address. At least in pthreads-based builds, the speed difference can reach a factor of ten, and some of these functions (mostly level 2 BLAS) appear to be on the code path of cholmod. Old versions of OpenBLAS until about three years ago did not employ any locking around buffer allocations so were much faster but wildly thread-unsafe. This may be the origin of the advice in the suitesparse code.

@martin-frbg
Collaborator

My initial tests with the cholmod "demo" files suggest that current MKL (plus Intel compiler optimizations, Intel libc and whatnot) is up to ten percent faster than current, gcc-compiled OpenBLAS on Haswell-generation desktop hardware. AVX512 may be a slightly different story as that is less well supported in OpenBLAS, but I doubt the apocryphal "100 times slower" and I am not inclined to go looking for that without at least some attempt at a reproducer.

@drhpc
Contributor

drhpc commented Jun 2, 2021

> My initial tests with the cholmod "demo" files suggest that current MKL (plus Intel compiler optimizations, Intel libc and whatnot) is up to ten percent faster than current, gcc-compiled OpenBLAS on Haswell-generation desktop hardware.

I am trying to find out how relevant this issue is for us. It might well only hit for high numbers of threads/cores.

Can you give me the exact test code/compiler commands you used? I'll run it on our machines with up to 40 cores and compare.

@brada4
Contributor Author

brada4 commented Jun 2, 2021

CHOLMOD/Demo/Readme.txt explains that.
OpenBLAS must be built by typing make.
I can confirm Martin's result on an early pre-AVX512 Skylake with 6 cores + HT; on AMD, though, the tables turn the other way around.

@martin-frbg
Collaborator

I built OpenBLAS with default options (both with and without USE_OPENMP=1), and built SuiteSparse with either make CC=icc CXX=icc or make BLAS="/opt/OpenBLAS/lib -lopenblas" LAPACK="/opt/OpenBLAS/lib -lopenblas". Then I mainly ran cholmod_demo <Matrix/bcsstk01.tri from its CHOLMOD/Demo subdirectory. (Did not see much variation in results for OpenBLAS versions ranging from 0.2.17 to today, but no dramatic improvements from my proposed patches either.)
Have not tried anything bigger than the old hexacore i7-8700K in my office; I may find time to run it on a Ryzen 3900X or package the whole thing for CI. (On AMD you need to set an environment variable to make MKL skip its CPU checks, else you'll end up in a crawling "compatibility mode" when it sees a non-Intel CPU.)

@martin-frbg
Collaborator

Davis states he cannot remember when or which version, also expects the provided matrix files in the cholmod demo to be much too small to reproduce the problem. I see about 20 percent advantage for MKL with bigger files from his sparse matrix collection now (e.g. the 18000x18000 nd6k.mtx that is referenced in the "gpu.sh" example or the 46000x46000 bcsstk39.mtx), the main culprit appears to be SYRK (which is not affected by my experimental patches as of yet)

@drhpc
Contributor

drhpc commented Jun 7, 2021

Sorry for being silent here … but I figured that my tests with these examples probably wouldn't change much in the picture. A 20 % difference definitely is not what he is talking about; we're looking for a 100-times slowdown with multiple threads :-/

@drhpc
Contributor

drhpc commented Jun 7, 2021

Actually …

> IMO SuiteSparse used the pthreads version, so each OpenMP thread actually became nproc more threads, and most of the time was consumed task-switching and pulling data between RAM and cache, but I might be completely wrong.

… is that the whole issue? Was someone using an OpenMP program that multiplied threads by recursively parallel OpenBLAS?

@brada4
Contributor Author

brada4 commented Jun 8, 2021

> … is that the whole issue? Was someone using an OpenMP program that multiplied threads by recursively parallel OpenBLAS?

Quite often - at least the stated performance impact looks that spectacular. A performance profile of the damage is very welcome here; one can even send a screenshot showing a non-productive function eating lots of CPU time - there are cases where we could reconstruct the source of the regression from that.

n^3 overcommitment can be achieved too: run n processes, each automatically starting n OMP threads, each of which in turn primes n pthreads - on a 16-core node with n = 16 that is 4096 runnable threads. That gets characterized as "grinding for days without any sensible result while a one-CPU docker container finishes in X seconds..."

@martin-frbg
Collaborator

FWIW, I got our drone.io CI to build and run SuiteSparse with current develop branch on 24-core/48-thread AMD Epyc, and while I do not have any MKL numbers for comparison, at least the OpenBLAS version does not lock up or spend extraordinary time on the 18k-by-18k "nd6k" cholmod test. I could repeat the same test on a 96-core ARM64 server there if desired.

@drhpc
Contributor

drhpc commented Jun 9, 2021

And we got Tim Davis digging for that original test case … which will be part of the CI soon, I hope.

One thought: Could the CI on the 24 core EPYC easily do a run with past OpenBLAS versions to test the hypothesis that the introduction of extra locking around 3 years ago influenced things, and maybe the behaviour got fixed in the meantime?

@martin-frbg
Collaborator

I could probably massage the yml file into downloading some older release instead of the current head. (And I now believe the claim of the extreme slowdown comes from #2265 - late 2019, but only referring to unspecified prepackaged Ubuntu and RedHat builds of OpenBLAS, which could have been anything from a year earlier, or even a bug in libgomp or whatever.)

@martin-frbg
Collaborator

Unfortunately the performance of the CI host turns out to be far too variable to use for benchmarking past releases (more than a factor of ten between individual runs of cholmod <nd6k.mtx with the exact same build)

@martin-frbg
Collaborator

A pointer to a test case was just added to the suitesparse issue ticket mentioned above (CHOLMOD of a 60000x60000 matrix, multithreaded run seen spending an inordinate amount of "system" time probably spinning on a lock somewhere)

@brada4
Contributor Author

brada4 commented Nov 10, 2021

The usual approach is to run with 1, 2, and all NUM_THREADS and compare the top 10 entries from perf record -p <pid> ; perf report.
It used to be sched_yield quite often, but with that now out of the picture I don't know what it could be.
PS: I had almost forgotten about this.

@martin-frbg
Collaborator

Seems I can only reproduce that when I combine (OpenMP-using) SuiteSparse with a non-OpenMP build of OpenBLAS; when both are OpenMP it is a simple case of finding the optimum number of threads for the problem size. In the broken ("mixed") case, most time is spent in gomp_barrier_wait_end - not sure if there is anything new to learn from this.

@martin-frbg
Collaborator

martin-frbg commented Nov 10, 2021

Came across mlpack/mlpack#1858, and the suggestion to set OMP_WAIT_POLICY=passive solves the problem for the "mixed" run (though in that case single-threading remains 40 percent faster than even running with 2 threads, and using all 6 is 3.4 times slower than single).

@brada4
Contributor Author

brada4 commented Nov 11, 2021

Set the threshold to something like samplesize < pagesize * ncpu, without measurement?
I think that's faster, and it obviously avoids the damage of dragging pages between CPU caches...
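For illustration, a minimal sketch of that heuristic, assuming standard POSIX sysconf() for the page size and CPU count; the comparison itself just restates the suggestion above and is not tuned or measured:

```c
/* Sketch of the proposed "no measurement" guard: only go multithreaded once
 * the working set exceeds one memory page per online CPU.  The helper name
 * and the byte-count argument are illustrative. */
#include <unistd.h>

static int heuristic_thread_count(size_t sample_bytes, int max_threads)
{
    long pagesize = sysconf(_SC_PAGESIZE);          /* typically 4096 bytes */
    long ncpu     = sysconf(_SC_NPROCESSORS_ONLN);  /* online CPUs          */

    if (sample_bytes < (size_t)(pagesize * ncpu))
        return 1;            /* small sample: keep it on one CPU            */
    return max_threads;      /* big enough to amortize cross-cache traffic  */
}
```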

@martin-frbg
Collaborator

Closing here as the original topic has been resolved by adding lower bounds and in some cases shortcuts in the 0.3.16 to 0.3.21 timeframe. The suitesparse issue is also tracked in #2265 - I intend to perform another test run with current versions locally and/or in the GCC compile farm, but preliminary results underline that the old claim of a 100x slowdown can only have been based on one unfortunate mix of unspecified versions and build options.

@brada4
Contributor Author

brada4 commented Jan 14, 2024

Thanks, haven't seen this class of issues in a while and frankly forgot this was open.
