Functions are not guarded against early threading #1886

Closed
brada4 opened this issue Nov 26, 2018 · 30 comments

Comments

@brada4
Contributor

brada4 commented Nov 26, 2018

Not an exhaustive list, just those hit in practice - this sort of measurement takes some time.

Discovered in #1882 - dgetrf zherk dsyrk
From #1883 - axpy
From #1897 (mxnet) cblas_?syrk _potri_
From JuliaRobotics/RigidBodyDynamics.jl#500 trsm

@brada4
Contributor Author

brada4 commented Nov 26, 2018

Probably worth turning this into a task list, sorted by absent guard vs. suspected wrong guard.

@martin-frbg
Collaborator

Based on preliminary benchmarks it may make sense to double the multithreading thresholds in interface/trsm.c for the two COMPLEX cases and raise it to four or five times the current limit (of 2*GEMM_MULTITHREAD_THRESHOLD) for float and double.
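For readers unfamiliar with these guards, a rough sketch of the kind of size check being discussed is shown below; the size expression, the multiplier and the helper name are illustrative placeholders, not the actual contents of interface/trsm.c:

```c
/* Hypothetical sketch of a size-based multithreading guard of the kind used
 * in OpenBLAS interface wrappers.  GEMM_MULTITHREAD_THRESHOLD is a real
 * OpenBLAS build-time constant; the fallback value, the size expression and
 * the function name below are illustrative placeholders only. */
#ifndef GEMM_MULTITHREAD_THRESHOLD
#define GEMM_MULTITHREAD_THRESHOLD 4    /* placeholder default */
#endif

static int trsm_thread_count(long m, long n, int max_threads)
{
    double work = (double)m * (double)n;

    /* Stay single-threaded below the threshold; the factor in front of
     * GEMM_MULTITHREAD_THRESHOLD is what this comment proposes to raise. */
    if (work < 2.0 * (double)GEMM_MULTITHREAD_THRESHOLD)
        return 1;
    return max_threads;
}
```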

@brada4
Contributor Author

brada4 commented Feb 5, 2019

#1989 (comment) has a test case against dsymv via LAPACK; there is no performance regression, but no thread threshold either.

@martin-frbg
Collaborator

#1115 is relevant for dsyrk (and suggests that level3 unrolling is an important factor besides the multithreading thresholds or lack thereof in the interface)

@drhpc
Contributor

drhpc commented May 29, 2021

Any progress on this? I am trying to get a grip on DrTimothyAldenDavis/SuiteSparse#1 to decide if it is worth it packaging Octave with sparse matrix computations via SuiteSparse and OpenBLAS in pkgsrc. It's quite a bummer to have the prominent notice not to use OpenBLAS for SuiteSparse.

For now I do not even know how real the threat of abysmal performance is (factor 1000?!). Uncertainty, doubt. It's an unfortunate situation that doesn't seem to see progress.

@martin-frbg
Collaborator

I'm just now experimenting with a few things (#3252, which is a bit broken on arm64 right now).

@brada4
Contributor Author

brada4 commented May 29, 2021

@drhpc the initial idea is fairly simple: draw a line in the sand where computation switches from one CPU to all CPUs. It may be slightly dated now that CPUs come with more and more cores.

cholmod manages threads via (G|I)OMP, and calls these:

    liblapack.so.3 => zpotrf_,dpotrf_
    libopenblas_pthreads.so.0 => dtrsm_,ztrsm_,dgemm_,dtrsv_,ztrsv_,zgemv_,zherk_,dgemv_,dsyrk_,zgemm_

@drhpc
Contributor

drhpc commented May 29, 2021

So you actually need some formula or table to use 1, 2, 4, … $many cores depending on data size? Lots of heuristics have to go into that :-/ Right now we still have 16 cores per node in the cluster, but soon that will probably be 128. Some steps in between seem sensible.

@drhpc
Contributor

drhpc commented May 29, 2021

My issue right now is that I'd like to see an actual little test that checks how badly we are really affected by this. I am no user of cholmod myself. Those (vague) reports about deadlock-like situations with a factor-1000 slowdown are frightening.

And frankly, I got a colleague who has deep trouble understanding why you would call an optimized/parallelized dense BLAS on bits of sparse matrices anyway. But apparently people do that. We both are outside our realms of experience there.

@brada4
Contributor Author

brada4 commented May 29, 2021

I have moved on to a different area of computing. If you have time, at least give it a try:

  • In interface/, pick a function that does not have a thread threshold, so you cannot make it worse than it already is...
  • Run the corresponding benchmark/ test single-threaded and all-threaded, overlay the graphs, and find roughly at what problem size all-threading starts to pay off well (see the timing sketch after this comment).
  • Note that Level 1 BLAS has one dimension and is easy to do, Level 2 has two, and for Level 3 you cannot even draw a 4D picture - you need to model the optimum and then track back to the input parameters.
  • Most helpful is Linux perf record ; perf report - excess threading shows up as something other than the BLAS function using most of the CPU time.
  • Another area is the work splitters right after the threading decision: should they split into cache-line, memory-page or NUMA-stride chunks (that would limit threading for very small inputs by a different constraint).
  • Another: if L2 caches communicate with each other more slowly than at their own speed (mostly they do), it might be worth adding cores one full cache at a time. The worst case is pumping a page over the system bus between cores, eating bandwidth away from RAM and PCI - this happens visibly on low-power SoCs.

IMO SuiteSparse used the pthreads version, so each OpenMP thread actually became nproc more threads, and most of the time was consumed task-switching and pulling data between RAM and cache, but I might be completely wrong.
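As a concrete starting point for the benchmarking step above, here is a minimal timing sketch, assuming an OpenBLAS build whose cblas.h exposes the openblas_set_num_threads()/openblas_get_num_procs() extensions; the matrix setup and sizes are illustrative only, not a statement about where the real crossover lies:

```c
/* Minimal sketch: time DTRSM single-threaded vs. all-threaded over a range of
 * sizes to see roughly where multithreading starts to pay off.  Repetitions,
 * warm-up and error handling are omitted for brevity. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <cblas.h>   /* OpenBLAS cblas.h also declares the openblas_* extensions */

static double wall(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

static double time_dtrsm(int n, int threads)
{
    double *a = malloc((size_t)n * n * sizeof(double));
    double *b = malloc((size_t)n * n * sizeof(double));
    for (int i = 0; i < n * n; i++) { a[i] = 0.001 * (i % 97); b[i] = 1.0; }
    for (int i = 0; i < n; i++) a[i * n + i] = (double)n;  /* safe diagonal */

    openblas_set_num_threads(threads);
    double t0 = wall();
    cblas_dtrsm(CblasColMajor, CblasLeft, CblasLower, CblasNoTrans,
                CblasNonUnit, n, n, 1.0, a, n, b, n);
    double t1 = wall();
    free(a); free(b);
    return t1 - t0;
}

int main(void)
{
    int all = openblas_get_num_procs();
    for (int n = 64; n <= 2048; n *= 2)
        printf("n=%5d   1 thread: %9.6fs   %d threads: %9.6fs\n",
               n, time_dtrsm(n, 1), all, time_dtrsm(n, all));
    return 0;
}
```

Compile with something like `cc -O2 bench_trsm.c -lopenblas` (the file name is made up) and overlay the two columns against n; the same harness can be pointed at any of the other functions listed at the top of this issue.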

@martin-frbg
Collaborator

I am not sure that heuristics for ramping up the number of threads are what is needed; experiments so far seem to indicate that the added overhead, and the need to get it right across a wide range of platforms, make it not worthwhile.
In addition to the general overhead of premature multithreading for small cases, I have recently identified a historical oversight where a memory buffer is allocated although it is never used - this is what the PR is trying to address. At least in pthreads-based builds, the speed difference can reach a factor of ten, and some of these functions (mostly level 2 BLAS) appear to be on the code path of cholmod. Old versions of OpenBLAS until about three years ago did not employ any locking around buffer allocations so were much faster but wildly thread-unsafe. This may be the origin of the advice in the suitesparse code.

@martin-frbg
Collaborator

My initial tests with the cholmod "demo" files suggest that current MKL (plus Intel compiler optimizations, Intel libc and whatnot) is up to ten percent faster than current, gcc-compiled OpenBLAS on Haswell-generation desktop hardware. AVX512 may be a slightly different story as that is less well supported in OpenBLAS, but I doubt the apocryphal "100 times slower" and I am not inclined to go looking for that without at least some attempt at a reproducer.

@drhpc
Contributor

drhpc commented Jun 2, 2021

> My initial tests with the cholmod "demo" files suggest that current MKL (plus Intel compiler optimizations, Intel libc and whatnot) is up to ten percent faster than current, gcc-compiled OpenBLAS on Haswell-generation desktop hardware.

I am trying to find out how relevant this issue is for us. It might well only hit for high numbers of threads/cores.

Can you give me the exact test code/compiler commands you used? I'll run it on our machines with up to 40 cores and compare.

@brada4
Contributor Author

brada4 commented Jun 2, 2021

CHOLMOD/Demo/Readme.txt explains that.
OpenBLAS must be built by typing make.
I can confirm Martin's result on an early pre-AVX512 Skylake with 6 cores + HT; on AMD, though, the tables turn the other way around.

@martin-frbg
Collaborator

I built OpenBLAS with default options (both with and without USE_OPENMP=1), and built SuiteSparse with either make CC=icc CXX=icc or make BLAS="/opt/OpenBLAS/lib -lopenblas" LAPACK="/opt/OpenBLAS/lib -lopenblas". Then I mainly ran cholmod_demo <Matrix/bcsstk01.tri from its CHOLMOD/Demo subdirectory. (Did not see much variation in results for OpenBLAS versions ranging from 0.2.17 to today, but no dramatic improvements from my proposed patches either.)
Have not tried anything bigger than the old hexacore i7-8700K in my office; I may find time to run it on a Ryzen 3900X or package the whole thing for CI. (On AMD you need to set an environment variable to make MKL skip its CPU checks, else you'll end up in a crawling "compatibility mode" when it sees a non-Intel CPU.)

@martin-frbg
Collaborator

Davis states he cannot remember when or which version, also expects the provided matrix files in the cholmod demo to be much too small to reproduce the problem. I see about 20 percent advantage for MKL with bigger files from his sparse matrix collection now (e.g. the 18000x18000 nd6k.mtx that is referenced in the "gpu.sh" example or the 46000x46000 bcsstk39.mtx), the main culprit appears to be SYRK (which is not affected by my experimental patches as of yet)

@drhpc
Contributor

drhpc commented Jun 7, 2021

Sorry for being silent here … but I figured that my tests with these examples probably wouldn't change much in the picture. A 20 % difference definitely is not what he is talking about; we're looking for a 100-times slowdown with multiple threads :-/

@drhpc
Contributor

drhpc commented Jun 7, 2021

Actually …

> IMO SuiteSparse used the pthreads version, so each OpenMP thread actually became nproc more threads, and most of the time was consumed task-switching and pulling data between RAM and cache, but I might be completely wrong.

… is that the whole issue? Was someone using an OpenMP program that multiplied threads by recursively parallel OpenBLAS?

@brada4
Contributor Author

brada4 commented Jun 8, 2021

> … is that the whole issue? Was someone using an OpenMP program that multiplied threads by recursively parallel OpenBLAS?

Quite often - at least the stated performance impact looks that spectacular. A performance profile of the damage is very welcome here; one can even send a screenshot showing a non-productive function eating lots of CPU time - there are cases where we could reconstruct the source of the regression from that.

n^3 overcommitment can be achieved too: run n processes, each automatically starting n OMP threads, each of which in turn primes n pthreads - on a 16-core node with n = 16 that is 4096 runnable threads. That gets characterized as "grinding for days without any sensible result while a one-CPU docker container finishes in X seconds..."

@martin-frbg
Collaborator

FWIW, I got our drone.io CI to build and run SuiteSparse with current develop branch on 24-core/48-thread AMD Epyc, and while I do not have any MKL numbers for comparison, at least the OpenBLAS version does not lock up or spend extraordinary time on the 18k-by-18k "nd6k" cholmod test. I could repeat the same test on a 96-core ARM64 server there if desired.

@drhpc
Contributor

drhpc commented Jun 9, 2021

And we got Tim Davis digging for that original test case … which will be part of the CI soon, I hope.

One thought: Could the CI on the 24 core EPYC easily do a run with past OpenBLAS versions to test the hypothesis that the introduction of extra locking around 3 years ago influenced things, and maybe the behaviour got fixed in the meantime?

@martin-frbg
Collaborator

I could probably massage the yml file into downloading some older release instead of the current head. (And I now believe the claim of the extreme slowdown comes from #2265 - late 2019, but only referring to unspecified prepackaged Ubuntu and RedHat builds of OpenBLAS, which could have been anything from a year earlier, or even a bug in libgomp or whatever.)

@martin-frbg
Collaborator

Unfortunately the performance of the CI host turns out to be far too variable to use for benchmarking past releases (more than a factor of ten between individual runs of cholmod <nd6k.mtx with the exact same build)

@martin-frbg
Collaborator

A pointer to a test case was just added to the suitesparse issue ticket mentioned above (CHOLMOD of a 60000x60000 matrix, multithreaded run seen spending an inordinate amount of "system" time probably spinning on a lock somewhere)

@brada4
Contributor Author

brada4 commented Nov 10, 2021

The usual approach is to run with 1, 2, and all NUM_THREADS and compare the top 10 entries from perf record -p <pid> ; perf report.
It used to be sched_yield quite often, but with that now out of the picture I don't know what it could be.
PS: I had almost forgotten about this.

@martin-frbg
Collaborator

Seems I can only reproduce that when I combine (OpenMP-using) SuiteSparse with a non-OpenMP build of OpenBLAS; when both are OpenMP it is a simple case of finding the optimum number of threads for the problem size. In the broken ("mixed") case, most time is spent in gomp_barrier_wait_end - not sure if there is anything new to learn from this.

@martin-frbg
Collaborator

martin-frbg commented Nov 10, 2021

Came across mlpack/mlpack#1858, and the suggestion to set OMP_WAIT_POLICY=passive solves the problem for the "mixed" run (though in that case single-threading remains 40 percent faster than even running with 2 threads, and using all 6 is 3.4 times slower than single).

@brada4
Contributor Author

brada4 commented Nov 11, 2021

Set the threshold to something like samplesize < pagesize * ncpu, without measurement?
I think that's faster, and it obviously avoids the damage of dragging pages between CPU caches...
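For illustration, a minimal sketch of that heuristic, assuming standard POSIX sysconf() for the page size and CPU count; the comparison itself just restates the suggestion above and is not tuned or measured:

```c
/* Sketch of the proposed "no measurement" guard: only go multithreaded once
 * the working set exceeds one memory page per online CPU.  The helper name
 * and the byte-count argument are illustrative. */
#include <unistd.h>

static int heuristic_thread_count(size_t sample_bytes, int max_threads)
{
    long pagesize = sysconf(_SC_PAGESIZE);          /* typically 4096 bytes */
    long ncpu     = sysconf(_SC_NPROCESSORS_ONLN);  /* online CPUs          */

    if (sample_bytes < (size_t)(pagesize * ncpu))
        return 1;            /* small sample: keep it on one CPU            */
    return max_threads;      /* big enough to amortize cross-cache traffic  */
}
```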

@martin-frbg
Collaborator

Closing here as the original topic has been resolved by adding lower bounds and in some cases shortcuts in the 0.3.16 to 0.3.21 timeframe. The suitesparse issue is also tracked in #2265 - I intend to perform another test run with current versions locally and/or in the GCC compile farm, but preliminary results underline that the old claim of a 100x slowdown can only have been based on one unfortunate mix of unspecified versions and build options.

@brada4
Contributor Author

brada4 commented Jan 14, 2024

Thanks, haven't seen this class of issues in a while and frankly forgot this was open.
