Functions are not guarded against early threading #1886
Probably worth making this into a task list, sorted by absent guard and suspected wrong guard.
Based on preliminary benchmarks it may make sense to double the multithreading thresholds in interface/trsm.c for the two COMPLEX cases, and to raise them to four or five times the current limit (of 2*GEMM_MULTITHREAD_THRESHOLD) for float and double.
#1989 (comment) has a test case.
#1115 is relevant for dsyrk (and suggests that level-3 unrolling is an important factor besides the multithreading thresholds, or lack thereof, in the interface).
Any progress on this? I am trying to get a grip on DrTimothyAldenDavis/SuiteSparse#1 to decide whether it is worth packaging Octave with sparse matrix computations via SuiteSparse and OpenBLAS in pkgsrc. It's quite a bummer to have the prominent notice not to use OpenBLAS with SuiteSparse. For now I don't even know how real the threat of abysmal performance is (a factor of 1000?!). Uncertainty, doubt. It's an unfortunate situation that doesn't seem to see progress.
I'm just now experimenting with a few things (#3252, which is a bit broken on arm64 right now).
@drhpc the initial idea was fairly simple: draw a line in the sand where computation switches from one CPU to all of them. It might be slightly dated now that CPUs gain more and more cores. cholmod manages threads via (G|I)OMP, and calls these:
So you actually need some formula or table to use 1, 2, 4, … $many cores depending on data size? Lots of heuristics would have to go into that :-/ Right now we still have 16 cores per node in the cluster, but soon that will probably be 128. Some steps in between seem sensible.
My issue right now is that I'd like to see an actual little test that checks how badly we are really affected. I am no user of cholmod myself. Those (vague) reports about deadlock-like situations with a factor-1000 slowdown are frightening. And frankly, I have a colleague who has deep trouble understanding why you would call an optimized/parallelized dense BLAS on bits of sparse matrices anyway. But apparently people do that. We are both outside our realms of experience there.
I have moved on to another discipline of computing. If you have time, at least give it a try:
IMO SuiteSparse used the pthread version, so that each OpenMP thread actually was
I am not sure that heuristics for ramping up the number of threads are what is needed; experiments so far seem to indicate that the added overhead, and the need to get it right across a wide range of platforms, make it not worthwhile.
My initial tests with the cholmod "demo" files suggest that current MKL (plus Intel compiler optimizations, Intel libc and whatnot) is up to ten percent faster than current, gcc-compiled OpenBLAS on Haswell-generation desktop hardware. AVX512 may be a slightly different story, as it is less well supported in OpenBLAS, but I doubt the apocryphal "100 times slower", and I am not inclined to go looking for it without at least some attempt at a reproducer.
I am trying to find out how relevant this issue is for us. It might well only hit at high numbers of threads/cores. Can you give me the exact test code and compiler commands you used? I'll run it on our machines with up to 40 cores and compare.
I built OpenBLAS with default options (both with and without USE_OPENMP=1), and built SuiteSparse with either
Davis states he cannot remember when or which version, and also expects the matrix files provided with the cholmod demo to be much too small to reproduce the problem. I now see about a 20 percent advantage for MKL with bigger files from his sparse matrix collection (e.g. the 18000x18000 nd6k.mtx that is referenced in the "gpu.sh" example, or the 46000x46000 bcsstk39.mtx); the main culprit appears to be SYRK (which is not affected by my experimental patches as of yet).
Sorry for being silent here … but I figured that my tests with these examples probably wouldn't change the picture much. A 20% difference is definitely not what he is talking about; we're looking for a 100x slowdown with multiple threads :-/
Actually …
… is that the whole issue? Was someone using an OpenMP program that multiplied threads by recursively parallel OpenBLAS?
Quite often, at least the stated performance impact looks that spectacular. Some performance profile of the damage would be very welcome here; even a screenshot showing a non-productive function eating lots of CPU time can help, and there are cases where we could reconstruct the source of a regression from that. n^3 overcommitment can be achieved too: run n processes, each automatically starting n OMP threads, each priming n pthreads. That would be characterized as "grinding for days without any sensible result while a one-cpu docker finishes in X seconds..."
FWIW, I got our drone.io CI to build and run SuiteSparse with current
And we got Tim Davis digging for that original test case, which will be part of the CI soon, I hope. One thought: could the CI on the 24-core EPYC easily do a run with past OpenBLAS versions to test the hypothesis that the introduction of extra locking around 3 years ago influenced things, and that maybe the behaviour got fixed in the meantime?
I could probably massage the yml file into downloading some older release instead of the current head. (And I now believe the claim of the extreme slowdown comes from #2265: late 2019, but only referring to unspecified prepackaged Ubuntu and RedHat builds of OpenBLAS, which could have been anything from a year earlier, or even a bug in libgomp or whatever.)
Unfortunately the performance of the CI host turns out to be far too variable to use for benchmarking past releases (more than a factor of ten between individual runs of
A pointer to a test case was just added to the suitesparse issue ticket mentioned above (CHOLMOD of a 60000x60000 matrix, with the multithreaded run seen spending an inordinate amount of "system" time, probably spinning on a lock somewhere).
Normally the approach is to run with 1, 2, and all NUM_THREADS and compare the top 10 in
Seems I can only reproduce that when I combine (OpenMP-using) SuiteSparse with a non-OpenMP build of OpenBLAS; when both use OpenMP, it is a simple case of finding the optimum number of threads for the problem size. In the broken ("mixed") case,
Came across mlpack/mlpack#1858 and the suggestion to set
Set the threshold to something like samplesize < pagesize * ncpu, without measurement?
Closing here, as the original topic has been resolved by adding lower bounds, and in some cases shortcuts, in the 0.3.16 to 0.3.21 timeframe. The suitesparse issue is also tracked in #2265. I intend to perform another test run with current versions locally and/or in the GCC compile farm, but preliminary results underline that the old claim of a 100x slowdown can only have been based on one unfortunate mix of unspecified versions and build options.
Thanks, haven't seen this class of issues in a while and frankly forgot this was open.
Not an exhaustive list, just those used in practice; this sort of measurement takes some time.
Discovered in #1882 -
dgetrf
zherk
dsyrk
From #1883 -
axpy
From #1897 (mxnet)
cblas_?syrk
_potri_
From JuliaRobotics/RigidBodyDynamics.jl#500
trsm