finer control over the # of threads used by any given call to OpenBLAS #2392
If you use an OMP build of OpenBLAS (like the one available with Debian/Ubuntu), it falls back to a single thread when it detects that it is inside a parallel section. The other way, with a threaded build (the default from make) of recent versions, is to set the global thread count, make the call that should be limited, then set it again for the next call (though there is hardly any locking to avoid races in that code path). Tell us what improvement you would like for your case! In principle, if a specific procedure is needed for optimal integration, it can be appended to the current FAQ or added as a new wiki item.
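The set-the-global-then-call pattern described above can be sketched as follows. The real control is OpenBLAS's C function openblas_set_num_threads; here it and the GEMM call are replaced by pure-Python stand-ins so the sketch is self-contained, and (as noted above) two caller threads doing this concurrently would race on the global:

```python
# Sketch of the set-global / call / reset pattern described above.
# _openblas_threads stands in for OpenBLAS's internal global; the real
# control is the C function openblas_set_num_threads().

_openblas_threads = 8          # stand-in for OpenBLAS's global thread count

def openblas_set_num_threads(n):          # stand-in for the real C API
    global _openblas_threads
    _openblas_threads = n

def dgemm_stub(m, n, k):                  # stand-in for cblas_dgemm:
    return _openblas_threads              # report how many threads it would use

# Tiny leaf front: force a single thread for this one call.
openblas_set_num_threads(1)
used_small = dgemm_stub(2, 2, 2)

# Large top-of-tree front: let OpenBLAS use everything.
openblas_set_num_threads(8)
used_large = dgemm_stub(4096, 4096, 4096)

print(used_small, used_large)   # 1 8
```

Because the setting is a single process-wide global, the set/call/reset sequence is only safe if all caller threads coordinate, which is exactly the limitation discussed in the rest of this thread.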
I see. This is a common problem I have with many BLAS packages. The problem is the assumption that the user package is either parallel, and thus all calls to the BLAS "must" be single-threaded, or that the user package wants to let the BLAS do all the parallelism itself. That assumption breaks down in many cases.

Consider the case of computing a sparse LU factorization via a multifrontal method. The method constructs a tree, where each node uses a few calls to the BLAS (like dgemm, and the upper/lower triangular solves) to do a local dense LU factorization. At the leaves of the tree, the nodes are tiny (2-by-2 can occur), so of course only a single thread should be used inside a call to the BLAS. There could be thousands or millions of such tiny BLAS calls. I may want to use a parallel BLAS soon, but not for these thousands of BLAS calls. The BLAS must use one thread, efficiently, and it should not spin-wait for work to do (I have lots of work for the threads to do, in parallel factorizations of entire subtrees, where each of my threads is factorizing all the many nodes in a single subtree).

In the middle of the tree, I have many fronts I can factorize in parallel, and the fronts get bigger. So if I have, say, 8 threads available and I'm working on 4 subtrees, I may want to use 4 threads in one front, 2 in another, and 1 each in two other fronts. That is just an example; the parallelism is very dynamic. At the top of the tree, there is typically a large front, where all the threads need to work together in a single BLAS call.

Another case occurs in my GraphBLAS package (a sparse matrix library). The user may have lots of work going on in parallel; they can call GraphBLAS and tell me, per call, how many threads to use in that particular call to GraphBLAS. Then each of my functions may want to use the dense CBLAS on small submatrices or vectors, using a single thread each (but doing so in parallel), or I may want to use all of my threads in a single call to the BLAS.

So there is a mix of parallelism on at least 3 levels: the user application, my library, and the BLAS. I'm in the middle, and I need to be told how many threads to use (at most; I can auto-tune downwards if need be). And I thus need to tell the BLAS, beneath me, how many threads it can use, per call.
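The dynamic split described above (8 threads becoming 4, 2, 1, and 1 across four fronts) can be sketched with a hypothetical helper. split_threads and its work estimates are illustrative only, not part of any BLAS or SuiteSparse API, and the sketch assumes at least as many threads as fronts:

```python
# Hypothetical helper: divide the available threads among fronts in
# proportion to their estimated work, giving every front at least one
# thread (the "4 / 2 / 1 / 1" situation described above).
# Assumes total_threads >= len(front_work).

def split_threads(total_threads, front_work):
    total_work = sum(front_work)
    shares = [max(1, round(total_threads * w / total_work)) for w in front_work]
    # the max(1, ...) floor and rounding can oversubscribe;
    # trim the largest shares until the budget is met
    while sum(shares) > total_threads:
        shares[shares.index(max(shares))] -= 1
    return shares

# 8 threads over one big front, one medium front, and two tiny ones:
print(split_threads(8, [8000, 2000, 10, 10]))   # [4, 2, 1, 1]
```

In a real factorization the work estimates would come from the front sizes, and the split would be recomputed as subtrees finish, since (as noted above) the parallelism is very dynamic.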
Returning to this: while I cannot offer a short-term solution, I note that most of the BLAS functions in OpenBLAS "know" to limit themselves to a single thread if the workload is tiny. The spin-wait issue should be (mostly) addressable by setting a very small value for THREAD_TIMEOUT at compilation.
If you are using Python, you might be able to translate its implementation into other languages.
Out of curiosity: for these very small base cases, could it be beneficial to have optimized "hand-coded" routines that could, e.g., be inlined and unrolled to avoid all the overhead of calling into the BLAS?
Perhaps, but the threshold would be delicate, and I spend very little total time in those "small BLAS" calls when they are done properly (in a single thread). I just need them not to be done with too many threads, which really slows things down.
I took a look, but I don't think this would work. I would need to be able to check at link time which BLAS library is being linked in. That's something Python can do, I suppose, but my SuiteSparse libraries are just simple libwhatever.so files, and I don't control the loading and linking of dynamic libraries; someone else handles that when using SuiteSparse in some application. I do know how to write a library that dynamically loads and links in another library (I do that in GraphBLAS, which has its own JIT where it compiles new code at run time). So in theory, I could find library-dependent functions at link time, like
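The run-time probing idea described here can be expressed with the Python stdlib (the same dlopen/dlsym pattern applies in C). openblas_set_num_threads is OpenBLAS's real symbol; everything else below is a hedged sketch that simply degrades when no OpenBLAS is loadable:

```python
# Probe at run time for a library-specific thread-control symbol,
# falling back gracefully when it is absent (the dlopen/dlsym pattern
# described above, expressed with the Python stdlib).
import ctypes
import ctypes.util

def find_set_num_threads():
    """Return openblas_set_num_threads if an OpenBLAS is loadable, else None."""
    path = ctypes.util.find_library("openblas")
    if path is None:
        return None                      # no OpenBLAS on this system
    try:
        lib = ctypes.CDLL(path)
        fn = lib.openblas_set_num_threads
        fn.argtypes = [ctypes.c_int]
        fn.restype = None
        return fn
    except (OSError, AttributeError):
        return None                      # wrong BLAS, or symbol missing

setter = find_set_num_threads()
if setter is not None:
    setter(1)                            # force single-threaded BLAS calls
else:
    print("no OpenBLAS found; leaving the BLAS's default thread count alone")
```

A library like GraphBLAS could do the equivalent in C with dlopen(NULL, ...) and dlsym, probing whichever BLAS the application happened to link in.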
This should no longer be necessary for OpenBLAS starting from the upcoming 0.3.27 (#4441 being the relevant PR). At least in all my tests with big files from your matrix collection, it was always spurious multithreading in GEMV that made OpenBLAS slower than MKL (although I never saw the kind of horrendous slowdown that had been reported back in 2019).
Great! Shall I tag these issues as solved, as far as you can tell? DrTimothyAldenDavis/SuiteSparse#1 and DrTimothyAldenDavis/SuiteSparse#34.
If you trust my assessment, yes; though I guess they could be reopened if anyone complains. The 0.3.27 release is currently slated for the end of March, but will probably happen earlier unless real-life issues flare up again.
Sure! It would be ideal if we could replicate a before-and-after triage of the problem, but I'm OK with considering this fixed unless it flares up again. |
I need finer control of the # of threads that each call to the BLAS uses. My packages are themselves multithreaded. Each of my threads can make its own calls to the BLAS, in parallel. Some of those calls will be for small matrices, or for other cases where I know I need to use one thread. Other calls will want to use, say, 2 threads and no more (because I'm using threads elsewhere). Sometimes I want to use all the threads available.

However, I cannot set a global setting, such as with openblas_set_num_threads, since that would affect all of my calls to the BLAS. My packages themselves are used in other parallel packages. So I need a thread-local way to set the # of threads that OpenBLAS uses, much like this function: https://software.intel.com/en-us/mkl-developer-reference-c-mkl-set-num-threads-local so that I can exactly control how many threads OpenBLAS uses, for each call to the BLAS.

OpenBLAS cannot always assume it has all the threads available to it, since there are other things going on.
See also this discussion: DrTimothyAldenDavis/SuiteSparse#1
Is this possible with OpenBLAS? Is there an OpenBLAS equivalent to mkl_set_num_threads_local?
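The requested per-caller-thread behavior (what MKL's mkl_set_num_threads_local provides) can be sketched in pure Python. Every name below is hypothetical; nothing like this exists in OpenBLAS today, and the sketch only illustrates the thread-local semantics being asked for:

```python
# Hypothetical sketch of a thread-local per-call thread count, the
# behaviour requested here (modelled on MKL's mkl_set_num_threads_local).
# None of these names exist in OpenBLAS today.
import threading

_local = threading.local()

def set_num_threads_local(n):
    """Set this caller thread's BLAS thread count; return the previous
    value (0 meaning 'follow the global setting'), as MKL's variant does."""
    prev = getattr(_local, "n", 0)
    _local.n = n
    return prev

def effective_threads(global_n):
    """The thread-local value wins if set; otherwise the global one."""
    n = getattr(_local, "n", 0)
    return n if n > 0 else global_n

results = {}

def worker(name, n):
    set_num_threads_local(n)             # affects only this caller thread
    results[name] = effective_threads(global_n=8)

t1 = threading.Thread(target=worker, args=("small_front", 1))
t2 = threading.Thread(target=worker, args=("big_front", 4))
t1.start(); t2.start(); t1.join(); t2.join()

print(results)        # each worker saw only its own setting
```

With semantics like these, a multifrontal solver could give each of its worker threads its own BLAS thread budget without any global setting racing between them, which is exactly the control this issue asks for.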