finer control over the # of threads used by any given call to OpenBLAS #2392
If you use an OMP build of OpenBLAS (like the one available with Debian/Ubuntu), it falls back to a single thread when it detects that it is inside a parallel section. The other way, with a threaded build (the default from make) of recent versions, is to set the global thread count, make the call that should be limited, then set it again for the next call (though there is hardly any locking to avoid races in that code path). Tell us what improvement you would like for your case! In principle, if a specific procedure is needed for optimal integration, it can be appended to the current FAQ or added as a new wiki item.
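The set-the-global-then-call pattern described above can be sketched as follows. The real control is OpenBLAS's C function openblas_set_num_threads; here it and the GEMM call are replaced by pure-Python stand-ins so the sketch is self-contained, and (as noted above) two caller threads doing this concurrently would race on the global:

```python
# Sketch of the set-global / call / reset pattern described above.
# _openblas_threads stands in for OpenBLAS's internal global; the real
# control is the C function openblas_set_num_threads().

_openblas_threads = 8          # stand-in for OpenBLAS's global thread count

def openblas_set_num_threads(n):          # stand-in for the real C API
    global _openblas_threads
    _openblas_threads = n

def dgemm_stub(m, n, k):                  # stand-in for cblas_dgemm:
    return _openblas_threads              # report how many threads it would use

# Tiny leaf front: force a single thread for this one call.
openblas_set_num_threads(1)
used_small = dgemm_stub(2, 2, 2)

# Large top-of-tree front: let OpenBLAS use everything.
openblas_set_num_threads(8)
used_large = dgemm_stub(4096, 4096, 4096)

print(used_small, used_large)   # 1 8
```

Because the setting is a single process-wide global, the set/call/reset sequence is only safe if all caller threads coordinate, which is exactly the limitation discussed in the rest of this thread.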
I see. This is a common problem I have with many BLAS packages. The problem is the assumption that the user package is either parallel, and thus all calls to the BLAS "must" be single-threaded, or that the user package wants to let the BLAS do all the parallelism itself. That assumption breaks down in many cases.

Consider the case of computing a sparse LU factorization via a multifrontal method. The method constructs a tree, where each node uses a few calls to the BLAS (like dgemm, and the upper/lower triangular solves) to do a local dense LU factorization. At the leaves of the tree, the nodes are tiny (2-by-2 can occur), so of course only a single thread should be used inside a call to the BLAS. There could be thousands or millions of such tiny BLAS calls. I may want to use a parallel BLAS soon, but not for these thousands of BLAS calls. The BLAS must use one thread, efficiently, and it should not spin-wait for work to do (I have lots of work for the threads to do, in parallel factorizations of entire subtrees, where each of my threads is factorizing all the many nodes in a single subtree).

In the middle of the tree, I have many fronts I can factorize in parallel, and the fronts get bigger. So if I have, say, 8 threads available and I'm working on 4 subtrees, I may want to use 4 threads in one front, 2 in another, and 1 each in two other fronts. That is just an example; the parallelism is very dynamic. At the top of the tree, there is typically a large front, where all the threads need to work together in a single BLAS call.

Another case occurs in my GraphBLAS package (a sparse matrix library). The user may have lots of work going on in parallel; they can call GraphBLAS and tell me, per call, how many threads to use in that particular call to GraphBLAS. Then each of my functions may want to use the dense CBLAS on small submatrices or vectors, using a single thread each (but doing so in parallel), or I may want to use all of my threads in a single call to the BLAS.

So there is a mix of parallelism on at least 3 levels: the user application, my library, and the BLAS. I'm in the middle, and I need to be told how many threads to use (at most; I can auto-tune downwards if need be). And I thus need to tell the BLAS, beneath me, how many threads it can use, per call.
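The dynamic split described above (8 threads becoming 4, 2, 1, and 1 across four fronts) can be sketched with a hypothetical helper. split_threads and its work estimates are illustrative only, not part of any BLAS or SuiteSparse API, and the sketch assumes at least as many threads as fronts:

```python
# Hypothetical helper: divide the available threads among fronts in
# proportion to their estimated work, giving every front at least one
# thread (the "4 / 2 / 1 / 1" situation described above).
# Assumes total_threads >= len(front_work).

def split_threads(total_threads, front_work):
    total_work = sum(front_work)
    shares = [max(1, round(total_threads * w / total_work)) for w in front_work]
    # the max(1, ...) floor and rounding can oversubscribe;
    # trim the largest shares until the budget is met
    while sum(shares) > total_threads:
        shares[shares.index(max(shares))] -= 1
    return shares

# 8 threads over one big front, one medium front, and two tiny ones:
print(split_threads(8, [8000, 2000, 10, 10]))   # [4, 2, 1, 1]
```

In a real factorization the work estimates would come from the front sizes, and the split would be recomputed as subtrees finish, since (as noted above) the parallelism is very dynamic.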
Returning to this: while I cannot offer a short-term solution, I note that most of the BLAS functions in OpenBLAS "know" to limit themselves to a single thread if the workload is tiny. The spin-wait issue should be (mostly) addressable by setting a very small value for THREAD_TIMEOUT at compilation.
If you are using Python, you might be able to translate its implementation into other languages.
Out of curiosity: for these very small base cases, could it be beneficial to have optimized "hand-coded" routines that could, e.g., be inlined and unrolled to avoid all the overhead of calling into the BLAS?
Perhaps, but the threshold would be delicate, and I spend very little total time in those "small BLAS" calls when they are done properly (in a single thread). I just need them not to be done with too many threads, which really slows things down.
I took a look, but I don't think this would work. I would need to be able to check at link time which BLAS library is being linked in. That's something Python can do, I suppose, but my SuiteSparse libraries are just simple libwhatever.so files, and I don't control the loading and linking of dynamic libraries; someone else handles that when using SuiteSparse in some application. I do know how to write a library that dynamically loads and links in another library (I do that in GraphBLAS, which has its own JIT where it compiles new code at run time). So in theory, I could find library-dependent functions at link time, like
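The run-time probing idea described here can be expressed with the Python stdlib (the same dlopen/dlsym pattern applies in C). openblas_set_num_threads is OpenBLAS's real symbol; everything else below is a hedged sketch that simply degrades when no OpenBLAS is loadable:

```python
# Probe at run time for a library-specific thread-control symbol,
# falling back gracefully when it is absent (the dlopen/dlsym pattern
# described above, expressed with the Python stdlib).
import ctypes
import ctypes.util

def find_set_num_threads():
    """Return openblas_set_num_threads if an OpenBLAS is loadable, else None."""
    path = ctypes.util.find_library("openblas")
    if path is None:
        return None                      # no OpenBLAS on this system
    try:
        lib = ctypes.CDLL(path)
        fn = lib.openblas_set_num_threads
        fn.argtypes = [ctypes.c_int]
        fn.restype = None
        return fn
    except (OSError, AttributeError):
        return None                      # wrong BLAS, or symbol missing

setter = find_set_num_threads()
if setter is not None:
    setter(1)                            # force single-threaded BLAS calls
else:
    print("no OpenBLAS found; leaving the BLAS's default thread count alone")
```

A library like GraphBLAS could do the equivalent in C with dlopen(NULL, ...) and dlsym, probing whichever BLAS the application happened to link in.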
This should no longer be necessary for OpenBLAS starting from the upcoming 0.3.27 (#4441 being the relevant PR). At least in all my tests with big files from your matrix collection, it was always spurious multithreading in GEMV that made OpenBLAS slower than MKL (although I never saw the kind of horrendous slowdown that had been reported back in 2019).
Great! Shall I tag these issues as solved, as far as you can tell? DrTimothyAldenDavis/SuiteSparse#1 and DrTimothyAldenDavis/SuiteSparse#34.
If you trust my assessment, yes; though I guess they could be reopened if anyone complains. The 0.3.27 release is currently slated for the end of March, but will probably happen earlier unless real-life issues flare up again.
Sure! It would be ideal if we could replicate a before-and-after triage of the problem, but I'm OK with considering this fixed unless it flares up again. |
I need finer control of the # of threads that each call to the BLAS uses. My packages are themselves multithreaded. Each of my threads can make its own calls to the BLAS, in parallel. Some of those calls will be for small matrices, or for other cases where I know I need to use one thread. Other calls will want to use, say, 2 threads and no more (because I'm using threads elsewhere). Sometimes I want to use all the threads available.

However, I cannot set a global setting, such as with openblas_set_num_threads, since that would affect all of my calls to the BLAS. My packages themselves are used in other parallel packages. So I need a thread-local way to set the # of threads that OpenBLAS uses, much like this function: https://software.intel.com/en-us/mkl-developer-reference-c-mkl-set-num-threads-local so that I can exactly control how many threads OpenBLAS uses, for each call to the BLAS.

OpenBLAS cannot always assume it has all the threads available to it, since there are other things going on.
See also this discussion: DrTimothyAldenDavis/SuiteSparse#1
Is this possible with OpenBLAS? Is there an OpenBLAS equivalent to mkl_set_num_threads_local?
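The requested per-caller-thread behavior (what MKL's mkl_set_num_threads_local provides) can be sketched in pure Python. Every name below is hypothetical; nothing like this exists in OpenBLAS today, and the sketch only illustrates the thread-local semantics being asked for:

```python
# Hypothetical sketch of a thread-local per-call thread count, the
# behaviour requested here (modelled on MKL's mkl_set_num_threads_local).
# None of these names exist in OpenBLAS today.
import threading

_local = threading.local()

def set_num_threads_local(n):
    """Set this caller thread's BLAS thread count; return the previous
    value (0 meaning 'follow the global setting'), as MKL's variant does."""
    prev = getattr(_local, "n", 0)
    _local.n = n
    return prev

def effective_threads(global_n):
    """The thread-local value wins if set; otherwise the global one."""
    n = getattr(_local, "n", 0)
    return n if n > 0 else global_n

results = {}

def worker(name, n):
    set_num_threads_local(n)             # affects only this caller thread
    results[name] = effective_threads(global_n=8)

t1 = threading.Thread(target=worker, args=("small_front", 1))
t2 = threading.Thread(target=worker, args=("big_front", 4))
t1.start(); t2.start(); t1.join(); t2.join()

print(results)        # each worker saw only its own setting
```

With semantics like these, a multifrontal solver could give each of its worker threads its own BLAS thread budget without any global setting racing between them, which is exactly the control this issue asks for.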