
finer control over the # of threads used by any given call to OpenBLAS #2392

Closed
DrTimothyAldenDavis opened this issue Feb 8, 2020 · 11 comments · Fixed by #4425

Comments

@DrTimothyAldenDavis

I need finer control of the # of threads that each call to the BLAS uses. My packages are themselves multithreaded. Each of my threads can make its own calls to the BLAS, in parallel. Some of those calls will be for small matrices, or for other cases where I know I need to use one thread. Other calls will want to use, say, 2 threads and no more (because I'm using threads elsewhere). Sometimes I want to use all the threads available.

However, I cannot set a global setting, such as with openblas_set_num_threads, since that would affect all of my calls to the BLAS. My packages themselves are used in other parallel packages. So I need a thread-local way to set the # of threads that OpenBLAS uses, much like this function: https://software.intel.com/en-us/mkl-developer-reference-c-mkl-set-num-threads-local so that I can exactly control how many threads OpenBLAS uses, for each call to the BLAS.

OpenBLAS cannot always assume it has all the threads available to it, since there are other things going on.

See also this discussion: DrTimothyAldenDavis/SuiteSparse#1

Is this possible with OpenBLAS? Is there an OpenBLAS equivalent to mkl_set_num_threads_local?

@brada4
Contributor

brada4 commented Feb 8, 2020

If you use the OpenMP build of OpenBLAS (like the one available with Debian/Ubuntu), it falls back to a single thread when it detects that it is being called from inside a parallel region.

The other option, with the plain pthreads build (the default from make) in recent versions, is to set the global thread count, make the call that should be limited, then set a different value before the next call (though there is hardly any locking in that code path to avoid races between concurrent callers).

Please describe which case applies to you and what improvement you would like.

In principle, if a specific procedure is needed for optimal integration, it can be appended to the current FAQ or added as a new wiki item.

@DrTimothyAldenDavis
Author

DrTimothyAldenDavis commented Feb 8, 2020

I see. This is a common problem I have with many BLAS packages. The problem is the assumption that the user package is either parallel, and thus all calls to the BLAS "must" be single-threaded, or that the user package wants to let the BLAS do all the parallelism itself.

That assumption breaks down in many cases.

Consider the case when computing a sparse LU factorization via a multifrontal method. The method constructs a tree, where each node uses a few calls to the BLAS (like dgemm, and the upper/lower triangular solves) to do a local dense LU factorization. At the leaves of the tree, the nodes are tiny (2-by-2 can occur) so of course only a single thread should be used inside a call to the BLAS. There could be thousands or millions of such tiny BLAS calls. I may want to use parallel BLAS soon, but not for these thousands of BLAS calls. The BLAS must use one thread, efficiently, and it should not spin-wait, waiting for work to do (I have lots of work for the threads to do, in parallel factorizations of entire subtrees, where each of my threads is factorizing all the many nodes in a single subtree).

In the middle of the tree, I have many fronts I can factorize in parallel, and the fronts get bigger. So if I (say) have 8 threads available, and I'm working on 4 subtrees, I may want to use 4 threads in one front, 2 in another, and 1 each in two other fronts. Just as an example; the parallelism is very dynamic.

At the top of the tree, there is typically a large front, where all the threads need to work together in a single BLAS call.

Another case occurs in my GraphBLAS package (a sparse matrix library). The user may have lots of work going on in parallel; they can call GraphBLAS and tell me, per call, how many threads to use in that particular call to GraphBLAS. Then each of my functions may want to use the dense CBLAS on small submatrices or vectors, using a single thread each (but do so in parallel), or I may want to use all of my threads in a single call to the BLAS.

So there is a mix of parallelism, on at least 3 levels: the user application, my library, and the BLAS. I'm in the middle, and I need to be told how many threads to use (at most; I can auto-tune downwards if need be). And I thus need to tell the BLAS, beneath me, how many threads it can use, per call.

@martin-frbg
Collaborator

Returning to this: while I cannot offer a short-term solution, I note that most of the BLAS functions in OpenBLAS "know" to limit themselves to a single thread if the workload is tiny. The spin-wait issue should be (mostly) addressable by setting a very small value for THREAD_TIMEOUT at compilation.
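For reference, a hedged sketch of what lowering the spin-wait at build time might look like. The exact knob and its default live in OpenBLAS's Makefile.rule and have changed across versions, so verify against the version you build; the value shown here is an arbitrary example.

```shell
# Assumption (check Makefile.rule for your OpenBLAS version): the spin-wait
# length is governed by the THREAD_TIMEOUT define, with threads spinning
# for roughly 2^THREAD_TIMEOUT cycles before sleeping. One way to lower it
# is to append the define to the common optimization flags at build time:
make COMMON_OPT="-O2 -DTHREAD_TIMEOUT=16"
```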

@jjerphan

jjerphan commented Nov 3, 2023

If you are using Python, threadpoolctl might be useful to dynamically change the number of threads that libraries (including BLAS implementations) use.

You might be able to translate its implementation into other languages.

@KristofferC

> At the leaves of the tree, the nodes are tiny (2-by-2 can occur) so of course only a single thread should be used inside a call to the BLAS. There could be thousands or millions of such tiny BLAS calls.

Out of curiosity, for these very small base cases, could it be beneficial to have optimized "hand coded" routines that could e.g. be inlined and unrolled to avoid all the overhead of calling into BLAS.

@DrTimothyAldenDavis
Author

> Out of curiosity, for these very small base cases, could it be beneficial to have optimized "hand coded" routines that could e.g. be inlined and unrolled to avoid all the overhead of calling into BLAS.

Perhaps, but the threshold would be delicate, and I spend very little total time in those "small BLAS" when they are done properly (in a single thread). I just need them not to be done with too many threads, which really slows things down.

@DrTimothyAldenDavis
Author

> If you are using Python, threadpoolctl might be useful to dynamically change the number of threads that libraries (including BLAS implementations) use.
>
> You might be able to translate its implementation into other languages.

I took a look, but I don't think this would work. I would need to be able to check at link time which BLAS library is being linked in. That's something Python can do, I suppose, but my SuiteSparse libraries are just plain libwhatever.so files, and I don't control the loading and linking of dynamic libraries. Someone else handles that when using SuiteSparse in an application.

I do know how to write a library that dynamically loads and links in another library (I do that in GraphBLAS, which has its own JIT that compiles new code at run time). So in theory, I could look up library-dependent functions at run time, like omp_get_num_threads and so on. But that would require me to control the loading and linking of the BLAS. I don't have that flexibility. The end user application links in the BLAS itself, well before any call to any method in SuiteSparse.

@martin-frbg
Collaborator

This should no longer be necessary for OpenBLAS starting with the upcoming 0.3.27 (#4441 being the relevant PR). At least in all my tests with big files from your matrix collection, it was always spurious multithreading in GEMV that made OpenBLAS slower than MKL (although I never saw the kind of horrendous slowdown that had been reported back in 2019).

@DrTimothyAldenDavis
Author

Great! Shall I tag these issues as solved? DrTimothyAldenDavis/SuiteSparse#1 and DrTimothyAldenDavis/SuiteSparse#34 , as far as you can tell?

@martin-frbg
Collaborator

If you trust my assessment, yes - but I guess they could be reopened if anyone complains. The 0.3.27 release is currently slated for the end of March, but will probably happen earlier unless real-life issues flare up again.

@DrTimothyAldenDavis
Author

Sure! It would be ideal if we could replicate a before-and-after triage of the problem, but I'm OK with considering this fixed unless it flares up again.
