Linking against Intel MKL BLAS recommended; OpenBLAS can cause 100x slowdown #1

ghost · 2019-10-23T13:10:17Z

For version 5.4, Suitesparse_config.mk recommended openblas. For version 5.6, intel mkl is now recommended and the config file looks like wants to link against intel openmp.

TL;DR is the cholmod performance degradation related to the threading library (pthreads vs GNU openmp vs intel openmp)? Can you explain a little more (or just link to a discussion about the issue). Does this issue affect version 5.4 when linked against openblas/pthreads?

Some background about why I'm curious about which threading library to use:

I am a "causal" user of suitesparse mostly as it can be used by other packages such as Sundials, numpy, scikit-umfpack, sparseqr, scikit-sparse, etc...

I've built my own Suitesparse shared libraries (and the downstream packages mentioned above) on linux systems (mostly ubuntu) for several years now - one set of libraries linked against openblas/pthreads and one linked against mkl/GNU openmp. Mostly I stick with openblas/pthreads as I get sufficient performance boost without the head aches related to trying to understand which thread library to use (or how).

However, after noting the (new) recommendation for Suitesparse 5.6 and subsequently trying to work out what to do about it, I've run across discussions like this or this as well as the intel mkl linking recommendations here.

The hobgoblin in my little mind is now screaming at me to be foolishly consistent and either link Suitesparse and all subsequent downstream packages that depend on it against GNU openmp or against intel openmp - but not both (mixed) and certainly not mixed with pthreads.

Is this consistent with your understanding about threading library choice with respect to Suitesparse?

DrTimothyAldenDavis · 2019-10-23T14:01:11Z

openBlas causes a 100x slowdown in CHOLMOD, as compared to the non supernodal method and as compared to the supernodal method with the mkl blas. It’s broken

ghost · 2019-10-23T14:14:56Z

Thank you for responding. I understand that it is broken wrt cholmod, I was hoping to learn a little more about why and if it affects my choice of threading library.

That said, I also understand openblas, mkl, threading libraries and my choice of downstream libraries to use with Suitesparse are not really your concern (and I'm certainly able to figure it out for myself) so I will close this.

Thank you for your efforts on Suitesparse and making it available as opensource.

DrTimothyAldenDavis · 2019-10-24T14:15:41Z

I'd like to keep this as an open issue. SuiteSparse has to play nicely with other libraries, and I don't want its performance to suffer if the wrong mix of libraries is used. I at least need to figure out how to navigation the linking process with multiple threading libraries, and provide some documentation on how end users should link the various libraries.

copyme · 2019-10-28T13:43:23Z

@DrTimothyAldenDavis I think this should be documented in a more visible place. I found this thread by chance and as suggested I recompiled SuiteSparse with MKL, only to find a significant speed improvement. Thanks!

DrTimothyAldenDavis · 2019-10-28T15:58:40Z

I added a comment to this in the top-level README.md file, in bold. This is a recent discovery. I'm hoping to track down the source of this slowdown, and figure out a workaround. It may be a bug in OpenBLAS, or perhaps it's a library build issue. In the meantime, use the Intel MKL.

ViralBShah · 2020-01-08T23:36:09Z

One thing is to turn off multi-threading in openblas for small sizes with a build flag like GEMM_MULTITHREADING_THRESHOLD=50. Does that fix the issue?

gruenich · 2020-05-16T22:56:49Z

Can we improve the title? This issue difficult to spot. I was searching for "performance slowdown with multiple threads".

vorticle · 2020-11-13T09:47:11Z

I don’t know if this is an option, but it seems that BLIS is offering what is needed to remedy this situation:

https://github.com/flame/blis/blob/master/docs/Multithreading.md

They implement a new BLAS-Like interface that allows to specify the number of threads for each call individually. They also seem to offer a BLAS compatibility layer.

drhpc · 2021-05-22T10:54:01Z

I am considering packaging suitesparse with pkgsrc to be used in octave and scipy. We had a aging prototype package for version 4.0.2 so far that I used for building software for my users. I do not know if anyone would be affected by this bug here, I am not using octave or scipy myself.

Do you have a simple test case that I can run to check if our system (16 cores per node) is affected by the OpenBLAS bug with a current build of OpenBLAS with pthread or OpenMP?

DrTimothyAldenDavis · 2021-05-22T23:03:08Z

No, I don't have a test for this bug. I do need to figure out a workaround or at least a test, but I haven't had a chance to do that yet. It's not easy working around bugs in other packages.

…

On Sat, May 22, 2021 at 5:54 AM drhpc ***@***.***> wrote: I am considering packaging suitesparse with pkgsrc to be used in octave and scipy. We had a aging prototype package for version 4.0.2 so far that I used for building software for my users. I do not know if anyone would be affected by this bug here, I am not using octave or scipy myself. Do you have a simple test case that I can run to check if our system (16 cores per node) is affected by the OpenBLAS bug with a current build of OpenBLAS with pthread or OpenMP? — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#1 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AEYIIOO7BNBDOXS4TIRIRMDTO6ENJANCNFSM4JEBB25A> .

drhpc · 2021-05-23T01:30:30Z

As I have no experience with suitesparse, I have to guess … but wouldn't any call to CHOLMOD with some generated test data show the performance degradation when comparing it with differing BLAS implementations (or even just with setting openblas or openmp thread count)? I'd just be happy with some code snipped that triggers the right paths for manual comparison.

It is disheartening to see that this issue is somewhat known on the openblas side and here, but nobody has had the time to address it yet. I am wondering how prevalent the issue is at all.

Also, did this only start to happen with certain versions of suitesparse? Or just when CPU core counts became high enough for people to notice the threading issues with excessive locking?

DrTimothyAldenDavis · 2021-05-24T15:30:18Z

I can see about putting a test case together. The issue would not likely arise in dense matrix computations, and I would think that's were a BLAS library gets tested by its developers. The issue would arise within a sparse multifrontal or supernodal method, which has a huge number of "tiny BLAS" calls, and a smaller number of "medium size", smaller still of "big size" and so on. Most high-performance sparse direct methods are like this, so this performance issue would affect most of them.

…

On Sat, May 22, 2021 at 8:30 PM drhpc ***@***.***> wrote: As I have no experience with suitesparse, I have to guess … but wouldn't any call to CHOLMOD with some generated test data show the performance degradation when comparing it with differing BLAS implementations (or even just with setting openblas or openmp thread count)? I'd just be happy with some code snipped that triggers the right paths for manual comparison. It is disheartening to see that this issue is somewhat known on the openblas side and here, but nobody has had the time to address it yet. I am wondering how prevalent the issue is at all. Also, did this only start to happen with certain versions of suitesparse? Or just when CPU core counts became high enough for people to notice the threading issues with excessive locking? — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#1 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AEYIIOMHE4Z57KDEBVXSHY3TPBLEHANCNFSM4JEBB25A> .

drhpc · 2021-05-25T08:09:21Z

That would be really great! I think when we got a self-contained program that shows the issue (not inside some python binding, octave etc.), it is much more likely to get the OpenBLAS people to fix it. So far it is something they generally have on the radar, but have no grip on.

drhpc · 2021-05-29T17:52:26Z

Nudge from the other side … OpenMathLib/OpenBLAS#1886

I have recently identified a historical oversight where a memory buffer is allocated although it is never used […] speed difference can reach a factor of ten, and some of these functions (mostly level 2 BLAS) appear to be on the code path of cholmod. Old versions of OpenBLAS until about three years ago did not employ any locking around buffer allocations so were much faster but wildly thread-unsafe. This may be the origin of the advice in the suitesparse code.

Does that fit the occurence of the issue with SuiteSparse? 3 Years ago?

DrTimothyAldenDavis · 2021-06-02T19:39:01Z

That sounds about the time that I saw the huge 100x slowdown (it was more than 10x). So perhaps this is fixed now.

…

On Sat, May 29, 2021 at 12:52 PM drhpc ***@***.***> wrote: Nudge from the other side … OpenMathLib/OpenBLAS#1886 <OpenMathLib/OpenBLAS#1886> I have recently identified a historical oversight where a memory buffer is allocated although it is never used […] speed difference can reach a factor of ten, and some of these functions (mostly level 2 BLAS) appear to be on the code path of cholmod. Old versions of OpenBLAS until about three years ago did not employ any locking around buffer allocations so were much faster but wildly thread-unsafe. This may be the origin of the advice in the suitesparse code. Does that fit the occurence of the issue with SuiteSparse? 3 Years ago? — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#1 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AEYIIONVRAX2RRV3KLTMXZDTQESWLANCNFSM4JEBB25A> .

martin-frbg · 2021-06-02T21:02:44Z

Would running the cholmod_demo application with one of the provided test cases from its Matrix directory be expected to reproduce the problem, if it still exists ? Or do you recall it requiring particular conditions or matrix sizes not represented there ?

DrTimothyAldenDavis · 2021-06-02T21:38:51Z

I doubt it. The matrix would need to be larger. I can't recall the conditions or matrix sizes that led to the 100x slowdown.

…

On Wed, Jun 2, 2021 at 4:02 PM Martin Kroeker ***@***.***> wrote: Would running the cholmod_demo application with one of the provided test cases from its Matrix directory be expected to reproduce the problem, if it still exists ? Or do you recall it requiring particular conditions or matrix sizes not represented there ? — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#1 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AEYIIOPHZ6AMPAVBAGKMMFDTQ2MAFANCNFSM4JEBB25A> .

drhpc · 2021-06-09T07:39:41Z

We got OpenBLAS people interested, but so far we only have seen things like 20% performance difference compared to MKL.

Are we sure this 100-fold slowdown thread-deadlock thing wasn't some bad interaction of OpenMP/pthreads with too many threads?

We need a test case that demonstrates this. Don't we have some SuiteSparse user that would be affected? Is there a forum where you could place a call? For something as strong as the recommendation to stick to MKL here, there needs to be evidence. You know, science and all that;-)

DrTimothyAldenDavis · 2021-06-09T15:13:44Z

I understand the need for a test case. I will try to dig it up. I saw the 100x slowdown myself, in my own tests, but I need to locate that test case.

…

On Wed, Jun 9, 2021 at 2:39 AM drhpc ***@***.***> wrote: We got OpenBLAS people interested, but so far we only have seen things like 20% performance difference compared to MKL. Are we sure this 100-fold slowdown thread-deadlock thing wasn't some bad interaction of OpenMP/pthreads with too many threads? We need a test case that demonstrates this. Don't we have some SuiteSparse user that would be affected? Is there a forum where you could place a call? For something as strong as the recommendation to stick to MKL here, there needs to be evidence. You know, science and all that;-) — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub <#1 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AEYIIOI3EEX7DYGRCKKDDC3TR4LEZANCNFSM4JEBB25A> .

pjaap · 2021-11-09T14:04:33Z

Hi! I followed this thread and remembered that I posted a hopefully related issue once in the Ubuntu bug tracker:
https://bugs.launchpad.net/ubuntu/+source/suitesparse/+bug/1843277

There is a small test program attached. I just tested it on a Debian 12 machine with libopenblas-dev installed and had a run time factor of about 6 on a 12-thread/6-core CPU. To emphasize: In parallel it takes 6 times longer than sequential!

Maybe this test helps to find the problem? If I remove libopenblas-dev (I guess libatlas-base-dev is used then) the run time is equal both parallel and sequential (beside a 400% vs 100% load).

martin-frbg · 2021-11-09T21:43:23Z

@pjaap thank you very much. Your test case does indeed display pathological behaviour not seen in my earlier CHOLMOD tests with very small and very large matrices. (Now to find where and why threads are spinning instead of doing actual work...)

martin-frbg · 2021-11-10T19:22:41Z

...though absurd slowdown only occurs when a non-OpenMP build of OpenBLAS is paired with SuiteSparse, probably just leading to oversubscription. When both projects are built for OpenMP, it appears to be just an issue of finding the sweet spot between using all or only one core (With the test case on a 6-core Haswell with multithreading disabled, running on all cores takes twice as long as the optimum of running on only two, while a single thread is 20 percent slower. The non-OpenMP OpenBLAS however is ninety times slower on all cores compared to a single one, and does not benefit from multithreading at all.

DrTimothyAldenDavis · 2021-11-16T14:57:52Z

Thanks for the test case. I'll give it a try. I've also revised the title to this issue.

freeseek · 2023-03-08T23:32:18Z

I have encountered the same issue in a completely reproducible way. I used the cholmod_gpu_stats() to display how much time was spent in each function. When CHOLMOD 4.0.3 was linked to MKL I got this:

CHOLMOD GPU/CPU statistics:
SYRK  CPU calls          856 time   1.6010e-03
      GPU calls            0 time   0.0000e+00
GEMM  CPU calls          633 time   9.2912e-04
      GPU calls            0 time   0.0000e+00
POTRF CPU calls          235 time   4.7159e-04
      GPU calls            0 time   0.0000e+00
TRSM  CPU calls          223 time   5.6744e-04
      GPU calls            0 time   0.0000e+00
time in the BLAS: CPU   3.5691e-03 GPU   0.0000e+00 total:   3.5691e-03
assembly time   0.0000e+00    0.0000e+00

When running with the Debian version of SuiteSparse which uses CHOLMOD 3.0.14 and OpenBLAS:

CHOLMOD GPU/CPU statistics:
SYRK  CPU calls          830 time   3.6409e-03
      GPU calls            0 time   0.0000e+00
GEMM  CPU calls          610 time   1.3313e-03
      GPU calls            0 time   0.0000e+00
POTRF CPU calls          232 time   1.8505e-03
      GPU calls            0 time   0.0000e+00
TRSM  CPU calls          220 time   1.3902e-01
      GPU calls            0 time   0.0000e+00
time in the BLAS: CPU   1.4584e-01 GPU   0.0000e+00 total:   1.4584e-01
assembly time   0.0000e+00    0.0000e+00

The TRSM calls are indeed 100x slower. Is there a way to check at runtime what version of BLAS is being used? Is there a way I could code into my software a check for the presence of this problem so that I could inform the user that he should recompile and link CHOLMOD to a different BLAS implementation? The main issue is that this happens with the default Debian version of SuiteSparse.

martin-frbg · 2023-03-10T10:47:41Z

Same reproducer as in the Ubuntu ticket, or something different ? I really, really want to fix this on the OpenBLAS side. (With the original reproducer, at least the current version realizes that it should run only one thread for the TRSM call, but the overhead from having all the other threads sitting around asking for work still makes it 5x slower. Looks to be a design flaw going back to earliest GotoBLAS, I'm experimenting with actually using openblas_set_num_threads(1) internally - which BTW could be called from SuiteSparse-enabled programs as well already if the workload is known to be small)

freeseek · 2023-03-10T16:03:00Z

@martin-frbg, this is not the same reproducer. Somehow it seems that all of my matrices are affected by this issue. And I had not understood before that CHOLMOD 4 limits the number of threads for the TRSM calls.

For other users bumping into this issue, CHOLMOD 4 has now two new variables in the cholmod_common structure:

    double chunk ;      // chunksize for computing # of threads to use.
                        // Given nwork work to do, # of threads is
                        // max (1, min (floor (work / chunk), nthreads_max))
    int nthreads_max ;  // max # of threads to use in CHOLMOD.  Defaults to
                        // SUITESPARSE_OPENMP_MAX_THREADS.

Which are initialized by cholmod_start() as:

    // -------------------------------------------------------------------------
    // OpenMP initializations
    // -------------------------------------------------------------------------

    Common->chunk = 128000 ;
    Common->nthreads_max = SUITESPARSE_OPENMP_MAX_THREADS ;

Together with the new function cholmod_nthreads():

//==============================================================================
//=== openmp support ===========================================================
//==============================================================================

static inline int cholmod_nthreads  // returns # of OpenMP threads to use
(
    double work,                    // total work to do
    cholmod_common *Common
)
{ 
    #ifdef _OPENMP
    double chunk = Common->chunk ;  // give each thread at least this much work
    int nthreads_max = Common->nthreads_max ;   // max # of threads to use
    if (nthreads_max <= 0)
    {
        nthreads_max = SUITESPARSE_OPENMP_MAX_THREADS ;
    }
    work  = MAX (work, 1) ;
    chunk = MAX (chunk, 1) ;
    int64_t nthreads = (int64_t) floor (work / chunk) ;
    nthreads = MIN (nthreads, nthreads_max) ;
    nthreads = MAX (nthreads, 1) ;
    return ((int) nthreads) ;
    #else
    return (1) ;
    #endif
}

That is now used in t_cholmod_super_numeric.c for the supernodal numeric factorization and the number of threads is then set through:

#pragma omp parallel for num_threads(nthreads)

I am guessing here that CHOLMOD 4 is trying to limit the number of threads for small matrices, to avoid the slowdown issue. Though, for some reasons, this infrastructure is not present in t_cholmod_super_solve.c instead.

So maybe it would be enough to reach out to @svillemot, the maintainer of the suitesparse Debian package, and see if he can update it to a more recent version as the current version is still based on CHOLMOD 3.

martin-frbg · 2023-03-10T16:16:23Z

that may be a different part of the problem, though not directly related to openblas. I'm currently testing with the latest release of Suitesparse built from source, so not affected by any delays in getting new versions of either package into Debian. (btw the openblas provided by debian is possibly a bit older as well)

svillemot · 2023-03-10T18:05:56Z

Debian sid/unstable currently has CHOLMOD 3.0.14 from SuiteSparse 5.12.0. For the time being I cannot update it, because Debian is currently frozen (preparing for the release of Debian “Bookworm” 12).
Regarding OpenBLAS, Debian sid/unstable has version 0.3.21, which is up-to-date.

DrTimothyAldenDavis · 2024-01-23T17:01:02Z

This issue is likely fixed by OpenMathLib/OpenBLAS#4441 which fixes OpenMathLib/OpenBLAS#2265 and OpenMathLib/OpenBLAS#2846 . The fix will appear on OpenBLAS 0.3.27.

ghost closed this as completed Oct 23, 2019

DrTimothyAldenDavis reopened this Oct 24, 2019

copyme mentioned this issue Oct 28, 2019

CX_SPARSE faster than SUITE_SPARSE? ceres-solver/ceres-solver#511

Closed

This was referenced Feb 7, 2020

How the control the number of threads of blas functions inside OMP parallel regions OpenMathLib/OpenBLAS#1192

Closed

finer control over the # of threads used by any given call to OpenBLAS OpenMathLib/OpenBLAS#2392

Closed

jtojnar mentioned this issue Mar 10, 2020

suitesparse: 5.4.0 → 5.7.1 + clean up NixOS/nixpkgs#82274

Merged

10 tasks

h-vetinari mentioned this issue Mar 26, 2020

windows build ? conda-forge/suitesparse-feedstock#51

Closed

drhpc mentioned this issue May 29, 2021

Functions are not guarded against early threading OpenMathLib/OpenBLAS#1886

Closed

DrTimothyAldenDavis changed the title ~~Request for more information about why linking against mkl is now recommended for Suitesparse 5.6.0~~ Linking against Intel MKL BLAS recommend; OpenBLAS can cause 100x slowdown Nov 16, 2021

DrTimothyAldenDavis added the external bug porting issue, or problem with external library or system label Dec 11, 2022

robertgj mentioned this issue Jun 5, 2023

Segfault from t_cholmod_super_numeric.c after call to dpotrf_ #325

Closed

DrTimothyAldenDavis mentioned this issue Oct 9, 2023

BLAS threading preferences when building SuiteSparse #432

Closed

DrTimothyAldenDavis changed the title ~~Linking against Intel MKL BLAS recommend; OpenBLAS can cause 100x slowdown~~ Linking against Intel MKL BLAS recommended; OpenBLAS can cause 100x slowdown Oct 9, 2023

This was referenced Nov 10, 2023

Errors: No BLAS/LAPACK library loaded! SciML/LinearSolve.jl#427

Closed

request: SuiteSparseMKL JuliaPackaging/Yggdrasil#7661

Closed

syby119 mentioned this issue Dec 2, 2023

Intel MKL gives nan on cholmod_dl_demo for nd6k.mtx as the input. #554

Closed

DrTimothyAldenDavis added probably fixed now but need confirmation and removed probably fixed now but need confirmation labels Jan 23, 2024

DrTimothyAldenDavis closed this as completed Jan 23, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Linking against Intel MKL BLAS recommended; OpenBLAS can cause 100x slowdown #1

Linking against Intel MKL BLAS recommended; OpenBLAS can cause 100x slowdown #1

ghost commented Oct 23, 2019 •

edited by ghost

Loading

DrTimothyAldenDavis commented Oct 23, 2019 via email •

edited

Loading

ghost commented Oct 23, 2019

DrTimothyAldenDavis commented Oct 24, 2019

copyme commented Oct 28, 2019 •

edited

Loading

DrTimothyAldenDavis commented Oct 28, 2019

ViralBShah commented Jan 8, 2020

gruenich commented May 16, 2020

vorticle commented Nov 13, 2020

drhpc commented May 22, 2021

DrTimothyAldenDavis commented May 22, 2021 via email

drhpc commented May 23, 2021

DrTimothyAldenDavis commented May 24, 2021 via email

drhpc commented May 25, 2021

drhpc commented May 29, 2021

DrTimothyAldenDavis commented Jun 2, 2021 via email

martin-frbg commented Jun 2, 2021

DrTimothyAldenDavis commented Jun 2, 2021 via email

drhpc commented Jun 9, 2021

DrTimothyAldenDavis commented Jun 9, 2021 via email

pjaap commented Nov 9, 2021

martin-frbg commented Nov 9, 2021

martin-frbg commented Nov 10, 2021

DrTimothyAldenDavis commented Nov 16, 2021

freeseek commented Mar 8, 2023

martin-frbg commented Mar 10, 2023

freeseek commented Mar 10, 2023

martin-frbg commented Mar 10, 2023

svillemot commented Mar 10, 2023

DrTimothyAldenDavis commented Jan 23, 2024 •

edited

Loading

Linking against Intel MKL BLAS recommended; OpenBLAS can cause 100x slowdown #1

Linking against Intel MKL BLAS recommended; OpenBLAS can cause 100x slowdown #1

Comments

ghost commented Oct 23, 2019 • edited by ghost Loading

DrTimothyAldenDavis commented Oct 23, 2019 via email • edited Loading

ghost commented Oct 23, 2019

DrTimothyAldenDavis commented Oct 24, 2019

copyme commented Oct 28, 2019 • edited Loading

DrTimothyAldenDavis commented Oct 28, 2019

ViralBShah commented Jan 8, 2020

gruenich commented May 16, 2020

vorticle commented Nov 13, 2020

drhpc commented May 22, 2021

DrTimothyAldenDavis commented May 22, 2021 via email

drhpc commented May 23, 2021

DrTimothyAldenDavis commented May 24, 2021 via email

drhpc commented May 25, 2021

drhpc commented May 29, 2021

DrTimothyAldenDavis commented Jun 2, 2021 via email

martin-frbg commented Jun 2, 2021

DrTimothyAldenDavis commented Jun 2, 2021 via email

drhpc commented Jun 9, 2021

DrTimothyAldenDavis commented Jun 9, 2021 via email

pjaap commented Nov 9, 2021

martin-frbg commented Nov 9, 2021

martin-frbg commented Nov 10, 2021

DrTimothyAldenDavis commented Nov 16, 2021

freeseek commented Mar 8, 2023

martin-frbg commented Mar 10, 2023

freeseek commented Mar 10, 2023

martin-frbg commented Mar 10, 2023

svillemot commented Mar 10, 2023

DrTimothyAldenDavis commented Jan 23, 2024 • edited Loading

ghost commented Oct 23, 2019 •

edited by ghost

Loading

DrTimothyAldenDavis commented Oct 23, 2019 via email •

edited

Loading

copyme commented Oct 28, 2019 •

edited

Loading

DrTimothyAldenDavis commented Jan 23, 2024 •

edited

Loading