OpenBLAS vs ATLAS performance for MXNET #1897
Which version of OpenBLAS is this, and on what kind of hardware? (And if multithreading, can you tell how many threads are used in both cases?) |
Could you set OPENBLAS_NUM_THREADS=1 and run the OpenBLAS sample one more time? It could be a slight misconfiguration where you use the pthread version from an OpenMP caller; a problem yielding N^2 BLAS threads. |
@vrakesh could you try with the develop version? The pthread locks in blas_thread_server were O(n^2), becoming painfully visible with lots of CPU cores. A perf record of 0.3.3 pthreads on a 1-core/2-HT Atom shows none of them reaching the perf timers with 1 OpenBLAS thread (EDIT: 0.1% in the thread server with the develop version, and none of the others caught).
i.e. 0.02% of productive computation wrapped in 1.33% of thread orchestration (which potentially quadruples with each doubling of the number of CPUs). Also, something is weird with the libmxnet.so linking to OpenBLAS: a spurious liblapack.so import is added (in my case it is a copy of the same OpenBLAS, but it yields undefined behaviour when it differs from the BLAS that is imported). Followed you to the letter except inventory. EDIT: these need to be guarded against early threading, see #1886
|
Really appreciate the quick response, guys. Here is some more information requested by @martin-frbg. Hardware configuration:
OpenBLAS versions tried: 0.2.18 and 0.3.3; both yield similar problems. @brada4 Thank you for your suggestions and analysis, I will try out your suggestions and see if they yield a different result. |
We need the processor trade name; there are plenty of 3.0GHz Xeons, at least one in each generation, e.g. the marketing "model name" line for the last core in /proc/cpuinfo. The first test is to set OPENBLAS_NUM_THREADS=1 and see whether times normalize to (hopefully) better than ATLAS.
Note that the present test sample, with a pthread build inside an OMP parallel section, incurs milliseconds-long ncpu^2 thread oversubscription; bigger samples will slow down because of it, and the more cores your CPU has, the worse it gets. A minimal sketch of the check follows.
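A minimal sketch of that single-thread check, assuming the MXNet sample is launched by a script named run_sample.py (the script name is hypothetical):
export OPENBLAS_NUM_THREADS=1   # force OpenBLAS to run single-threaded
python run_sample.py            # re-run the same benchmark and compare against ATLAS
|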
Also, if you have a recent Xeon that supports AVX512, you should see roughly a 2x speedup with 0.3.4. |
The trade name is Xeon(R) Platinum 8124M. I will try out the new version as well. Thanks again; will get back with results based on the suggestions. |
please make sure to compile for skylakex
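For reference, the SkylakeX target is selected on the make command line; this is the same invocation that appears later in this thread:
make -j $(nproc) TARGET=SKYLAKEX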
|
Hi all, trying to build for SKYLAKEX, the compile fails:
Makefile.L3:532: recipe for target 'sgemm_kernel.o' failed
make[1]: *** [sgemm_kernel.o] Error 1
make[1]: *** Waiting for unfinished jobs....
../kernel/x86_64/sgemm_ncopy_4_skylakex.c: In function ‘sgemm_oncopy’:
../kernel/x86_64/sgemm_ncopy_4_skylakex.c:53:36: warning: unused variable ‘ctemp16’ [-Wunused-variable]
FLOAT ctemp13, ctemp14, ctemp15, ctemp16;
^
../kernel/x86_64/sgemm_ncopy_4_skylakex.c:53:27: warning: unused variable ‘ctemp15’ [-Wunused-variable]
FLOAT ctemp13, ctemp14, ctemp15, ctemp16;
^
../kernel/x86_64/sgemm_ncopy_4_skylakex.c:53:18: warning: unused variable ‘ctemp14’ [-Wunused-variable]
FLOAT ctemp13, ctemp14, ctemp15, ctemp16;
^
../kernel/x86_64/sgemm_ncopy_4_skylakex.c:52:36: warning: unused variable ‘ctemp12’ [-Wunused-variable]
FLOAT ctemp9, ctemp10, ctemp11, ctemp12;
^
../kernel/x86_64/sgemm_ncopy_4_skylakex.c:52:27: warning: unused variable ‘ctemp11’ [-Wunused-variable]
FLOAT ctemp9, ctemp10, ctemp11, ctemp12;
^
../kernel/x86_64/sgemm_ncopy_4_skylakex.c:52:18: warning: unused variable ‘ctemp10’ [-Wunused-variable]
FLOAT ctemp9, ctemp10, ctemp11, ctemp12;
^
../kernel/x86_64/sgemm_beta_skylakex.c: In function ‘sgemm_beta’:
../kernel/x86_64/sgemm_beta_skylakex.c:67:12: warning: AVX512F vector return without AVX512F enabled changes the ABI [-Wpsabi]
z_zero = _mm512_setzero_ps();
^
../kernel/x86_64/sgemm_beta_skylakex.c:68:12: warning: AVX vector return without AVX enabled changes the ABI [-Wpsabi]
y_zero = _mm256_setzero_ps();
^
In file included from /usr/lib/gcc/x86_64-linux-gnu/5/include/immintrin.h:41:0,
from ../kernel/x86_64/sgemm_beta_skylakex.c:41:
/usr/lib/gcc/x86_64-linux-gnu/5/include/avxintrin.h:1202:1: error: inlining failed in call to always_inline ‘_mm256_setzero_ps’: target specific option mismatch
_mm256_setzero_ps (void)
^
../kernel/x86_64/sgemm_beta_skylakex.c:68:12: error: called from here
y_zero = _mm256_setzero_ps();
^
In file included from /usr/lib/gcc/x86_64-linux-gnu/5/include/immintrin.h:45:0,
from ../kernel/x86_64/sgemm_beta_skylakex.c:41:
/usr/lib/gcc/x86_64-linux-gnu/5/include/avx512fintrin.h:237:1: error: inlining failed in call to always_inline ‘_mm512_setzero_ps’: target specific option mismatch
_mm512_setzero_ps (void)
^
../kernel/x86_64/sgemm_beta_skylakex.c:67:12: error: called from here
z_zero = _mm512_setzero_ps();
^
In file included from /usr/lib/gcc/x86_64-linux-gnu/5/include/immintrin.h:45:0,
from ../kernel/x86_64/sgemm_beta_skylakex.c:41:
/usr/lib/gcc/x86_64-linux-gnu/5/include/avx512fintrin.h:5746:1: error: inlining failed in call to always_inline ‘_mm512_storeu_ps’: target specific option mismatch
_mm512_storeu_ps (void *__P, __m512 __A)
^
../kernel/x86_64/sgemm_beta_skylakex.c:78:4: error: called from here
_mm512_storeu_ps(c_offset1 + 16, z_zero);
^
In file included from /usr/lib/gcc/x86_64-linux-gnu/5/include/immintrin.h:45:0,
from ../kernel/x86_64/sgemm_beta_skylakex.c:41:
/usr/lib/gcc/x86_64-linux-gnu/5/include/avx512fintrin.h:5746:1: error: inlining failed in call to always_inline ‘_mm512_storeu_ps’: target specific option mismatch
_mm512_storeu_ps (void *__P, __m512 __A)
Command that was run: make -j $(nproc) TARGET=SKYLAKEX
The full cat /proc/cpuinfo output:
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 85
model name : Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz
stepping : 3
microcode : 0x1000141
cpu MHz : 3000.000
cache size : 25344 KB
physical id : 0
siblings : 8
core id : 0
cpu cores : 4
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single kaiser fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f rdseed adx smap clflushopt clwb avx512cd xsaveopt xsavec xgetbv1 ida arat pku
bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf
bogomips : 6000.00
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual
power management:
(processors 1 through 7 repeat the same information, differing only in the processor, core id, apicid and initial apicid fields) |
The error looks as if your CFLAGS did not contain the required AVX512 target option. |
My assembler is GNU assembler (GNU Binutils for Ubuntu) 2.26.1
Copyright (C) 2015 Free Software Foundation, Inc.
This program is free software; you may redistribute it under the terms of
the GNU General Public License version 3 or later.
This program has absolutely no warranty.
This assembler was configured for a target of `x86_64-linux-gnu'. What is the minimum required version? I will try getting the latest GCC toolset and see if it resolves the build issue.
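A rough sketch of that check (the version requirement is my assumption, not from the thread: -march=skylake-avx512 support arrived around gcc 6, so a reasonably recent compiler is needed; gcc-8 below is a hypothetical newer compiler on this system):
gcc --version                               # check the compiler, not just the assembler
make clean
make -j $(nproc) TARGET=SKYLAKEX CC=gcc-8
|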
#1797 |
Thanks to all your inputs, I was able to compile the dev version with AVX-512. The good news is that with AVX-512 and the thread settings the performance improves to 1.3 ms, but that is still slightly behind ATLAS's 0.8 ms. Will be glad to incorporate any other suggestions. |
Which suggestion? |
I have already incorporated those changes, and was curious to know whether there was something else we could consider. Either way the performance has improved; I will try this as part of a bigger network and post an update here in the issue. |
I have been working on small-matrix performance improvements, but this seems not to be that.
|
You could try building with USE_TLS=1 to get the experimental thread-local storage allocator that should reduce thread creation overhead. Not sure if that is ready for production use however.
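A minimal sketch of that build, reusing the earlier make invocation with the experimental flag added (the combination with TARGET=SKYLAKEX is an assumption):
make -j $(nproc) TARGET=SKYLAKEX USE_TLS=1
|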
I do not understand the changes made. Setting OPENBLAS_NUM_THREADS=1 should have an immediate, easily comparable effect on the sample being run. |
I actually symlinked the new OpenBLAS to /usr/lib/libopenblas, removing the older default link. I also did an ldd check once libmxnet was built; it is using the newer BLAS I built.
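For anyone repeating that verification, a sketch of the check (the library path is a placeholder; point it at the actual libmxnet.so):
ldd /path/to/libmxnet.so | grep -i -e blas -e lapack   # confirm which BLAS/LAPACK is actually loaded
|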
(poking at your test case) the matrix multiply that is done has the parameters discussed below (512 x 512 with K = 1); I'll add these dimensions to my testbed |
My only point is that reducing threads to 1 would remove the burden of thread management (unnecessary in the MXNet OpenMP context). |
(512 x 512 with a K of 1 is a bit of an odd matrix...) I wonder whether the code you have does not actually use sgemm, but a vector dot product instead, for the main part of the computation. |
Could be that ATLAS forwards the sgemm call to sgemv. This has been noted as a potential optimization before, but has not yet been implemented in OpenBLAS.
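To illustrate why such forwarding is plausible (an illustrative sketch, not ATLAS or OpenBLAS code): with K = 1, the product of A (M x 1) and B (1 x N) is a rank-1 update, so the same result is obtainable with cblas_sger, assuming beta == 1 and row-major storage:
#include <cblas.h>
/* C (MxN) += alpha * a (Mx1) * b (1xN), i.e. the K=1 special case of
 * cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
 *             M, N, 1, alpha, a, 1, b, N, 1.0f, C, N); */
void sgemm_k1_as_rank1(int M, int N, float alpha,
                       const float *a, const float *b, float *C)
{
    cblas_sger(CblasRowMajor, M, N, alpha, a, 1, b, 1, C, N);
}
|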
@vrakesh if you're adventurous, could you try the patch at http://git.fenrus.org/tmp/SMALL2.patch? (Martin: that is not yet for merge, but it's a component of the next optimization I'm working on for small (< 192x192x192) matrix sgemm) |
@fenrus75 I can try that out and see. 512 x 512 x 1 is the last fully-connected layer of a bigger network we use; it was slowing down when we ran it on OpenBLAS. @brada4 Ah yes, I agree, but it seems it does not help much based on my test runs; I actually tried the tests on a couple of different hardware setups as well. Will see if fenrus75's patch helps. |
My patch helps some sizes a lot but hurts big square-like matrices. The work left is to pick this new codepath only where it wins... but for 512 512 1 it seemed to help.
|
Tried out fenrus75's patch; I can get up to 45% better performance than CBLAS with this change, for this particular use case. A step in the right direction, I guess. |
You know the saying: 45% here, 45% there, and pretty soon you're starting to talk real performance gains.
|
@fenrus75, do you have any idea of a timeline for when you might be able to merge your patch into master? We (I am working with vrakesh@) are looking for a timeline for when we can consume OpenBLAS with the fix applied, to mitigate the MXNet performance degradation for RNNs... |
I guess priority just got bumped to this weekend??
|
Not sure I understood your answer, @fenrus75. If you could share a timeline, that would definitely help. I am just trying to set expectations within the team, and an ETA would help. |
@ddavydenko could you please first make sure to help your colleague isolate the problem, i.e. whether it is core oversubscription, a sub-optimal work splitter, or really only the missing skylakex code. |
@ddavydenko pull request made. |
awesome, @fenrus75, appreciate the efforts! |
@fenrus75 This is awesome, really appreciate it thank you. :) |
@fenrus75 @martin-frbg - when are you planning to cut a release with this fix? |
I don't have the privs to make a release; that's martin. I just do things to make x86 cpus go faster ;-) (fwiw this weekend Martin helped me land some optimizations for AVX2 that will also help this case on pre-SKX systems) |
^^^ Sounds like a decent line of business to me 🥇 Will wait for @martin-frbg's feedback on the question... |
@lupesko may I make a wild guess that it will be planned when this fix reaches (at least modern) low-end platforms suitable for MXNET? |
Next release is currently planned for Dec 31st (see https://github.com/xianyi/OpenBLAS/milestones) |
Thanks - I will work with the MXNet community to test and confirm we see enhanced performance once this is released. |
@lupesko are initially presented timings now to call community via release notes? |
@brada4 can you please clarify your question? |
Just need quantize vs initial input that effort is worth the matches. |
Checking back in on this one. Any plans to include this in a particular OpenBLAS release in the near future? |
0.3.5 was released on Dec 31 as planned and includes it. Note however that the Julia folks seem to have found a bug in the SkylakeX DGEMM microkernel (#1955) that was added in 0.3.4. |
So did either of you give 0.3.5 a try yet, and if so, did you experience any problems ? |
The issue has been resolved :) thanks a lot. OpenBLAS is now better than ATLAS in this particular operator. |
In most cases OpenBLAS has been more performant than ATLAS when used with the deep learning operators of MXNet.
But recently I discovered that one of the MXNet operators, the FullyConnected operator, was performing worse when using OpenBLAS vs ATLAS.
I used the mxnet profiler to identify the issue.
To reproduce:
export MXNET_PROFILER_AUTOSTART=1
This should spit out the performance of all operators of the network. With ATLAS,
FullyConnected takes 0.823 ms.
With OpenBLAS,
FullyConnected takes 2.623 ms.
The OpenBLAS version takes nearly 3 times longer to compute; this problem compounds in huge networks.
Why is this the case?
Underneath fully connected we call the CBLAS_GEMM function with the shapes above.
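For reference, a minimal sketch of that call with the dimensions discussed in this thread (M = 512, N = 512, K = 1); the row-major layout, zeroed inputs, and alpha/beta values are illustrative assumptions, not MXNet's actual code:
/* build: gcc sgemm_shape.c -lopenblas */
#include <cblas.h>
#include <stdlib.h>

int main(void)
{
    const int M = 512, N = 512, K = 1;                /* the FullyConnected shape from this thread */
    float *A = calloc((size_t)M * K, sizeof(float));  /* input activations, M x K */
    float *B = calloc((size_t)K * N, sizeof(float));  /* weights,           K x N */
    float *C = calloc((size_t)M * N, sizeof(float));  /* output,            M x N */

    /* C = 1.0 * A * B + 0.0 * C */
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                M, N, K, 1.0f, A, K, B, N, 0.0f, C, N);

    free(A); free(B); free(C);
    return 0;
}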
Looking for insight from the OpenBLAS community.