
OpenBLAS vs ATLAS performance for MXNET #1897

Closed · vrakesh opened this issue Dec 1, 2018 · 48 comments

@vrakesh commented Dec 1, 2018

In most cases OpenBLAS has been more performant than ATLAS when used with the deep learning operators of MXNet.

But recently I discovered that one of the MXNet operators, the FullyConnected operator, was performing worse when using OpenBLAS than when using ATLAS.

I used the mxnet profiler to identify the issue.

To reproduce

  1. Run the following Python script, with the profiler autostart variable enabled in the shell:
export MXNET_PROFILER_AUTOSTART=1
import mxnet as mx
from mxnet import profiler
profiler.set_config(profile_all=True, aggregate_stats=True, filename='profile_output.json')
data = mx.symbol.Variable('data')
net = mx.sym.FullyConnected(data=data, name='fc1', num_hidden=512)
print(net.list_arguments())
data = mx.nd.random.normal(-1, 1, (512,))
fc1_weight = mx.nd.random.normal(-1, 1, (512, 1))
fc1_bias = mx.nd.random.normal(-1, 1, (512,))
exe = net.bind(ctx=mx.cpu(), args={'data': data, 'fc1_weight': fc1_weight, 'fc1_bias': fc1_bias})
exe.forward()
exe.outputs[0].wait_to_read()
profiler.dump()
print(profiler.dumps())
  2. First, compile MXNet with ATLAS:
# Execute the following from mxnet root src folder
make -j $(nproc) USE_BLAS=atlas
cd python
python setup.py install
python <run-script-above>.py

This should print the timing of every operator in the network (the call goes through the CBLAS interface).
FullyConnected takes 0.823 ms with ATLAS.

  3. Now compile with OpenBLAS:
# Execute the following from mxnet root src folder
make -j $(nproc) USE_BLAS=openblas
cd python
python setup.py install
python <run-script-above>.py

This should print the timing of every operator in the network.
FullyConnected takes 2.623 ms with OpenBLAS.

The OpenBLAS version takes nearly 3 times longer to compute, and this problem compounds in large networks.

Why is this the case?

Underneath FullyConnected we call the CBLAS GEMM function with the shapes above.

Looking for insight from the OpenBLAS community.

@martin-frbg (Collaborator)

Which version of OpenBLAS is this, and on what kind of hardware? (And if multithreading, can you tell how many threads are used in both cases?)

@brada4 (Contributor) commented Dec 1, 2018

Could you set OPENBLAS_NUM_THREADS=1 and run the OpenBLAS sample one more time?
The actual cblas calls are so short that Linux perf does not catch them, but it does catch the pthread locks and the thread server. Also, pip install pulled in a binary numpy with OpenBLAS 0.3.0.dev included.

It could be a slight misconfiguration in which the pthread build is used from an OpenMP caller, a problem that yields N^2 BLAS threads.
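
A minimal way to run that check, assuming the reproduction script above is saved as repro.py (a placeholder name) and numpy is the pip-installed one:

# pin OpenBLAS to a single thread for this run only
OPENBLAS_NUM_THREADS=1 python repro.py

# check which BLAS the pip-installed numpy was built against
python -c "import numpy; numpy.show_config()"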

@brada4 (Contributor) commented Dec 1, 2018

@vrakesh could you try the develop version? The pthread locks in blas_thread_server were O(n^2), which becomes painfully visible with lots of CPU cores.
The OpenMP version (libopenblaso.so on Fedora etc.) uses the single-threaded path inside parallel sections; the pthread version has no way to detect them and spins up threads at full speed.

perf record of 0.3.3 pthreads on a 1-core/2-HT Atom (none of these reach the perf timers with single-threaded OpenBLAS; EDIT: 0.1% thread server with the develop version and none of the others caught):

1.30%  python  [.] blas_thread_server
0.01%  python  [.] pthread_mutex_unlock@plt
0.01%  python  [.] sgemm_beta_ATOM
0.01%  python  [.] sgemm_kernel_ATOM
0.01%  python  [.] sched_yield@plt
0.01%  python  [.] pthread_mutex_lock@plt

i.e. 0.02% of productive computation wrapped in 1.33% of thread orchestration (which potentially quadruples with every doubling of the CPU count).
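
For reference, a profile like the one above can be collected roughly as follows (a sketch; repro.py is again the placeholder name of the reproduction script):

perf record -g -- python repro.py   # sample the whole run, including the BLAS threads
perf report --stdio | head -n 20    # top symbols by sample count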

Also, something is odd with how libmxnet.so links to OpenBLAS: a spurious liblapack.so import is added (in my case it is a copy of the same OpenBLAS, but it yields undefined behaviour when it differs from the imported BLAS).

Followed your steps to the letter, except:
python setup.py install --prefix=$HOME/.local/

Inventory (EDIT: leave those, they need to be guarded against early threading, #1886):

cblas_dsyrk
cblas_ssyrk
dpotri_
spotri_

@vrakesh (Author) commented Dec 2, 2018

Really appreciate the quick response, guys.

Here is some more information, as requested by @martin-frbg.

Hardware configuration:

  1. Intel Xeon @ 3.0 GHz, 4 cores and 8 threads.

OpenBLAS versions tried: 0.2.18 and 0.3.3; both show similar problems.

@brada4 Thank you for your suggestions and analysis. I will try them out and see if they yield a different result.

@brada4 (Contributor) commented Dec 2, 2018

We need the processor trade name; there are plenty of 3.0 GHz Xeons, at least one in each generation. E.g. post the marketing "model name" line, or the section for the last core, from cat /proc/cpuinfo.

The first test is to set OPENBLAS_NUM_THREADS=1 and see whether the times normalize to (hopefully) better than ATLAS.

  • If that improves matters, you can make your current pthread build work well with this parameter.
  • You can build the OpenMP OpenBLAS, which does the same thread reduction automagically inside parallel sections without any specific parameters, but stays parallel in normal code (see the build sketch below).
  • Or, better, use develop, which will soon be released as 0.3.4 and where thread-management overhead is eased.

Note that the present test sample with a pthread build inside an OpenMP parallel section incurs milliseconds of ncpu^2 thread oversubscription; bigger samples will slow down because of it, and the more cores your CPU has, the worse it gets.
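
A minimal sketch of the OpenMP build option mentioned in the list above (USE_OPENMP and PREFIX are standard OpenBLAS make variables; the install prefix is only an example):

# build OpenBLAS with OpenMP threading instead of the plain pthread backend
make -j $(nproc) USE_OPENMP=1
sudo make install PREFIX=/opt/openblas-omp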

@fenrus75 (Contributor) commented Dec 3, 2018

Also, if you have a recent Xeon that supports AVX-512, you should see roughly a 2x speedup with 0.3.4 (but do compile OpenBLAS for SKYLAKEX to get that).

@vrakesh (Author) commented Dec 3, 2018

The trade name is Xeon(R) Platinum 8124M. I will try out the new version as well. Thanks again; I will get back with results based on the suggestions.

@fenrus75 (Contributor) commented Dec 3, 2018 via email

@vrakesh (Author) commented Dec 3, 2018

Hi all,
Here are some updates

  1. The past builds were indeed compiled for SKYLAKEX; the makefile has been detecting it automatically.

  2. I tried building 0.3.3 with OpenMP and also tried setting OPENBLAS_NUM_THREADS=1 when not using OpenMP; neither seems to resolve the performance difference.

  3. I tried building the develop branch and the 0.3.4 release; both fail with the following error.

Makefile.L3:532: recipe for target 'sgemm_kernel.o' failed
make[1]: *** [sgemm_kernel.o] Error 1
make[1]: *** Waiting for unfinished jobs....
../kernel/x86_64/sgemm_ncopy_4_skylakex.c: In function ‘sgemm_oncopy’:
../kernel/x86_64/sgemm_ncopy_4_skylakex.c:53:36: warning: unused variable ‘ctemp16’ [-Wunused-variable]
   FLOAT ctemp13, ctemp14, ctemp15, ctemp16;
                                    ^
../kernel/x86_64/sgemm_ncopy_4_skylakex.c:53:27: warning: unused variable ‘ctemp15’ [-Wunused-variable]
   FLOAT ctemp13, ctemp14, ctemp15, ctemp16;
                           ^
../kernel/x86_64/sgemm_ncopy_4_skylakex.c:53:18: warning: unused variable ‘ctemp14’ [-Wunused-variable]
   FLOAT ctemp13, ctemp14, ctemp15, ctemp16;
                  ^
../kernel/x86_64/sgemm_ncopy_4_skylakex.c:52:36: warning: unused variable ‘ctemp12’ [-Wunused-variable]
   FLOAT  ctemp9, ctemp10, ctemp11, ctemp12;
                                    ^
../kernel/x86_64/sgemm_ncopy_4_skylakex.c:52:27: warning: unused variable ‘ctemp11’ [-Wunused-variable]
   FLOAT  ctemp9, ctemp10, ctemp11, ctemp12;
                           ^
../kernel/x86_64/sgemm_ncopy_4_skylakex.c:52:18: warning: unused variable ‘ctemp10’ [-Wunused-variable]
   FLOAT  ctemp9, ctemp10, ctemp11, ctemp12;
                  ^
../kernel/x86_64/sgemm_beta_skylakex.c: In function ‘sgemm_beta’:
../kernel/x86_64/sgemm_beta_skylakex.c:67:12: warning: AVX512F vector return without AVX512F enabled changes the ABI [-Wpsabi]
     z_zero = _mm512_setzero_ps();
            ^
../kernel/x86_64/sgemm_beta_skylakex.c:68:12: warning: AVX vector return without AVX enabled changes the ABI [-Wpsabi]
     y_zero = _mm256_setzero_ps();
            ^
In file included from /usr/lib/gcc/x86_64-linux-gnu/5/include/immintrin.h:41:0,
                 from ../kernel/x86_64/sgemm_beta_skylakex.c:41:
/usr/lib/gcc/x86_64-linux-gnu/5/include/avxintrin.h:1202:1: error: inlining failed in call to always_inline ‘_mm256_setzero_ps’: target specific option mismatch
 _mm256_setzero_ps (void)
 ^
../kernel/x86_64/sgemm_beta_skylakex.c:68:12: error: called from here
     y_zero = _mm256_setzero_ps();
            ^
In file included from /usr/lib/gcc/x86_64-linux-gnu/5/include/immintrin.h:45:0,
                 from ../kernel/x86_64/sgemm_beta_skylakex.c:41:
/usr/lib/gcc/x86_64-linux-gnu/5/include/avx512fintrin.h:237:1: error: inlining failed in call to always_inline ‘_mm512_setzero_ps’: target specific option mismatch
 _mm512_setzero_ps (void)
 ^
../kernel/x86_64/sgemm_beta_skylakex.c:67:12: error: called from here
     z_zero = _mm512_setzero_ps();
            ^
In file included from /usr/lib/gcc/x86_64-linux-gnu/5/include/immintrin.h:45:0,
                 from ../kernel/x86_64/sgemm_beta_skylakex.c:41:
/usr/lib/gcc/x86_64-linux-gnu/5/include/avx512fintrin.h:5746:1: error: inlining failed in call to always_inline ‘_mm512_storeu_ps’: target specific option mismatch
 _mm512_storeu_ps (void *__P, __m512 __A)
 ^
../kernel/x86_64/sgemm_beta_skylakex.c:78:4: error: called from here
    _mm512_storeu_ps(c_offset1 + 16, z_zero);
    ^
In file included from /usr/lib/gcc/x86_64-linux-gnu/5/include/immintrin.h:45:0,
                 from ../kernel/x86_64/sgemm_beta_skylakex.c:41:
/usr/lib/gcc/x86_64-linux-gnu/5/include/avx512fintrin.h:5746:1: error: inlining failed in call to always_inline ‘_mm512_storeu_ps’: target specific option mismatch
 _mm512_storeu_ps (void *__P, __m512 __A)

Command that was run

make -j $(nproc) TARGET=SKYLAKEX

The full cat /proc/cpuinfo output:

processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 85
model name	: Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz
stepping	: 3
microcode	: 0x1000141
cpu MHz		: 3000.000
cache size	: 25344 KB
physical id	: 0
siblings	: 8
core id		: 0
cpu cores	: 4
apicid		: 0
initial apicid	: 0
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single kaiser fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f rdseed adx smap clflushopt clwb avx512cd xsaveopt xsavec xgetbv1 ida arat pku
bugs		: cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf
bogomips	: 6000.00
clflush size	: 64
cache_alignment	: 64
address sizes	: 46 bits physical, 48 bits virtual
power management:

processor	: 1
vendor_id	: GenuineIntel
cpu family	: 6
model		: 85
model name	: Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz
stepping	: 3
microcode	: 0x1000141
cpu MHz		: 3000.000
cache size	: 25344 KB
physical id	: 0
siblings	: 8
core id		: 1
cpu cores	: 4
apicid		: 2
initial apicid	: 2
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single kaiser fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f rdseed adx smap clflushopt clwb avx512cd xsaveopt xsavec xgetbv1 ida arat pku
bugs		: cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf
bogomips	: 6000.00
clflush size	: 64
cache_alignment	: 64
address sizes	: 46 bits physical, 48 bits virtual
power management:

processor	: 2
vendor_id	: GenuineIntel
cpu family	: 6
model		: 85
model name	: Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz
stepping	: 3
microcode	: 0x1000141
cpu MHz		: 3000.000
cache size	: 25344 KB
physical id	: 0
siblings	: 8
core id		: 2
cpu cores	: 4
apicid		: 4
initial apicid	: 4
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single kaiser fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f rdseed adx smap clflushopt clwb avx512cd xsaveopt xsavec xgetbv1 ida arat pku
bugs		: cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf
bogomips	: 6000.00
clflush size	: 64
cache_alignment	: 64
address sizes	: 46 bits physical, 48 bits virtual
power management:

processor	: 3
vendor_id	: GenuineIntel
cpu family	: 6
model		: 85
model name	: Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz
stepping	: 3
microcode	: 0x1000141
cpu MHz		: 3000.000
cache size	: 25344 KB
physical id	: 0
siblings	: 8
core id		: 3
cpu cores	: 4
apicid		: 6
initial apicid	: 6
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single kaiser fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f rdseed adx smap clflushopt clwb avx512cd xsaveopt xsavec xgetbv1 ida arat pku
bugs		: cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf
bogomips	: 6000.00
clflush size	: 64
cache_alignment	: 64
address sizes	: 46 bits physical, 48 bits virtual
power management:

processor	: 4
vendor_id	: GenuineIntel
cpu family	: 6
model		: 85
model name	: Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz
stepping	: 3
microcode	: 0x1000141
cpu MHz		: 3000.000
cache size	: 25344 KB
physical id	: 0
siblings	: 8
core id		: 0
cpu cores	: 4
apicid		: 1
initial apicid	: 1
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single kaiser fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f rdseed adx smap clflushopt clwb avx512cd xsaveopt xsavec xgetbv1 ida arat pku
bugs		: cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf
bogomips	: 6000.00
clflush size	: 64
cache_alignment	: 64
address sizes	: 46 bits physical, 48 bits virtual
power management:

processor	: 5
vendor_id	: GenuineIntel
cpu family	: 6
model		: 85
model name	: Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz
stepping	: 3
microcode	: 0x1000141
cpu MHz		: 3000.000
cache size	: 25344 KB
physical id	: 0
siblings	: 8
core id		: 1
cpu cores	: 4
apicid		: 3
initial apicid	: 3
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single kaiser fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f rdseed adx smap clflushopt clwb avx512cd xsaveopt xsavec xgetbv1 ida arat pku
bugs		: cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf
bogomips	: 6000.00
clflush size	: 64
cache_alignment	: 64
address sizes	: 46 bits physical, 48 bits virtual
power management:

processor	: 6
vendor_id	: GenuineIntel
cpu family	: 6
model		: 85
model name	: Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz
stepping	: 3
microcode	: 0x1000141
cpu MHz		: 3000.000
cache size	: 25344 KB
physical id	: 0
siblings	: 8
core id		: 2
cpu cores	: 4
apicid		: 5
initial apicid	: 5
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single kaiser fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f rdseed adx smap clflushopt clwb avx512cd xsaveopt xsavec xgetbv1 ida arat pku
bugs		: cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf
bogomips	: 6000.00
clflush size	: 64
cache_alignment	: 64
address sizes	: 46 bits physical, 48 bits virtual
power management:

processor	: 7
vendor_id	: GenuineIntel
cpu family	: 6
model		: 85
model name	: Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz
stepping	: 3
microcode	: 0x1000141
cpu MHz		: 3000.000
cache size	: 25344 KB
physical id	: 0
siblings	: 8
core id		: 3
cpu cores	: 4
apicid		: 7
initial apicid	: 7
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single kaiser fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f rdseed adx smap clflushopt clwb avx512cd xsaveopt xsavec xgetbv1 ida arat pku
bugs		: cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf
bogomips	: 6000.00
clflush size	: 64
cache_alignment	: 64
address sizes	: 46 bits physical, 48 bits virtual
power management:

@martin-frbg (Collaborator)

The error looks as if your CFLAGS did not contain -march=skylake-avx512 to enable AVX-512 support in your compiler (strange, as this should be auto-added by 0.3.4 at least), or your assembler may be too old.
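
A quick way to check whether the local gcc and binutils can actually build AVX-512 code (a sketch; the temporary file path is arbitrary):

# try to compile a trivial AVX-512 intrinsic; failure means the toolchain is too old
cat > /tmp/avx512_check.c <<'EOF'
#include <immintrin.h>
int main(void) { volatile __m512 z = _mm512_setzero_ps(); (void)z; return 0; }
EOF
gcc -O2 -mavx512f /tmp/avx512_check.c -o /tmp/avx512_check && echo "AVX-512 toolchain OK"
as --version | head -n 1   # OpenBLAS also needs a reasonably recent assembler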

@vrakesh (Author) commented Dec 3, 2018

My assembler is

GNU assembler (GNU Binutils for Ubuntu) 2.26.1
Copyright (C) 2015 Free Software Foundation, Inc.
This program is free software; you may redistribute it under the terms of
the GNU General Public License version 3 or later.
This program has absolutely no warranty.
This assembler was configured for a target of `x86_64-linux-gnu'.

What is the minimum required version?

I will try getting the latest GCC toolchain and see if it resolves the build issue.

@brada4 (Contributor) commented Dec 3, 2018

#1797
DYNAMIC_ARCH would detect and disable missing AVX-512 support in the compiler. gcc-6 or newer from an Ubuntu-made PPA (your binutils version tag suggests xenial/16.04) will do just fine (also add the PPA gfortran; mixing versions will yield problems a few minutes later).
CentOS 7's binutils 2.25.1 suffices for AVX-512 assembly.
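
A sketch of the build described here, assuming a newer gcc/gfortran pair has been installed from a PPA (the -6 suffix is only an example):

# DYNAMIC_ARCH builds kernels for many CPU types and selects one at runtime;
# it also skips the AVX-512 kernels cleanly if the compiler cannot emit them
make -j $(nproc) DYNAMIC_ARCH=1 CC=gcc-6 FC=gfortran-6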

@vrakesh (Author) commented Dec 4, 2018

Thanks to all your inputs, I was able to compile the develop version with AVX-512.

The good news is that with AVX-512 and the thread settings the performance improves to 1.3 ms, but it is still slightly behind ATLAS's 0.8 ms. I will be glad to incorporate any other suggestions.

@brada4 (Contributor) commented Dec 4, 2018

Which suggestion?

  • Setting OPENBLAS_NUM_THREADS to 1
  • Using the OpenMP build of OpenBLAS

@vrakesh (Author) commented Dec 4, 2018

I have already incorporated both of those changes; I was curious whether there was something else we could consider. Either way, the performance has improved. I will try this as part of a bigger network and post an update here in the issue.

@fenrus75 (Contributor) commented Dec 4, 2018 via email

@martin-frbg (Collaborator)

You could try building with USE_TLS=1 to get the experimental thread-local storage allocator, which should reduce thread creation overhead. Not sure if that is ready for production use, however.
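
For completeness, the flag goes on the same make line as the other build options (TARGET is shown only because it matches this machine; whether USE_TLS helps here is untested):

make -j $(nproc) TARGET=SKYLAKEX USE_TLS=1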

@brada4 (Contributor) commented Dec 4, 2018

I do not understand the changes made. Setting OPENBLAS_NUM_THREADS=1 should have an immediate, easily comparable effect on the sample being run.
If you run ldd against the binary produced, it may well still link to the old Ubuntu package of OpenBLAS, which would be quite bad for small GEMM.
A very hackish fix would be to LD_PRELOAD the freshly built libopenblas*.so so that it certainly overrides the imported library, and eventually also liblapack.so.
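
A sketch of both checks; the library paths are placeholders for wherever libmxnet.so and the freshly built OpenBLAS actually live:

# see which BLAS/LAPACK libraries libmxnet.so resolves to at runtime
ldd /path/to/libmxnet.so | grep -iE 'blas|lapack'

# hackish override: force the freshly built library ahead of whatever is linked
LD_PRELOAD=/path/to/build/libopenblas.so python repro.py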

@vrakesh (Author) commented Dec 4, 2018

I actually symlinked the new OpenBLAS to /usr/lib/libopenblas, removing the older link that was there by default, and I also did an ldd check once libmxnet was built; it is using the newer BLAS I built.

@fenrus75 (Contributor) commented Dec 5, 2018

(poking at your test case)

the matrix multiply that is done has parameters
M N K : 512 512 1

I'll add these dimensions to my testbed
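
For readers following along, the call in question boils down to something like the following C sketch (row-major layout, no transposes, and the alpha/beta values are assumptions; the issue only states the M/N/K shapes):

/* sgemm with M=512, N=512, K=1: effectively a 512x1 by 1x512 outer product */
#include <cblas.h>
#include <stdlib.h>

int main(void)
{
    const int M = 512, N = 512, K = 1;
    float *A = calloc((size_t)M * K, sizeof(float));  /* 512 x 1   */
    float *B = calloc((size_t)K * N, sizeof(float));  /* 1   x 512 */
    float *C = calloc((size_t)M * N, sizeof(float));  /* 512 x 512 */

    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                M, N, K,
                1.0f, A, K,    /* lda = K */
                      B, N,    /* ldb = N */
                0.0f, C, N);   /* ldc = N */

    free(A); free(B); free(C);
    return 0;
}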

@brada4 (Contributor) commented Dec 5, 2018

My only point is that reducing threads to 1 removes the burden of thread management (which is unnecessary in the MXNet OpenMP context anyway).
Actually, the reproducer sample is quite a good illustration of the problem.

@fenrus75 (Contributor) commented Dec 5, 2018

(512 x 512 with a K of 1 is a bit of an odd matrix...)

I wonder if the code you have should not be using a vector product rather than sgemm for the main part of the computation.

@martin-frbg (Collaborator)

Could be that ATLAS forwards the sgemm call to sgemv. This has been noted as a potential optimization before, but it has not yet been implemented in OpenBLAS.
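
As an illustration of the idea only (not OpenBLAS code): martin-frbg mentions sgemv, but for the specific K=1 shape in this issue the degenerate case is a rank-1 update, which cblas_sger can handle. The helper name and the beta handling below are assumptions for the sketch:

#include <cblas.h>

/* Forward C = alpha*A*B + beta*C with K == 1 (row-major, no transposes)
 * to a rank-1 update: A is effectively an M-vector and B an N-vector. */
static void sgemm_k1_as_rank1(int M, int N, float alpha,
                              const float *A, const float *B,
                              float beta, float *C, int ldc)
{
    /* sger only accumulates, so apply beta to C first */
    for (int i = 0; i < M; i++)
        for (int j = 0; j < N; j++)
            C[i * ldc + j] = (beta == 0.0f) ? 0.0f : beta * C[i * ldc + j];

    /* C += alpha * A * B^T, i.e. the outer product of the two vectors */
    cblas_sger(CblasRowMajor, M, N, alpha, A, 1, B, 1, C, ldc);
}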

@fenrus75 (Contributor) commented Dec 5, 2018

@vrakesh if you're adventurous, could you try the patch at http://git.fenrus.org/tmp/SMALL2.patch ?

(Martin: that is not yet for merge, but it's a component of the next optimization I'm working on for small (< 192x192x192) matrix sgemm)

@vrakesh (Author) commented Dec 5, 2018

@fenrus75 I can try that out and see. 512 x 512 x 1 is the last fully connected layer of a bigger network we use; it was slowing down when we ran it on OpenBLAS.

@brada4 Ah yes, I agree, but it does not seem to help much based on my test runs; I actually tried the tests on a couple of different hardware setups as well. Will see if fenrus75's patch helps.

@fenrus75 (Contributor) commented Dec 5, 2018 via email

@vrakesh (Author) commented Dec 6, 2018

I tried out fenrus75's patch; I get up to 45% better performance than CBLAS with this change, for this particular use case. A step in the right direction, I guess.

@fenrus75 (Contributor) commented Dec 6, 2018 via email

@ddavydenko

@fenrus75, do you have any idea of a timeline for when you might be able to get your patch adopted into master? We (I am working with @vrakesh) are looking at a timeline for when we can consume OpenBLAS with the fix applied, to mitigate the MXNet performance degradation for RNNs...

@fenrus75 (Contributor) commented Dec 7, 2018 via email

@ddavydenko

Not sure I understood your answer, @fenrus75. If you could share some timeline, that would definitely help. I am just trying to set expectations within the team, and an ETA would help.

@brada4 (Contributor) commented Dec 9, 2018

@ddavydenko could you please first make sure to help your colleague isolate the problem, i.e. whether it is core oversubscription, a sub-optimal work splitter, or really only the missing SkylakeX code.
This is a volunteer project; there is no ETA.

@fenrus75 (Contributor)

#1914

@ddavydenko pull request made.

@ddavydenko

Awesome, @fenrus75, appreciate the efforts!

@vrakesh (Author) commented Dec 13, 2018

@fenrus75 This is awesome, really appreciate it, thank you. :)

@lupesko commented Dec 18, 2018

@fenrus75 @martin-frbg - when are you planning to cut a release with this fix?

@fenrus75 (Contributor)

I don't have the privs to make a release; that's Martin. I just do things to make x86 CPUs go faster ;-)

(FWIW, this weekend Martin helped me land some optimizations for AVX2 that will also help this case on pre-SKX systems.)

@lupesko commented Dec 18, 2018

I just do things to make x86 cpus go faster ;-)

^^^ Sounds like a decent line of business to me 🥇

Will wait for @martin-frbg's feedback on the question...

@brada4 (Contributor) commented Dec 18, 2018

@lupesko may I make a wild guess that it will be planned once this fix reaches (at least modern) low-end platforms suitable for MXNet?

@martin-frbg (Collaborator)

Next release is currently planned for Dec 31st (see https://github.com/xianyi/OpenBLAS/milestones)

@lupesko commented Dec 18, 2018

Thanks - I will work with the MXNet community to test and confirm we see enhanced performance once this is released.

@brada4 (Contributor) commented Dec 19, 2018

@lupesko are the initially presented timings now something to take back to the community via the release notes?

@lupesko commented Dec 19, 2018

@brada4 can you please clarify your question?

@brada4 (Contributor) commented Dec 19, 2018

ATLAS (?tatlas? ?satlas?)
Fullyconnected takes 0.823ms
OpenBLAS (?pth? ?omp? ?single?)
Fullyconnected takes 2.623ms

Just need to quantify, against that initial input, whether the effort was worth it.

@ddavydenko

Checking back in on this one. Any plans to include this in a particular OpenBLAS release in the near future?

@martin-frbg (Collaborator)

0.3.5 was released on Dec 31 as planned and includes it. Note however that the Julia folks seem to have found a bug in the SkylakeX DGEMM microkernel (#1955) that was added in 0.3.4.

@martin-frbg (Collaborator)

So did either of you give 0.3.5 a try yet, and if so, did you experience any problems?

@vrakesh (Author) commented Feb 23, 2019

The issue has been resolved :) Thanks a lot. OpenBLAS is now better than ATLAS for this particular operator.

vrakesh closed this as completed Feb 23, 2019