
OpenBLAS vs ATLAS performance for MXNET #1897

Closed · vrakesh opened this issue Dec 1, 2018 · 48 comments

@vrakesh commented Dec 1, 2018

In most cases OpenBLAS has been more performant than ATLAS when used with the deep learning operators of MXNet.

But recently I discovered that one of the MXNet operators, the FullyConnected operator, was performing worse when using OpenBLAS than when using ATLAS.

I used the mxnet profiler to identify the issue.

To reproduce

  1. Run the following Python script, with the profiler autostart variable enabled in the shell:
export MXNET_PROFILER_AUTOSTART=1
import mxnet as mx
from mxnet import profiler
profiler.set_config(profile_all=True, aggregate_stats=True, filename='profile_output.json')
data = mx.symbol.Variable('data')
net = mx.sym.FullyConnected(data=data, name='fc1', num_hidden=512)
print(net.list_arguments())
data = mx.nd.random.normal(-1, 1, (512,))
fc1_weight = mx.nd.random.normal(-1, 1, (512, 1))
fc1_bias = mx.nd.random.normal(-1, 1, (512,))
exe = net.bind(ctx=mx.cpu(), args={'data': data, 'fc1_weight': fc1_weight, 'fc1_bias': fc1_bias})
exe.forward()
exe.outputs[0].wait_to_read()
profiler.dump()
print(profiler.dumps())
  2. First, compile MXNet with ATLAS:
# Execute the following from mxnet root src folder
make -j $(nproc) USE_BLAS=atlas
cd python
python setup.py install
python <run-script-above>.py

This should print the timing of every operator in the network (the call goes through the CBLAS interface).
FullyConnected takes 0.823 ms with ATLAS.

  3. Now compile with OpenBLAS:
# Execute the following from mxnet root src folder
make -j $(nproc) USE_BLAS=openblas
cd python
python setup.py install
python <run-script-above>.py

This should print the timing of every operator in the network.
FullyConnected takes 2.623 ms with OpenBLAS.

The OpenBLAS version takes nearly 3 times longer to compute, and this problem compounds in large networks.

Why is this the case?

Underneath FullyConnected we call the CBLAS GEMM function with the shapes above.

Looking for insight from the OpenBLAS community.

@martin-frbg (Collaborator)

Which version of OpenBLAS is this, and on what kind of hardware? (And if multithreading, can you tell how many threads are used in both cases?)

@brada4 (Contributor) commented Dec 1, 2018

Could you set OPENBLAS_NUM_THREADS=1 and run the OpenBLAS sample one more time?
The actual cblas calls are so short that Linux perf does not catch them, but it does catch the pthread locks and the thread server. Also, pip install pulled in a binary numpy with OpenBLAS 0.3.0.dev included.

It could be a slight misconfiguration in which the pthread build is used from an OpenMP caller, a problem that yields N^2 BLAS threads.
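
A minimal way to run that check, assuming the reproduction script above is saved as repro.py (a placeholder name) and numpy is the pip-installed one:

# pin OpenBLAS to a single thread for this run only
OPENBLAS_NUM_THREADS=1 python repro.py

# check which BLAS the pip-installed numpy was built against
python -c "import numpy; numpy.show_config()"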

@brada4 (Contributor) commented Dec 1, 2018

@vrakesh could you try the develop version? The pthread locks in blas_thread_server were O(n^2), which becomes painfully visible with lots of CPU cores.
The OpenMP version (libopenblaso.so on Fedora etc.) uses the single-threaded path inside parallel sections; the pthread version has no way to detect them and spins up threads at full speed.

perf record of 0.3.3 pthreads on a 1-core/2-HT Atom (none of these reach the perf timers with single-threaded OpenBLAS; EDIT: 0.1% thread server with the develop version and none of the others caught):

1.30%  python  [.] blas_thread_server
0.01%  python  [.] pthread_mutex_unlock@plt
0.01%  python  [.] sgemm_beta_ATOM
0.01%  python  [.] sgemm_kernel_ATOM
0.01%  python  [.] sched_yield@plt
0.01%  python  [.] pthread_mutex_lock@plt

i.e. 0.02% of productive computation wrapped in 1.33% of thread orchestration (which potentially quadruples with every doubling of the CPU count).
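
For reference, a profile like the one above can be collected roughly as follows (a sketch; repro.py is again the placeholder name of the reproduction script):

perf record -g -- python repro.py   # sample the whole run, including the BLAS threads
perf report --stdio | head -n 20    # top symbols by sample count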

Also, something is odd with how libmxnet.so links to OpenBLAS: a spurious liblapack.so import is added (in my case it is a copy of the same OpenBLAS, but it yields undefined behaviour when it differs from the imported BLAS).

Followed your steps to the letter, except:
python setup.py install --prefix=$HOME/.local/

Inventory (EDIT: leave those, they need to be guarded against early threading, #1886):

cblas_dsyrk
cblas_ssyrk
dpotri_
spotri_

@vrakesh (Author) commented Dec 2, 2018

Really appreciate the quick response, guys.

Here is some more information, as requested by @martin-frbg.

Hardware configuration:

  1. Intel Xeon @ 3.0 GHz, 4 cores and 8 threads.

OpenBLAS versions tried: 0.2.18 and 0.3.3; both show similar problems.

@brada4 Thank you for your suggestions and analysis. I will try them out and see if they yield a different result.

@brada4 (Contributor) commented Dec 2, 2018

We need the processor trade name; there are plenty of 3.0 GHz Xeons, at least one in each generation. E.g. post the marketing "model name" line, or the section for the last core, from cat /proc/cpuinfo.

The first test is to set OPENBLAS_NUM_THREADS=1 and see whether the times normalize to (hopefully) better than ATLAS.

  • If that improves matters, you can make your current pthread build work well with this parameter.
  • You can build the OpenMP OpenBLAS, which does the same thread reduction automagically inside parallel sections without any specific parameters, but stays parallel in normal code (see the build sketch below).
  • Or, better, use develop, which will soon be released as 0.3.4 and where thread-management overhead is eased.

Note that the present test sample with a pthread build inside an OpenMP parallel section incurs milliseconds of ncpu^2 thread oversubscription; bigger samples will slow down because of it, and the more cores your CPU has, the worse it gets.
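
A minimal sketch of the OpenMP build option mentioned in the list above (USE_OPENMP and PREFIX are standard OpenBLAS make variables; the install prefix is only an example):

# build OpenBLAS with OpenMP threading instead of the plain pthread backend
make -j $(nproc) USE_OPENMP=1
sudo make install PREFIX=/opt/openblas-omp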

@fenrus75 (Contributor) commented Dec 3, 2018

Also, if you have a recent Xeon that supports AVX-512, you should see roughly a 2x speedup with 0.3.4 (but do compile OpenBLAS for SKYLAKEX to get that).

@vrakesh (Author) commented Dec 3, 2018

The trade name is Xeon(R) Platinum 8124M. I will try out the new version as well. Thanks again; I will get back with results based on the suggestions.

@fenrus75 (Contributor) commented Dec 3, 2018 via email

@vrakesh (Author) commented Dec 3, 2018

Hi all,
Here are some updates

  1. The past builds were indeed compiled for SKYLAKEX; the makefile has been detecting it automatically.

  2. I tried building 0.3.3 with OpenMP and also tried setting OPENBLAS_NUM_THREADS=1 when not using OpenMP; neither seems to resolve the performance difference.

  3. I tried building the develop branch and the 0.3.4 release; both fail with the following error.

Makefile.L3:532: recipe for target 'sgemm_kernel.o' failed
make[1]: *** [sgemm_kernel.o] Error 1
make[1]: *** Waiting for unfinished jobs....
../kernel/x86_64/sgemm_ncopy_4_skylakex.c: In function ‘sgemm_oncopy’:
../kernel/x86_64/sgemm_ncopy_4_skylakex.c:53:36: warning: unused variable ‘ctemp16’ [-Wunused-variable]
   FLOAT ctemp13, ctemp14, ctemp15, ctemp16;
                                    ^
../kernel/x86_64/sgemm_ncopy_4_skylakex.c:53:27: warning: unused variable ‘ctemp15’ [-Wunused-variable]
   FLOAT ctemp13, ctemp14, ctemp15, ctemp16;
                           ^
../kernel/x86_64/sgemm_ncopy_4_skylakex.c:53:18: warning: unused variable ‘ctemp14’ [-Wunused-variable]
   FLOAT ctemp13, ctemp14, ctemp15, ctemp16;
                  ^
../kernel/x86_64/sgemm_ncopy_4_skylakex.c:52:36: warning: unused variable ‘ctemp12’ [-Wunused-variable]
   FLOAT  ctemp9, ctemp10, ctemp11, ctemp12;
                                    ^
../kernel/x86_64/sgemm_ncopy_4_skylakex.c:52:27: warning: unused variable ‘ctemp11’ [-Wunused-variable]
   FLOAT  ctemp9, ctemp10, ctemp11, ctemp12;
                           ^
../kernel/x86_64/sgemm_ncopy_4_skylakex.c:52:18: warning: unused variable ‘ctemp10’ [-Wunused-variable]
   FLOAT  ctemp9, ctemp10, ctemp11, ctemp12;
                  ^
../kernel/x86_64/sgemm_beta_skylakex.c: In function ‘sgemm_beta’:
../kernel/x86_64/sgemm_beta_skylakex.c:67:12: warning: AVX512F vector return without AVX512F enabled changes the ABI [-Wpsabi]
     z_zero = _mm512_setzero_ps();
            ^
../kernel/x86_64/sgemm_beta_skylakex.c:68:12: warning: AVX vector return without AVX enabled changes the ABI [-Wpsabi]
     y_zero = _mm256_setzero_ps();
            ^
In file included from /usr/lib/gcc/x86_64-linux-gnu/5/include/immintrin.h:41:0,
                 from ../kernel/x86_64/sgemm_beta_skylakex.c:41:
/usr/lib/gcc/x86_64-linux-gnu/5/include/avxintrin.h:1202:1: error: inlining failed in call to always_inline ‘_mm256_setzero_ps’: target specific option mismatch
 _mm256_setzero_ps (void)
 ^
../kernel/x86_64/sgemm_beta_skylakex.c:68:12: error: called from here
     y_zero = _mm256_setzero_ps();
            ^
In file included from /usr/lib/gcc/x86_64-linux-gnu/5/include/immintrin.h:45:0,
                 from ../kernel/x86_64/sgemm_beta_skylakex.c:41:
/usr/lib/gcc/x86_64-linux-gnu/5/include/avx512fintrin.h:237:1: error: inlining failed in call to always_inline ‘_mm512_setzero_ps’: target specific option mismatch
 _mm512_setzero_ps (void)
 ^
../kernel/x86_64/sgemm_beta_skylakex.c:67:12: error: called from here
     z_zero = _mm512_setzero_ps();
            ^
In file included from /usr/lib/gcc/x86_64-linux-gnu/5/include/immintrin.h:45:0,
                 from ../kernel/x86_64/sgemm_beta_skylakex.c:41:
/usr/lib/gcc/x86_64-linux-gnu/5/include/avx512fintrin.h:5746:1: error: inlining failed in call to always_inline ‘_mm512_storeu_ps’: target specific option mismatch
 _mm512_storeu_ps (void *__P, __m512 __A)
 ^
../kernel/x86_64/sgemm_beta_skylakex.c:78:4: error: called from here
    _mm512_storeu_ps(c_offset1 + 16, z_zero);
    ^
In file included from /usr/lib/gcc/x86_64-linux-gnu/5/include/immintrin.h:45:0,
                 from ../kernel/x86_64/sgemm_beta_skylakex.c:41:
/usr/lib/gcc/x86_64-linux-gnu/5/include/avx512fintrin.h:5746:1: error: inlining failed in call to always_inline ‘_mm512_storeu_ps’: target specific option mismatch
 _mm512_storeu_ps (void *__P, __m512 __A)

Command that was run

make -j $(nproc) TARGET=SKYLAKEX

The full cat /proc/cpuinfo output:

processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 85
model name	: Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz
stepping	: 3
microcode	: 0x1000141
cpu MHz		: 3000.000
cache size	: 25344 KB
physical id	: 0
siblings	: 8
core id		: 0
cpu cores	: 4
apicid		: 0
initial apicid	: 0
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single kaiser fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f rdseed adx smap clflushopt clwb avx512cd xsaveopt xsavec xgetbv1 ida arat pku
bugs		: cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf
bogomips	: 6000.00
clflush size	: 64
cache_alignment	: 64
address sizes	: 46 bits physical, 48 bits virtual
power management:

processor	: 1
vendor_id	: GenuineIntel
cpu family	: 6
model		: 85
model name	: Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz
stepping	: 3
microcode	: 0x1000141
cpu MHz		: 3000.000
cache size	: 25344 KB
physical id	: 0
siblings	: 8
core id		: 1
cpu cores	: 4
apicid		: 2
initial apicid	: 2
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single kaiser fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f rdseed adx smap clflushopt clwb avx512cd xsaveopt xsavec xgetbv1 ida arat pku
bugs		: cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf
bogomips	: 6000.00
clflush size	: 64
cache_alignment	: 64
address sizes	: 46 bits physical, 48 bits virtual
power management:

processor	: 2
vendor_id	: GenuineIntel
cpu family	: 6
model		: 85
model name	: Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz
stepping	: 3
microcode	: 0x1000141
cpu MHz		: 3000.000
cache size	: 25344 KB
physical id	: 0
siblings	: 8
core id		: 2
cpu cores	: 4
apicid		: 4
initial apicid	: 4
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single kaiser fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f rdseed adx smap clflushopt clwb avx512cd xsaveopt xsavec xgetbv1 ida arat pku
bugs		: cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf
bogomips	: 6000.00
clflush size	: 64
cache_alignment	: 64
address sizes	: 46 bits physical, 48 bits virtual
power management:

processor	: 3
vendor_id	: GenuineIntel
cpu family	: 6
model		: 85
model name	: Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz
stepping	: 3
microcode	: 0x1000141
cpu MHz		: 3000.000
cache size	: 25344 KB
physical id	: 0
siblings	: 8
core id		: 3
cpu cores	: 4
apicid		: 6
initial apicid	: 6
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single kaiser fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f rdseed adx smap clflushopt clwb avx512cd xsaveopt xsavec xgetbv1 ida arat pku
bugs		: cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf
bogomips	: 6000.00
clflush size	: 64
cache_alignment	: 64
address sizes	: 46 bits physical, 48 bits virtual
power management:

processor	: 4
vendor_id	: GenuineIntel
cpu family	: 6
model		: 85
model name	: Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz
stepping	: 3
microcode	: 0x1000141
cpu MHz		: 3000.000
cache size	: 25344 KB
physical id	: 0
siblings	: 8
core id		: 0
cpu cores	: 4
apicid		: 1
initial apicid	: 1
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single kaiser fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f rdseed adx smap clflushopt clwb avx512cd xsaveopt xsavec xgetbv1 ida arat pku
bugs		: cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf
bogomips	: 6000.00
clflush size	: 64
cache_alignment	: 64
address sizes	: 46 bits physical, 48 bits virtual
power management:

processor	: 5
vendor_id	: GenuineIntel
cpu family	: 6
model		: 85
model name	: Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz
stepping	: 3
microcode	: 0x1000141
cpu MHz		: 3000.000
cache size	: 25344 KB
physical id	: 0
siblings	: 8
core id		: 1
cpu cores	: 4
apicid		: 3
initial apicid	: 3
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single kaiser fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f rdseed adx smap clflushopt clwb avx512cd xsaveopt xsavec xgetbv1 ida arat pku
bugs		: cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf
bogomips	: 6000.00
clflush size	: 64
cache_alignment	: 64
address sizes	: 46 bits physical, 48 bits virtual
power management:

processor	: 6
vendor_id	: GenuineIntel
cpu family	: 6
model		: 85
model name	: Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz
stepping	: 3
microcode	: 0x1000141
cpu MHz		: 3000.000
cache size	: 25344 KB
physical id	: 0
siblings	: 8
core id		: 2
cpu cores	: 4
apicid		: 5
initial apicid	: 5
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single kaiser fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f rdseed adx smap clflushopt clwb avx512cd xsaveopt xsavec xgetbv1 ida arat pku
bugs		: cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf
bogomips	: 6000.00
clflush size	: 64
cache_alignment	: 64
address sizes	: 46 bits physical, 48 bits virtual
power management:

processor	: 7
vendor_id	: GenuineIntel
cpu family	: 6
model		: 85
model name	: Intel(R) Xeon(R) Platinum 8124M CPU @ 3.00GHz
stepping	: 3
microcode	: 0x1000141
cpu MHz		: 3000.000
cache size	: 25344 KB
physical id	: 0
siblings	: 8
core id		: 3
cpu cores	: 4
apicid		: 7
initial apicid	: 7
fpu		: yes
fpu_exception	: yes
cpuid level	: 13
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single kaiser fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx avx512f rdseed adx smap clflushopt clwb avx512cd xsaveopt xsavec xgetbv1 ida arat pku
bugs		: cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf
bogomips	: 6000.00
clflush size	: 64
cache_alignment	: 64
address sizes	: 46 bits physical, 48 bits virtual
power management:

@martin-frbg (Collaborator)

The error looks as if your CFLAGS did not contain -march=skylake-avx512 to enable AVX-512 support in your compiler (strange, as this should be auto-added by 0.3.4 at least), or your assembler may be too old.
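
A quick way to check whether the local gcc and binutils can actually build AVX-512 code (a sketch; the temporary file path is arbitrary):

# try to compile a trivial AVX-512 intrinsic; failure means the toolchain is too old
cat > /tmp/avx512_check.c <<'EOF'
#include <immintrin.h>
int main(void) { volatile __m512 z = _mm512_setzero_ps(); (void)z; return 0; }
EOF
gcc -O2 -mavx512f /tmp/avx512_check.c -o /tmp/avx512_check && echo "AVX-512 toolchain OK"
as --version | head -n 1   # OpenBLAS also needs a reasonably recent assembler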

@vrakesh (Author) commented Dec 3, 2018

My assembler is

GNU assembler (GNU Binutils for Ubuntu) 2.26.1
Copyright (C) 2015 Free Software Foundation, Inc.
This program is free software; you may redistribute it under the terms of
the GNU General Public License version 3 or later.
This program has absolutely no warranty.
This assembler was configured for a target of `x86_64-linux-gnu'.

What is the minimum required version?

I will try getting the latest GCC toolchain and see if it resolves the build issue.

@brada4 (Contributor) commented Dec 3, 2018

#1797
DYNAMIC_ARCH would detect and disable missing AVX-512 support in the compiler. gcc-6 or newer from an Ubuntu-made PPA (your binutils version tag suggests xenial/16.04) will do just fine (also add the PPA gfortran; mixing versions will yield problems a few minutes later).
CentOS 7's binutils 2.25.1 suffices for AVX-512 assembly.
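
A sketch of the build described here, assuming a newer gcc/gfortran pair has been installed from a PPA (the -6 suffix is only an example):

# DYNAMIC_ARCH builds kernels for many CPU types and selects one at runtime;
# it also skips the AVX-512 kernels cleanly if the compiler cannot emit them
make -j $(nproc) DYNAMIC_ARCH=1 CC=gcc-6 FC=gfortran-6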

@vrakesh (Author) commented Dec 4, 2018

Thanks to all your inputs, I was able to compile the develop version with AVX-512.

The good news is that with AVX-512 and the thread settings the performance improves to 1.3 ms, but it is still slightly behind ATLAS's 0.8 ms. I will be glad to incorporate any other suggestions.

@brada4 (Contributor) commented Dec 4, 2018

Which suggestion?

  • Setting OPENBLAS_NUM_THREADS to 1
  • Using the OpenMP build of OpenBLAS

@vrakesh (Author) commented Dec 4, 2018

I have already incorporated both of those changes; I was curious whether there was something else we could consider. Either way, the performance has improved. I will try this as part of a bigger network and post an update here in the issue.

@fenrus75 (Contributor) commented Dec 4, 2018 via email

@martin-frbg (Collaborator)

You could try building with USE_TLS=1 to get the experimental thread-local storage allocator, which should reduce thread creation overhead. Not sure if that is ready for production use, however.
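
For completeness, the flag goes on the same make line as the other build options (TARGET is shown only because it matches this machine; whether USE_TLS helps here is untested):

make -j $(nproc) TARGET=SKYLAKEX USE_TLS=1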

@brada4 (Contributor) commented Dec 4, 2018

I do not understand the changes made. Setting OPENBLAS_NUM_THREADS=1 should have an immediate, easily comparable effect on the sample being run.
If you run ldd against the binary produced, it may well still link to the old Ubuntu package of OpenBLAS, which would be quite bad for small GEMM.
A very hackish fix would be to LD_PRELOAD the freshly built libopenblas*.so so that it certainly overrides the imported library, and eventually also liblapack.so.
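
A sketch of both checks; the library paths are placeholders for wherever libmxnet.so and the freshly built OpenBLAS actually live:

# see which BLAS/LAPACK libraries libmxnet.so resolves to at runtime
ldd /path/to/libmxnet.so | grep -iE 'blas|lapack'

# hackish override: force the freshly built library ahead of whatever is linked
LD_PRELOAD=/path/to/build/libopenblas.so python repro.py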

@vrakesh (Author) commented Dec 4, 2018

I actually symlinked the new OpenBLAS to /usr/lib/libopenblas, removing the older link that was there by default, and I also did an ldd check once libmxnet was built; it is using the newer BLAS I built.

@fenrus75 (Contributor) commented Dec 5, 2018

(poking at your test case)

the matrix multiply that is done has parameters
M N K : 512 512 1

I'll add these dimensions to my testbed
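
For readers following along, the call in question boils down to something like the following C sketch (row-major layout, no transposes, and the alpha/beta values are assumptions; the issue only states the M/N/K shapes):

/* sgemm with M=512, N=512, K=1: effectively a 512x1 by 1x512 outer product */
#include <cblas.h>
#include <stdlib.h>

int main(void)
{
    const int M = 512, N = 512, K = 1;
    float *A = calloc((size_t)M * K, sizeof(float));  /* 512 x 1   */
    float *B = calloc((size_t)K * N, sizeof(float));  /* 1   x 512 */
    float *C = calloc((size_t)M * N, sizeof(float));  /* 512 x 512 */

    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                M, N, K,
                1.0f, A, K,    /* lda = K */
                      B, N,    /* ldb = N */
                0.0f, C, N);   /* ldc = N */

    free(A); free(B); free(C);
    return 0;
}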

@brada4 (Contributor) commented Dec 5, 2018

My only point is that reducing threads to 1 removes the burden of thread management (which is unnecessary in the MXNet OpenMP context anyway).
Actually, the reproducer sample is quite a good illustration of the problem.

@fenrus75 (Contributor) commented Dec 5, 2018

(512 x 512 with a K of 1 is a bit of an odd matrix...)

I wonder if the code you have should not be using a vector product rather than sgemm for the main part of the computation.

@martin-frbg (Collaborator)

Could be that ATLAS forwards the sgemm call to sgemv. This has been noted as a potential optimization before, but it has not yet been implemented in OpenBLAS.
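
As an illustration of the idea only (not OpenBLAS code): martin-frbg mentions sgemv, but for the specific K=1 shape in this issue the degenerate case is a rank-1 update, which cblas_sger can handle. The helper name and the beta handling below are assumptions for the sketch:

#include <cblas.h>

/* Forward C = alpha*A*B + beta*C with K == 1 (row-major, no transposes)
 * to a rank-1 update: A is effectively an M-vector and B an N-vector. */
static void sgemm_k1_as_rank1(int M, int N, float alpha,
                              const float *A, const float *B,
                              float beta, float *C, int ldc)
{
    /* sger only accumulates, so apply beta to C first */
    for (int i = 0; i < M; i++)
        for (int j = 0; j < N; j++)
            C[i * ldc + j] = (beta == 0.0f) ? 0.0f : beta * C[i * ldc + j];

    /* C += alpha * A * B^T, i.e. the outer product of the two vectors */
    cblas_sger(CblasRowMajor, M, N, alpha, A, 1, B, 1, C, ldc);
}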

@fenrus75 (Contributor) commented Dec 5, 2018

@vrakesh if you're adventurous, could you try the patch at http://git.fenrus.org/tmp/SMALL2.patch ?

(Martin: that is not yet for merge, but it's a component of the next optimization I'm working on for small (< 192x192x192) matrix sgemm)

@vrakesh (Author) commented Dec 5, 2018

@fenrus75 I can try that out and see. 512 x 512 x 1 is the last fully connected layer of a bigger network we use; it was slowing down when we ran it on OpenBLAS.

@brada4 Ah yes, I agree, but it does not seem to help much based on my test runs; I actually tried the tests on a couple of different hardware setups as well. Will see if fenrus75's patch helps.

@fenrus75 (Contributor) commented Dec 5, 2018 via email

@vrakesh (Author) commented Dec 6, 2018

I tried out fenrus75's patch; I get up to 45% better performance than CBLAS with this change, for this particular use case. A step in the right direction, I guess.

@fenrus75 (Contributor) commented Dec 6, 2018 via email

@ddavydenko

@fenrus75, do you have any idea of a timeline for when you might be able to get your patch adopted into master? We (I am working with @vrakesh) are looking at a timeline for when we can consume OpenBLAS with the fix applied, to mitigate the MXNet performance degradation for RNNs...

@fenrus75 (Contributor) commented Dec 7, 2018 via email

@ddavydenko

Not sure I understood your answer, @fenrus75. If you could share some timeline, that would definitely help. I am just trying to set expectations within the team, and an ETA would help.

@brada4 (Contributor) commented Dec 9, 2018

@ddavydenko could you please first make sure to help your colleague isolate the problem, i.e. whether it is core oversubscription, a sub-optimal work splitter, or really only the missing SkylakeX code.
This is a volunteer project; there is no ETA.

@fenrus75 (Contributor)

#1914

@ddavydenko pull request made.

@ddavydenko

Awesome, @fenrus75, appreciate the efforts!

@vrakesh (Author) commented Dec 13, 2018

@fenrus75 This is awesome, really appreciate it, thank you. :)

@lupesko commented Dec 18, 2018

@fenrus75 @martin-frbg - when are you planning to cut a release with this fix?

@fenrus75 (Contributor)

I don't have the privs to make a release; that's Martin. I just do things to make x86 CPUs go faster ;-)

(FWIW, this weekend Martin helped me land some optimizations for AVX2 that will also help this case on pre-SKX systems.)

@lupesko commented Dec 18, 2018

I just do things to make x86 cpus go faster ;-)

^^^ Sounds like a decent line of business to me 🥇

Will wait for @martin-frbg's feedback on the question...

@brada4 (Contributor) commented Dec 18, 2018

@lupesko may I make a wild guess that it will be planned once this fix reaches (at least modern) low-end platforms suitable for MXNet?

@martin-frbg (Collaborator)

Next release is currently planned for Dec 31st (see https://github.com/xianyi/OpenBLAS/milestones)

@lupesko commented Dec 18, 2018

Thanks - I will work with the MXNet community to test and confirm we see enhanced performance once this is released.

@brada4 (Contributor) commented Dec 19, 2018

@lupesko are the initially presented timings now something to take back to the community via the release notes?

@lupesko commented Dec 19, 2018

@brada4 can you please clarify your question?

@brada4 (Contributor) commented Dec 19, 2018

ATLAS (?tatlas? ?satlas?)
Fullyconnected takes 0.823ms
OpenBLAS (?pth? ?omp? ?single?)
Fullyconnected takes 2.623ms

Just need to quantify, against that initial input, whether the effort was worth it.

@ddavydenko

Checking back in on this one. Any plans to include this in a particular OpenBLAS release in the near future?

@martin-frbg (Collaborator)

0.3.5 was released on Dec 31 as planned and includes it. Note however that the Julia folks seem to have found a bug in the SkylakeX DGEMM microkernel (#1955) that was added in 0.3.4.

@martin-frbg (Collaborator)

So did either of you give 0.3.5 a try yet, and if so, did you experience any problems?

@vrakesh (Author) commented Feb 23, 2019

The issue has been resolved :) Thanks a lot. OpenBLAS is now better than ATLAS for this particular operator.

vrakesh closed this as completed Feb 23, 2019