Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BLAS.set_num_threads(1) drastically improves dynamics! performance for Atlas benchmark #500

Open
tkoolen opened this issue Oct 13, 2018 · 12 comments

Comments

@tkoolen
Copy link
Collaborator

tkoolen commented Oct 13, 2018

Found this out this week. Baseline on current master:

BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     32.997 μs (0.00% GC)
  median time:      34.839 μs (0.00% GC)
  mean time:        36.525 μs (0.00% GC)
  maximum time:     95.380 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     10

(a bit slower than http://www.juliarobotics.org/RigidBodyDynamics.jl/latest/benchmarks.html for some reason, by the way)

After using LinearAlgebra; BLAS.set_num_threads(1):

BenchmarkTools.Trial:
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     21.929 μs (0.00% GC)
  median time:      22.224 μs (0.00% GC)
  mean time:        23.062 μs (0.00% GC)
  maximum time:     40.309 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     10

So multiple BLAS threads aren't doing us any favor (with the default OpenBLAS anyway), at least for fewer-than-Atlas degrees of freedom.

Not sure how to act on this since this is a global setting. At least BLAS.set_num_threads is fast now (about 4 ns, no allocations), so it could be called before and after dynamics_solve!. But is that thread safe? And how do we get the number of BLAS threads before calling BLAS.set_num_threads(1) so that it can be reset to the original value? Maybe it should just be a performance tip?

@tkoolen
Copy link
Collaborator Author

tkoolen commented Oct 22, 2018

With MKL instead of OpenBLAS this is not an issue. I suspect OpenMathLib/OpenBLAS#1622.

@brada4
Copy link

brada4 commented Dec 26, 2018

Can you get 'perf record "command" ; perf report" from both cases?
...zoom to see which functions in libopenblas.so.0 are being touched (probably add iterations so that more is seen)

Just in case it is something else than _GEMM, ones with less dimensions can be calibrated faster.
There are some improvements for small GEMM lately for recent Intel series (if that applies)

Julia's versioninfo() should tell enough about BLAS & CPU core in use.

@tkoolen
Copy link
Collaborator Author

tkoolen commented Dec 29, 2018

@brada4, thank you for looking at this.

Here's what I did (verbose, mostly just to make this reproducible for myself). I created a test.jl file containing

using RigidBodyDynamics
using AtlasRobot
using Random

function run_benchmark(result, state, n)
    for i = 1 : n
        rand!(state)
        dynamics!(result, state)
    end
    return nothing
end

function main()
    mechanism = AtlasRobot.mechanism()
    state = MechanismState(mechanism)
    result = DynamicsResult(mechanism)
    @time run_benchmark(result, state, 1_000_000)
end

main()

Running this benchmark with julia -O3 --check-bounds=no takes about 35 seconds; compilation time should only be a small fraction of this time.

I then ran sudo cpupower frequency-set --governor performance, followed by

sudo perf record julia --project -O3 --check-bounds=no test.jl # sudo to avoid 'Couldn't record kernel reference relocation symbol' warning

(after running sudo sh -c 'echo -1 >/proc/sys/kernel/perf_event_paranoid').

Output:

 36.554262 seconds (12.29 M allocations: 557.993 MiB, 0.81% gc time)
[ perf record: Woken up 160 times to write data ]
[ perf record: Captured and wrote 43.549 MB perf.data (1141059 samples) ]

Top entries from perf report:

  12.07%  julia    libopenblas64_.so    [.] dtrsm_kernel_RN_HASWELL
  10.85%  julia    [kernel.kallsyms]    [k] syscall_return_via_sysret
   4.57%  julia    libpthread-2.27.so   [.] __pthread_mutex_lock
   4.32%  julia    [kernel.kallsyms]    [k] __schedule
   3.81%  julia    libopenblas64_.so    [.] dgemm_kernel_HASWELL
   3.41%  julia    libopenblas64_.so    [.] dtrsm_oltncopy_HASWELL
   3.21%  julia    [kernel.kallsyms]    [k] update_curr
   2.68%  julia    [kernel.kallsyms]    [k] entry_SYSCALL_64_after_hwframe
   2.44%  julia    [kernel.kallsyms]    [k] do_syscall_64
   2.30%  julia    [kernel.kallsyms]    [k] pick_next_task_fair
   1.83%  julia    libc-2.27.so         [.] __sched_yield
   1.77%  julia    [kernel.kallsyms]    [k] __calc_delta
   1.41%  julia    libopenblas64_.so    [.] blas_thread_server
   1.40%  julia    [kernel.kallsyms]    [k] yield_task_fair
   1.35%  julia    [unknown]            [k] 0xfffffe00000e201b
   1.34%  julia    [unknown]            [k] 0xfffffe000016601b
   1.33%  julia    [unknown]            [k] 0xfffffe000019201b
   1.33%  julia    [unknown]            [k] 0xfffffe000013a01b
   1.25%  julia    [kernel.kallsyms]    [k] sys_sched_yield
   1.21%  julia    [kernel.kallsyms]    [k] pick_next_entity
   1.17%  julia    [unknown]            [k] 0xfffffe000005e01b
   1.15%  julia    [kernel.kallsyms]    [k] rcu_note_context_switch
   1.08%  julia    libpthread-2.27.so   [.] __pthread_mutex_unlock
   1.06%  julia    [unknown]            [k] 0xfffffe000008a01b
   1.04%  julia    [unknown]            [k] 0xfffffe00000b601b
   1.04%  julia    libopenblas64_.so    [.] dgemm_itcopy_HASWELL
   1.04%  julia    [kernel.kallsyms]    [k] _raw_spin_lock
   1.01%  julia    [kernel.kallsyms]    [k] __indirect_thunk_start
   0.93%  julia    [kernel.kallsyms]    [k] native_sched_clock
   0.89%  julia    [kernel.kallsyms]    [k] cpuacct_charge
   0.87%  julia    [kernel.kallsyms]    [k] update_min_vruntime
   0.84%  julia    [kernel.kallsyms]    [k] entry_SYSCALL_64_stage2
   0.76%  julia    [kernel.kallsyms]    [k] schedule
   0.75%  julia    [kernel.kallsyms]    [k] update_rq_clock
   0.75%  julia    [kernel.kallsyms]    [k] cgroup_cpu_stat_updated
   0.66%  julia    [kernel.kallsyms]    [k] __cgroup_account_cputime
   0.51%  julia    [kernel.kallsyms]    [k] sched_clock_cpu

The results above are all without calling BLAS.set_num_threads(1). For comparison, here are the results, with BLAS.set_num_threads(1).

   4.92%  julia    libopenblas64_.so    [.] dgemm_kernel_HASWELL
   2.08%  julia    libjulia.so.1.0      [.] jl_gc_pool_alloc
   1.84%  julia    libjulia.so.1.0      [.] forall_exists_subtype
   1.83%  julia    libjulia.so.1.0      [.] obviously_disjoint.part.14
   1.79%  julia    libjulia.so.1.0      [.] subtype
   1.71%  julia    libjulia.so.1.0      [.] gc_sweep_pool
   1.63%  julia    libjulia.so.1.0      [.] gc_mark_loop
   1.52%  julia    libjulia.so.1.0      [.] jl_unwrap_unionall
   1.40%  julia    libopenblas64_.so    [.] dgemv_n_HASWELL
   1.34%  julia    libopenblas64_.so    [.] dtrsm_kernel_RN_HASWELL
   1.18%  julia    libjulia.so.1.0      [.] jl_apply_generic
   1.06%  julia    libjulia.so.1.0      [.] jl_f_getfield
   1.01%  julia    libopenblas64_.so    [.] dsyrk_kernel_L
   0.96%  julia    libopenblas64_.so    [.] dgemm_beta_HASWELL
   0.92%  julia    libjulia.so.1.0      [.] jl_field_index
   0.89%  julia    libopenblas64_.so    [.] dgemm_otcopy_HASWELL
   0.85%  julia    libjulia.so.1.0      [.] jl_alloc_array_1d
   0.78%  julia    libopenblas64_.so    [.] dtrsm_iltncopy_HASWELL
   0.71%  julia    libc-2.27.so         [.] cfree@GLIBC_2.2.5
   0.69%  julia    libopenblas64_.so    [.] dtrsm_ilnncopy_HASWELL
   0.67%  julia    libopenblas64_.so    [.] dpotf2_L
   0.65%  julia    libc-2.27.so         [.] __memmove_avx_unaligned_erms
   0.65%  julia    sys.so               [.] japi1_cache_lookup_658.clone_1.clone_2
   0.59%  julia    libjulia.so.1.0      [.] has_free_typevars
   0.58%  julia    libjulia.so.1.0      [.] jl_typemap_entry_assoc_exact
   0.56%  julia    libjulia.so.1.0      [.] lookup_type_idx.isra.12
   0.56%  julia    libopenblas64_.so    [.] dtrsm_RN_solve_opt
   0.55%  julia    libopenblas64_.so    [.] dgemm_itcopy_HASWELL
   0.55%  julia    libjulia.so.1.0      [.] forall_exists_equal.part.23
   0.54%  julia    libjulia.so.1.0      [.] typekey_compare.part.11
   0.53%  julia    libopenblas64_.so    [.] dtrsm_kernel_LN_HASWELL
   0.50%  julia    libopenblas64_.so    [.] dot_compute

Julia version info:

julia> using InteractiveUtils; versioninfo()
Julia Version 1.0.3
Commit 099e826241 (2018-12-18 01:34 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Core(TM) i7-6950X CPU @ 3.00GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.0 (ORCJIT, broadwell)

julia> using LinearAlgebra; BLAS.openblas_get_config()
"USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell MAX_THREADS=16"

This is on an i7-6950X CPU @ 3.00GHz (10 physical cores, 20 hyperthreaded).

Let me know if you need the perf.data file.

@brada4
Copy link

brada4 commented Dec 29, 2018

There is no need for perf.data, it is nearly impossible to decode on even slightly different system, does not look very specific to your setup, just that 10/20 cores dramaticize excess parallelism effects better than more common CPUs.
Looks like trsm threads start to early
OpenMathLib/OpenBLAS#1886
Best you can do before after new year - add #undef SMP early in interface/trsm.c and rebuild OpenBLAS , i.e disable trsm threading but leave GEMM threading intact.

@tkoolen
Copy link
Collaborator Author

tkoolen commented Dec 29, 2018

OK, thanks, I'll track that issue.

@ferrolho
Copy link
Contributor

ferrolho commented May 6, 2023

Hello! I am wondering if this is still an issue? Here are my results on an Apple MacBook Air (M1, 2020):

--- results-1-thread.txt	2023-05-06 12:49:34
+++ results-4-threads.txt	2023-05-06 12:49:22
@@ -1,105 +1,105 @@
 
 dynamics_bias!:
 BenchmarkTools.Trial: 10000 samples with 10 evaluations.
- Range (min … max):  3.054 μs …   8.408 μs  ┊ GC (min … max): 0.00% … 0.00%
- Time  (median):     3.129 μs               ┊ GC (median):    0.00%
- Time  (mean ± σ):   3.145 μs ± 107.831 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%
+ Range (min … max):  3.054 μs …   8.871 μs  ┊ GC (min … max): 0.00% … 0.00%
+ Time  (median):     3.112 μs               ┊ GC (median):    0.00%
+ Time  (mean ± σ):   3.124 μs ± 120.414 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%
  Memory estimate: 0 bytes, allocs estimate: 0.
 
 dynamics!:
 BenchmarkTools.Trial: 10000 samples with 10 evaluations.
- Range (min … max):  10.492 μs … 262.429 μs  ┊ GC (min … max): 0.00% … 0.00%
- Time  (median):     10.608 μs               ┊ GC (median):    0.00%
- Time  (mean ± σ):   10.710 μs ±   2.535 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%
+ Range (min … max):  10.508 μs …  26.962 μs  ┊ GC (min … max): 0.00% … 0.00%
+ Time  (median):     10.600 μs               ┊ GC (median):    0.00%
+ Time  (mean ± σ):   10.612 μs ± 181.451 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%
  Memory estimate: 0 bytes, allocs estimate: 0.
 
 momentum:
 BenchmarkTools.Trial: 10000 samples with 10 evaluations.
- Range (min … max):  1.771 μs …  4.638 μs  ┊ GC (min … max): 0.00% … 0.00%
- Time  (median):     1.821 μs              ┊ GC (median):    0.00%
- Time  (mean ± σ):   1.823 μs ± 57.144 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%
+ Range (min … max):  1.767 μs …  4.458 μs  ┊ GC (min … max): 0.00% … 0.00%
+ Time  (median):     1.850 μs              ┊ GC (median):    0.00%
+ Time  (mean ± σ):   1.850 μs ± 60.572 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%
  Memory estimate: 0 bytes, allocs estimate: 0.
 
 mass_matrix!:
 BenchmarkTools.Trial: 10000 samples with 10 evaluations.
- Range (min … max):  3.800 μs …  12.929 μs  ┊ GC (min … max): 0.00% … 0.00%
- Time  (median):     3.946 μs               ┊ GC (median):    0.00%
- Time  (mean ± σ):   3.971 μs ± 210.354 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%
+ Range (min … max):  3.808 μs …   6.621 μs  ┊ GC (min … max): 0.00% … 0.00%
+ Time  (median):     3.900 μs               ┊ GC (median):    0.00%
+ Time  (mean ± σ):   3.933 μs ± 132.276 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%
  Memory estimate: 0 bytes, allocs estimate: 0.
 
 momentum_matrix!:
 BenchmarkTools.Trial: 10000 samples with 10 evaluations.
- Range (min … max):  2.112 μs …  11.562 μs  ┊ GC (min … max): 0.00% … 0.00%
- Time  (median):     2.188 μs               ┊ GC (median):    0.00%
- Time  (mean ± σ):   2.190 μs ± 114.719 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%
+ Range (min … max):  2.104 μs …  4.821 μs  ┊ GC (min … max): 0.00% … 0.00%
+ Time  (median):     2.167 μs              ┊ GC (median):    0.00%
+ Time  (mean ± σ):   2.170 μs ± 59.499 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%
  Memory estimate: 0 bytes, allocs estimate: 0.
 
 center_of_mass:
 BenchmarkTools.Trial: 10000 samples with 10 evaluations.
- Range (min … max):  1.171 μs …  4.283 μs  ┊ GC (min … max): 0.00% … 0.00%
- Time  (median):     1.221 μs              ┊ GC (median):    0.00%
- Time  (mean ± σ):   1.223 μs ± 65.128 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%
+ Range (min … max):  1.175 μs …  4.196 μs  ┊ GC (min … max): 0.00% … 0.00%
+ Time  (median):     1.225 μs              ┊ GC (median):    0.00%
+ Time  (mean ± σ):   1.228 μs ± 57.789 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%
  Memory estimate: 0 bytes, allocs estimate: 0.
 
 geometric_jacobian!:
 BenchmarkTools.Trial: 10000 samples with 10 evaluations.
- Range (min … max):  1.600 μs …  4.229 μs  ┊ GC (min … max): 0.00% … 0.00%
- Time  (median):     1.650 μs              ┊ GC (median):    0.00%
- Time  (mean ± σ):   1.656 μs ± 99.179 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%
+ Range (min … max):  1.583 μs …  3.050 μs  ┊ GC (min … max): 0.00% … 0.00%
+ Time  (median):     1.637 μs              ┊ GC (median):    0.00%
+ Time  (mean ± σ):   1.641 μs ± 37.095 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%
  Memory estimate: 0 bytes, allocs estimate: 0.
 
 constraint_bias!:
 BenchmarkTools.Trial: 10000 samples with 10 evaluations.
- Range (min … max):  5.871 μs …  14.075 μs  ┊ GC (min … max): 0.00% … 0.00%
- Time  (median):     5.967 μs               ┊ GC (median):    0.00%
- Time  (mean ± σ):   6.060 μs ± 253.963 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%
+ Range (min … max):  5.921 μs … 762.712 μs  ┊ GC (min … max): 0.00% … 98.98%
+ Time  (median):     6.017 μs               ┊ GC (median):    0.00%
+ Time  (mean ± σ):   6.172 μs ±   7.569 μs  ┊ GC (mean ± σ):  1.22% ±  0.99%
  Memory estimate: 496 bytes, allocs estimate: 1.
 
 gravitational_potential_energy:
 BenchmarkTools.Trial: 10000 samples with 10 evaluations.
- Range (min … max):  1.183 μs …  2.342 μs  ┊ GC (min … max): 0.00% … 0.00%
- Time  (median):     1.242 μs              ┊ GC (median):    0.00%
- Time  (mean ± σ):   1.241 μs ± 22.244 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%
+ Range (min … max):  1.175 μs …  3.083 μs  ┊ GC (min … max): 0.00% … 0.00%
+ Time  (median):     1.238 μs              ┊ GC (median):    0.00%
+ Time  (mean ± σ):   1.240 μs ± 56.878 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%
  Memory estimate: 0 bytes, allocs estimate: 0.
 
 inverse_dynamics!:
 BenchmarkTools.Trial: 10000 samples with 10 evaluations.
- Range (min … max):  2.683 μs …  13.863 μs  ┊ GC (min … max): 0.00% … 0.00%
- Time  (median):     2.771 μs               ┊ GC (median):    0.00%
- Time  (mean ± σ):   2.788 μs ± 162.449 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%
+ Range (min … max):  2.683 μs …  5.279 μs  ┊ GC (min … max): 0.00% … 0.00%
+ Time  (median):     2.746 μs              ┊ GC (median):    0.00%
+ Time  (mean ± σ):   2.754 μs ± 96.378 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%
  Memory estimate: 0 bytes, allocs estimate: 0.
 
 simulate tree:
 BenchmarkTools.Trial: 9 samples with 10 evaluations.
- Range (min … max):  49.602 ms …  50.468 ms  ┊ GC (min … max): 0.00% … 0.00%
- Time  (median):     49.889 ms               ┊ GC (median):    0.00%
- Time  (mean ± σ):   49.976 ms ± 284.056 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%
+ Range (min … max):  49.606 ms …  52.434 ms  ┊ GC (min … max): 0.00% … 0.00%
+ Time  (median):     50.114 ms               ┊ GC (median):    0.00%
+ Time  (mean ± σ):   50.338 ms ± 820.355 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%
  Memory estimate: 0 bytes, allocs estimate: 0.
 
 constraint_jacobian!:
 BenchmarkTools.Trial: 10000 samples with 10 evaluations.
- Range (min … max):  14.696 μs … 832.163 μs  ┊ GC (min … max): 0.00% … 98.02%
- Time  (median):     15.104 μs               ┊ GC (median):    0.00%
- Time  (mean ± σ):   15.251 μs ±   8.177 μs  ┊ GC (mean ± σ):  0.53% ±  0.98%
+ Range (min … max):  14.762 μs …  8.446 ms  ┊ GC (min … max): 0.00% … 99.78%
+ Time  (median):     15.196 μs              ┊ GC (median):    0.00%
+ Time  (mean ± σ):   17.100 μs ± 88.417 μs  ┊ GC (mean ± σ):  4.93% ±  1.00%
  Memory estimate: 496 bytes, allocs estimate: 1.
 
 mass_matrix! and geometric_jacobian!:
 BenchmarkTools.Trial: 10000 samples with 10 evaluations.
- Range (min … max):  4.071 μs …  7.154 μs  ┊ GC (min … max): 0.00% … 0.00%
- Time  (median):     4.188 μs              ┊ GC (median):    0.00%
- Time  (mean ± σ):   4.207 μs ± 96.458 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%
+ Range (min … max):  4.050 μs …  17.783 μs  ┊ GC (min … max): 0.00% … 0.00%
+ Time  (median):     4.225 μs               ┊ GC (median):    0.00%
+ Time  (mean ± σ):   4.273 μs ± 343.992 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%
  Memory estimate: 0 bytes, allocs estimate: 0.
 
 momentum_rate_bias:
 BenchmarkTools.Trial: 10000 samples with 10 evaluations.
- Range (min … max):  2.554 μs …   5.183 μs  ┊ GC (min … max): 0.00% … 0.00%
- Time  (median):     2.617 μs               ┊ GC (median):    0.00%
- Time  (mean ± σ):   2.624 μs ± 105.863 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%
+ Range (min … max):  2.575 μs …   5.446 μs  ┊ GC (min … max): 0.00% … 0.00%
+ Time  (median):     2.654 μs               ┊ GC (median):    0.00%
+ Time  (mean ± σ):   2.670 μs ± 101.424 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%
  Memory estimate: 0 bytes, allocs estimate: 0.
 
 kinetic_energy:
 BenchmarkTools.Trial: 10000 samples with 10 evaluations.
- Range (min … max):  1.758 μs …  3.658 μs  ┊ GC (min … max): 0.00% … 0.00%
- Time  (median):     1.817 μs              ┊ GC (median):    0.00%
- Time  (mean ± σ):   1.818 μs ± 44.967 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%
+ Range (min … max):  1.771 μs …  4.075 μs  ┊ GC (min … max): 0.00% … 0.00%
+ Time  (median):     1.854 μs              ┊ GC (median):    0.00%
+ Time  (mean ± σ):   1.855 μs ± 64.257 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%
  Memory estimate: 0 bytes, allocs estimate: 0.

Here's the output of versioninfo():

julia> versioninfo()
Julia Version 1.8.5
Commit 17cfb8e65e* (2023-01-08 06:45 UTC)
Platform Info:
  OS: macOS (arm64-apple-darwin22.1.0)
  CPU: 8 × Apple M1
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, apple-m1)
  Threads: 1 on 4 virtual cores

And here is how I ran the benchmarks:

cd ~/git/RigidBodyDynamics.jl/perf
julia --project -O3 --check-bounds=no
include("runbenchmarks.jl")

I ran this once with the contents of perf/runbenchmarks.jl as is, and then a second time after commenting line 13 of perf/runbenchmarks.jl (the one that sets BLAS.set_num_threads(1)).

@ferrolho
Copy link
Contributor

ferrolho commented May 6, 2023

On my old (but trustworthy) Dell XPS 13 9343 (2015) using Julia 1.8.5, I also cannot observe any noticeable differences in performance:

--- results-single-thread.txt	2023-05-06 21:07:21.302449867 +0100
+++ results-multi-thread.txt	2023-05-06 21:07:26.526504507 +0100
@@ -1,59 +1,59 @@
 dynamics_bias!:
 BenchmarkTools.Trial: 10000 samples with 10 evaluations.
- Time  (mean ± σ):   7.248 μs ± 454.342 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%
+ Time  (mean ± σ):   7.252 μs ± 295.422 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%
 
 dynamics!:
 BenchmarkTools.Trial: 10000 samples with 10 evaluations.
- Time  (mean ± σ):   24.448 μs ±  1.159 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%
+ Time  (mean ± σ):   24.113 μs ±  1.609 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%
 
 momentum:
 BenchmarkTools.Trial: 10000 samples with 10 evaluations.
- Time  (mean ± σ):   4.582 μs ± 435.294 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%
+ Time  (mean ± σ):   4.694 μs ± 276.963 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%
 
 mass_matrix!:
 BenchmarkTools.Trial: 10000 samples with 10 evaluations.
- Time  (mean ± σ):   12.483 μs ± 553.176 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%
+ Time  (mean ± σ):   11.935 μs ± 523.177 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%
 
 momentum_matrix!:
 BenchmarkTools.Trial: 10000 samples with 10 evaluations.
- Time  (mean ± σ):   5.256 μs ± 250.141 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%
+ Time  (mean ± σ):   5.262 μs ± 517.004 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%
 
 center_of_mass:
 BenchmarkTools.Trial: 10000 samples with 10 evaluations.
- Time  (mean ± σ):   3.007 μs ± 280.680 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%
+ Time  (mean ± σ):   3.013 μs ± 222.686 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%
 
 geometric_jacobian!:
 BenchmarkTools.Trial: 10000 samples with 10 evaluations.
- Time  (mean ± σ):   3.950 μs ± 198.062 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%
+ Time  (mean ± σ):   4.064 μs ± 493.502 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%
 
 constraint_bias!:
 BenchmarkTools.Trial: 10000 samples with 10 evaluations.
- Time  (mean ± σ):   14.426 μs ± 17.260 μs  ┊ GC (mean ± σ):  1.19% ±  0.99%
+ Time  (mean ± σ):   14.329 μs ± 16.452 μs  ┊ GC (mean ± σ):  1.14% ±  0.99%
 
 gravitational_potential_energy:
 BenchmarkTools.Trial: 10000 samples with 10 evaluations.
- Time  (mean ± σ):   3.001 μs ± 273.534 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%
+ Time  (mean ± σ):   3.047 μs ± 243.480 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%
 
 inverse_dynamics!:
 BenchmarkTools.Trial: 10000 samples with 10 evaluations.
- Time  (mean ± σ):   6.627 μs ± 410.959 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%
+ Time  (mean ± σ):   6.737 μs ± 501.527 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%
 
 simulate tree:
 BenchmarkTools.Trial: 4 samples with 10 evaluations.
- Time  (mean ± σ):   116.726 ms ± 766.823 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%
+ Time  (mean ± σ):   114.619 ms ±   1.122 ms  ┊ GC (mean ± σ):  0.00% ± 0.00%
 
 constraint_jacobian!:
-BenchmarkTools.Trial: 9410 samples with 10 evaluations.
- Time  (mean ± σ):   50.597 μs ± 19.226 μs  ┊ GC (mean ± σ):  0.39% ±  1.00%
+BenchmarkTools.Trial: 9236 samples with 10 evaluations.
+ Time  (mean ± σ):   51.511 μs ± 25.033 μs  ┊ GC (mean ± σ):  0.00% ± 0.00%
 
 mass_matrix! and geometric_jacobian!:
 BenchmarkTools.Trial: 10000 samples with 10 evaluations.
- Time  (mean ± σ):   12.391 μs ± 452.979 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%
+ Time  (mean ± σ):   12.118 μs ± 715.884 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%
 
 momentum_rate_bias:
 BenchmarkTools.Trial: 10000 samples with 10 evaluations.
- Time  (mean ± σ):   6.308 μs ± 521.010 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%
+ Time  (mean ± σ):   6.290 μs ± 357.568 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%
 
 kinetic_energy:
 BenchmarkTools.Trial: 10000 samples with 10 evaluations.
- Time  (mean ± σ):   4.585 μs ± 357.360 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%
+ Time  (mean ± σ):   4.571 μs ± 251.657 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

Here's the output of versioninfo() on this laptop:

julia> versioninfo()
Julia Version 1.8.5
Commit 17cfb8e65ea (2023-01-08 06:45 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 4 × Intel(R) Core(TM) i7-5500U CPU @ 2.40GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.1 (ORCJIT, broadwell)
  Threads: 1 on 4 virtual cores

@ferrolho
Copy link
Contributor

ferrolho commented May 7, 2023

I did some further investigation and I've found that there was a massive leap in performance introduced by Julia version 1.8.

The plots below should demonstrate my findings a bit better.

Benchmark of RBD jl across Julia versions

The benchmark data and the script to generate the plot above can be found in this GitHub Gist.

@rdeits
Copy link
Collaborator

rdeits commented May 8, 2023

Interesting. I'd be excited to see things get that much faster from a newer Julia version, but I wonder if something else is going on here. As an example, the pre-1.8 results for dynamics_bias! look really slow. I would expect that call to take much less than 1ms on any Julia version. For example, here's what I get on my ~5 year old laptop running Julia 1.7.1:

dynamics_bias!:
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     5.957 μs (0.00% GC)
  median time:      6.221 μs (0.00% GC)
  mean time:        6.223 μs (0.00% GC)
  maximum time:     10.473 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     10

That's much more in line with what you saw in 1.8 and later, and it's more like what I would have expected to begin with.

@ferrolho
Copy link
Contributor

ferrolho commented May 8, 2023

You're right! There's something else going on... I guess I got a bit too carried away with the idea of having a big improvement from a major Julia version upgrade. Let's focus on just using Julia 1.7.3, otherwise there's too many changing environment factors.

I've been trying to identify where the performance is taking a hit, and I've found that, using Julia v1.7.3, the degradation takes place somewhere between RBD.jl versions 2.3.0 and 2.3.1 ─ but it may not necessarily be a problem with RBD.jl (it may be one of its dependencies).

Note Because the runbenchmarks.jl script lives in a folder with its own Project.toml, we need to take special care changing versions in that environment before getting the results.

So, here is what I did step-by-step on my Dell XPS 13:

~/git/RigidBodyDynamics.jl/perf                    # Navigate into the `perf` folder of RBD.jl
julia-1.7.3 --project -O3 --check-bounds=no        # Start a Julia 1.7.3 REPL
] add ..#b3a632064f1e2b731c5bb532201dcdb7a3567ea5  # Explicitly add RigidBodyDynamics v2.3.0
] up  # Update all the dependencies

CTRL + D  # Stop the Julia REPL session for the changes to deps versions to take effect

julia-1.7.3 --project -O3 --check-bounds=no        # Start a Julia 1.7.3 REPL
include("runbenchmarks.jl")                        # Run the benchmarks

Then I did the same for RBD.jl 2.3.1:

~/git/RigidBodyDynamics.jl/perf                    # Navigate into the `perf` folder of RBD.jl
julia-1.7.3 --project -O3 --check-bounds=no        # Start a Julia 1.7.3 REPL
] add ..#b7388e086d6cd1b8d38002d692de6d7acd0570b7  # Explicitly add RigidBodyDynamics v2.3.1
] up  # Update all the dependencies

CTRL + D  # Stop the Julia REPL session for the changes to deps versions to take effect

julia-1.7.3 --project -O3 --check-bounds=no        # Start a Julia 1.7.3 REPL
include("runbenchmarks.jl")                        # Run the benchmarks

The dependencies' version changes look like this:
(Posting as a picture because the colourful output is helpful.)

Screenshot from 2023-05-08 22-53-13

Here are the results for those two different versions:

Table I. Time (mean) per method, for RBD.jl 2.3.0 vs 2.3.1.

Method RBD.jl 2.3.0 RBD.jl 2.3.1
dynamics_bias! 7.586 μs 821.496 μs
dynamics! 19.928 μs 664.098 μs
momentum 4.854 μs 205.602 μs
mass_matrix! 6.972 μs 258.167 μs
momentum_matrix! 5.528 μs 249.821 μs
center_of_mass 3.049 μs 101.816 μs
geometric_jacobian! 4.173 μs 142.554 μs
constraint_bias! 15.036 μs 1.368 ms
gravitational_potential_energy 3.034 μs 102.590 μs
inverse_dynamics! 6.885 μs 443.869 μs
constraint_jacobian! 33.388 μs 762.573 μs
mass_matrix! and geometric_jacobian! 7.480 μs 261.588 μs
momentum_rate_bias 6.851 μs 495.052 μs
kinetic_energy 4.850 μs 206.755 μs

Note simulate tree is excluded from the table above because on RBD.jl 2.3.1 it took too long to evaluate and provided only one sample for the benchmark.

Julia 1.7.3 with RBD.jl 2.3.0 produced the values which I think make the most sense (given Twan's previous benchmarks and Robin's comment above) ─ and so I am going to take those as my "ground truth" from now on.

Next, I focused on investigating changes to dependencies versions, while keeping Julia at v1.7.3 and RBD.jl at 2.3.1. The table below shows the results of those benchmarks. RBD.jl 2.3.1 with its default dependencies (1st results column) is the version where the performance has taken a hit, so what I tried was to downgrade some of its dependencies based on the image I shared above. I tried downgrading Rotations.jl (2nd results column), and then StaticArrays.jl as well (3rd results column).

Table II. Time (mean) per method, for RBD.jl 2.3.1 with different dependencies versions.

Julia 1.7.3 1.7.3 1.7.3 1.7.3 1.7.3
RBD.jl 2.3.1 2.3.1 2.3.1 2.3.1 2.3.1
Rotations.jl 1.4.0 1.3.4 1.3.0 1.2.0 1.2.0
StaticArrays.jl 1.5.24 1.5.24 1.5.24 1.5.24 0.12.6
dynamics_bias! 821.496 μs 845.163 μs 892.266 μs 7.421 μs 7.690 μs
dynamics! 664.098 μs 668.992 μs 732.807 μs 25.610 μs 19.421 μs
momentum 205.602 μs 207.841 μs 219.271 μs 4.732 μs 5.990 μs
mass_matrix! 258.167 μs 258.367 μs 288.175 μs 15.182 μs 7.858 μs
momentum_matrix! 249.821 μs 251.832 μs 265.107 μs 5.466 μs 6.053 μs
center_of_mass 101.816 μs 100.351 μs 105.490 μs 3.138 μs 3.085 μs
geometric_jacobian! 142.554 μs 144.602 μs 149.416 μs 4.184 μs 4.185 μs
constraint_bias! 1.368 ms 1.345 ms 1.299 ms 14.565 μs 15.719 μs
gravitational_potential_energy 102.590 μs 103.186 μs 105.668 μs 3.127 μs 3.063 μs
inverse_dynamics! 443.869 μs 447.642 μs 472.323 μs 6.909 μs 6.755 μs
constraint_jacobian! 762.573 μs 680.845 μs 691.560 μs 54.571 μs 34.138 μs
mass_matrix! and geometric_jacobian! 261.588 μs 261.593 μs 274.236 μs 12.337 μs 7.484 μs
momentum_rate_bias 495.052 μs 512.251 μs 522.313 μs 6.510 μs 6.504 μs
kinetic_energy 206.755 μs 208.422 μs 220.578 μs 4.957 μs 4.692 μs

Looking at the results in Table II, it seems the main performance hit is from Rotations.jl 1.21.4, but there is something happening also between StaticArrays.jl 0.12.61.5.24.

And that's it for now... 🙁 This is where I am at with my current investigation. I will try to dig a bit further tomorrow.

@ferrolho
Copy link
Contributor

ferrolho commented May 9, 2023

It is now clear to me that the performance hit was introduced with Rotations.jl 1.3.0, as the immediate previous version to that (1.2.0) produces results close to our "ground truth". (Let's set aside the smaller performance hit taken somewhere between StaticArrays.jl 0.12.61.5.24.)

Another dependency that changes when upgrading from Rotations.jl 1.2.0 to 1.3.0 is Quaternions.jl, from 0.4.9 to 0.5.7. However, I manually tested the combination of Rotations.jl 1.3.0 with Quaternions.jl 0.4.9, and the benchmark results were the same as the ones from using Rotations.jl 1.3.0 with Quaternions.jl 0.5.7 ─ meaning that the changes between those two versions of Quaternions.jl did not affect RBD.jl.

At this point, I went through the commits that make up Rotations.jl release 1.3.0. I found that the commit that introduced the "issue" was commit 345d0d9 ─ which is the merging of PR JuliaGeometry/Rotations.jl#219.

Using some intuition I tried to revert the changes to the RotMatrix constructor, and using that as a patch on top of version 1.3.0 ─ see ferrolho/Rotations.jl@9bea25b ─ allowed me to recover the good benchmark results from previous versions.

Table III. Time (mean) per method, for RBD.jl 2.3.1 and StaticArrays.jl 1.5.24, without and with a patched Rotations.jl 1.3.0.

Julia 1.7.3 1.7.3 1.8.5 1.7.3 1.8.5
RBD.jl 2.3.1 2.3.1 2.3.1 2.3.1 2.3.1
Rotations.jl 1.3.0 1.3.0 (patch) 1.3.0 (patch) 1.3.0 (patch) 1.3.0 (patch)
StaticArrays.jl 1.5.24 1.5.24 1.5.24 1.5.24 1.5.24
Number of Threads Single Single Single Multi Multi
dynamics_bias! 892.266 μs 7.860 μs 7.318 μs 7.679 μs 7.651 μs
dynamics! 732.807 μs 25.106 μs 24.347 μs 66.197 μs 24.443 μs
momentum 219.271 μs 4.694 μs 4.648 μs 4.817 μs 4.730 μs
mass_matrix! 288.175 μs 11.654 μs 12.146 μs 12.618 μs 12.280 μs
momentum_matrix! 265.107 μs 5.594 μs 5.356 μs 5.429 μs 5.731 μs
center_of_mass 105.490 μs 3.085 μs 3.147 μs 3.196 μs 3.214 μs
geometric_jacobian! 149.416 μs 4.286 μs 4.082 μs 4.027 μs 4.048 μs
constraint_bias! 1.299 ms 14.783 μs 15.263 μs 15.230 μs 15.238 μs
gravitational_potential_energy 105.668 μs 3.041 μs 3.977 μs 3.147 μs 3.392 μs
inverse_dynamics! 472.323 μs 6.660 μs 7.745 μs 7.352 μs 6.821 μs
constraint_jacobian! 691.560 μs 53.780 μs 54.215 μs 53.991 μs 51.407 μs
mass_matrix! and geometric_jacobian! 274.236 μs 13.455 μs 12.106 μs 12.284 μs 12.252 μs
momentum_rate_bias 522.313 μs 6.411 μs 6.292 μs 6.517 μs 6.409 μs
kinetic_energy 220.578 μs 4.702 μs 4.871 μs 4.781 μs 4.601 μs

Here's how to reproduce the results in the 1st timings column:

cd ~/git/RigidBodyDynamics.jl/perf                              # Navigate into the `perf` folder of RBD.jl
julia-1.7.3 --project -O3 --check-bounds=no                     # Start a Julia 1.7.3 REPL
] add ..#b7388e086d6cd1b8d38002d692de6d7acd0570b7               # Explicitly add RigidBodyDynamics.jl 2.3.1
] up                                                            # Update all the dependencies

include("runbenchmarks.jl")                                     # Run the benchmarks

And here is how to reproduce the results in the 2nd timings column:

cd ~/git/RigidBodyDynamics.jl/perf                              # Navigate into the `perf` folder of RBD.jl
julia-1.7.3 --project -O3 --check-bounds=no                     # Start a Julia 1.7.3 REPL
] add ..#b7388e086d6cd1b8d38002d692de6d7acd0570b7               # Explicitly add RigidBodyDynamics.jl 2.3.1
] up                                                            # Update all the dependencies

] add https://github.com/ferrolho/Rotations.jl#rbdjl-issue-500  # Add patch to Rotations.jl 1.3.0

include("runbenchmarks.jl")                                     # Run the benchmarks

As for the remaining columns, I collected them in a similar fashion but using Julia 1.8.5. As you can see, there is no difference in performance between single and multi-threaded BLAS in Julia 1.8.5 (in contrast to what we see for Julia 1.7.3) ─ so maybe we can close this issue?

As for the difference in performance due to Rotations.jl, it seems that on Julia 1.8.5 and with the latest version of RBD.jl (on master), we don't observe the close-to-1-ms times, so the patch that I've applied in this discovery has probably already been applied in one form or another in recent versions of that package.

So, having both those things sorted, the only outstanding issue I can see is with StaticArrays.jl, which on my Dell XPS 13 has increased the mean time of dynamics! in this benchmark from ~19 μs to ~24 μs.

@rdeits
Copy link
Collaborator

rdeits commented May 11, 2023

Wow, this is excellent. Thanks for the detailed investigation!

I'm a little confused about what you said here:

As for the difference in performance due to Rotations.jl, it seems that on Julia 1.8.5 and with the latest version of RBD.jl (on master), we don't observe the close-to-1-ms times, so the patch that I've applied in this discovery has probably already been applied in one form or another in recent versions of that package.

I wonder if that's just because the master version of RBD.jl is holding Rotations.jl back to some version prior to the bad 1.3.0 version? In any case, I think we should:

  1. Try to get your patch merged into Rotations.jl and get them to tag a new release
  2. Restrict RBD.jl to Rotations.jl < 1.3 until that happens.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants