BLAS.set_num_threads(1) drastically improves dynamics! performance for Atlas benchmark #500
Comments
With MKL instead of OpenBLAS this is not an issue. I suspect OpenMathLib/OpenBLAS#1622. |
Can you get the output of 'perf record "command"; perf report' for both cases? Just in case it is something other than _GEMM; the ones with fewer dimensions can be calibrated faster. Julia's |
@brada4, thank you for looking at this. Here's what I did (verbose, mostly just to make this reproducible for myself). I created a
using RigidBodyDynamics
using AtlasRobot
using Random
function run_benchmark(result, state, n)
    for i = 1 : n
        rand!(state)
        dynamics!(result, state)
    end
    return nothing
end
function main()
    mechanism = AtlasRobot.mechanism()
    state = MechanismState(mechanism)
    result = DynamicsResult(mechanism)
    @time run_benchmark(result, state, 1_000_000)
end
main()
Running this benchmark with
I then ran
sudo perf record julia --project -O3 --check-bounds=no test.jl # sudo to avoid 'Couldn't record kernel reference relocation symbol' warning
(after running
Output:
36.554262 seconds (12.29 M allocations: 557.993 MiB, 0.81% gc time)
[ perf record: Woken up 160 times to write data ]
[ perf record: Captured and wrote 43.549 MB perf.data (1141059 samples) ]
Top entries from
12.07% julia libopenblas64_.so [.] dtrsm_kernel_RN_HASWELL
10.85% julia [kernel.kallsyms] [k] syscall_return_via_sysret
4.57% julia libpthread-2.27.so [.] __pthread_mutex_lock
4.32% julia [kernel.kallsyms] [k] __schedule
3.81% julia libopenblas64_.so [.] dgemm_kernel_HASWELL
3.41% julia libopenblas64_.so [.] dtrsm_oltncopy_HASWELL
3.21% julia [kernel.kallsyms] [k] update_curr
2.68% julia [kernel.kallsyms] [k] entry_SYSCALL_64_after_hwframe
2.44% julia [kernel.kallsyms] [k] do_syscall_64
2.30% julia [kernel.kallsyms] [k] pick_next_task_fair
1.83% julia libc-2.27.so [.] __sched_yield
1.77% julia [kernel.kallsyms] [k] __calc_delta
1.41% julia libopenblas64_.so [.] blas_thread_server
1.40% julia [kernel.kallsyms] [k] yield_task_fair
1.35% julia [unknown] [k] 0xfffffe00000e201b
1.34% julia [unknown] [k] 0xfffffe000016601b
1.33% julia [unknown] [k] 0xfffffe000019201b
1.33% julia [unknown] [k] 0xfffffe000013a01b
1.25% julia [kernel.kallsyms] [k] sys_sched_yield
1.21% julia [kernel.kallsyms] [k] pick_next_entity
1.17% julia [unknown] [k] 0xfffffe000005e01b
1.15% julia [kernel.kallsyms] [k] rcu_note_context_switch
1.08% julia libpthread-2.27.so [.] __pthread_mutex_unlock
1.06% julia [unknown] [k] 0xfffffe000008a01b
1.04% julia [unknown] [k] 0xfffffe00000b601b
1.04% julia libopenblas64_.so [.] dgemm_itcopy_HASWELL
1.04% julia [kernel.kallsyms] [k] _raw_spin_lock
1.01% julia [kernel.kallsyms] [k] __indirect_thunk_start
0.93% julia [kernel.kallsyms] [k] native_sched_clock
0.89% julia [kernel.kallsyms] [k] cpuacct_charge
0.87% julia [kernel.kallsyms] [k] update_min_vruntime
0.84% julia [kernel.kallsyms] [k] entry_SYSCALL_64_stage2
0.76% julia [kernel.kallsyms] [k] schedule
0.75% julia [kernel.kallsyms] [k] update_rq_clock
0.75% julia [kernel.kallsyms] [k] cgroup_cpu_stat_updated
0.66% julia [kernel.kallsyms] [k] __cgroup_account_cputime
0.51% julia [kernel.kallsyms] [k] sched_clock_cpu
The results above are all without calling
4.92% julia libopenblas64_.so [.] dgemm_kernel_HASWELL
2.08% julia libjulia.so.1.0 [.] jl_gc_pool_alloc
1.84% julia libjulia.so.1.0 [.] forall_exists_subtype
1.83% julia libjulia.so.1.0 [.] obviously_disjoint.part.14
1.79% julia libjulia.so.1.0 [.] subtype
1.71% julia libjulia.so.1.0 [.] gc_sweep_pool
1.63% julia libjulia.so.1.0 [.] gc_mark_loop
1.52% julia libjulia.so.1.0 [.] jl_unwrap_unionall
1.40% julia libopenblas64_.so [.] dgemv_n_HASWELL
1.34% julia libopenblas64_.so [.] dtrsm_kernel_RN_HASWELL
1.18% julia libjulia.so.1.0 [.] jl_apply_generic
1.06% julia libjulia.so.1.0 [.] jl_f_getfield
1.01% julia libopenblas64_.so [.] dsyrk_kernel_L
0.96% julia libopenblas64_.so [.] dgemm_beta_HASWELL
0.92% julia libjulia.so.1.0 [.] jl_field_index
0.89% julia libopenblas64_.so [.] dgemm_otcopy_HASWELL
0.85% julia libjulia.so.1.0 [.] jl_alloc_array_1d
0.78% julia libopenblas64_.so [.] dtrsm_iltncopy_HASWELL
0.71% julia libc-2.27.so [.] cfree@GLIBC_2.2.5
0.69% julia libopenblas64_.so [.] dtrsm_ilnncopy_HASWELL
0.67% julia libopenblas64_.so [.] dpotf2_L
0.65% julia libc-2.27.so [.] __memmove_avx_unaligned_erms
0.65% julia sys.so [.] japi1_cache_lookup_658.clone_1.clone_2
0.59% julia libjulia.so.1.0 [.] has_free_typevars
0.58% julia libjulia.so.1.0 [.] jl_typemap_entry_assoc_exact
0.56% julia libjulia.so.1.0 [.] lookup_type_idx.isra.12
0.56% julia libopenblas64_.so [.] dtrsm_RN_solve_opt
0.55% julia libopenblas64_.so [.] dgemm_itcopy_HASWELL
0.55% julia libjulia.so.1.0 [.] forall_exists_equal.part.23
0.54% julia libjulia.so.1.0 [.] typekey_compare.part.11
0.53% julia libopenblas64_.so [.] dtrsm_kernel_LN_HASWELL
0.50% julia libopenblas64_.so [.] dot_compute
Julia version info:
julia> using InteractiveUtils; versioninfo()
Julia Version 1.0.3
Commit 099e826241 (2018-12-18 01:34 UTC)
Platform Info:
OS: Linux (x86_64-pc-linux-gnu)
CPU: Intel(R) Core(TM) i7-6950X CPU @ 3.00GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-6.0.0 (ORCJIT, broadwell)
julia> using LinearAlgebra; BLAS.openblas_get_config()
"USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell MAX_THREADS=16"
This is on an i7-6950X CPU @ 3.00GHz (10 physical cores, 20 hyperthreaded). Let me know if you need the |
There is no need for perf.data; it is nearly impossible to decode on an even slightly different system. The problem does not look very specific to your setup; it is just that 10/20 cores dramatize the effects of excess parallelism better than more common CPUs do. |
OK, thanks, I'll track that issue. |
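The profile above shows time going into scheduler and mutex calls alongside the dtrsm and dgemm kernels, which fits the excess-parallelism explanation. A minimal sketch for checking this without RigidBodyDynamics follows, assuming Julia 1.7 or later (where BLAS.get_num_threads() returns an integer). The helper name time_small_solves and the matrix size 36 are illustrative choices, not taken from the benchmark, and the timings will vary by machine and BLAS build.

```julia
using LinearAlgebra

# Hypothetical helper: time `n` Cholesky factorizations and solves of a small
# dim x dim system. Both operations call into BLAS/LAPACK.
function time_small_solves(n::Integer, dim::Integer)
    A = rand(dim, dim)
    M = Symmetric(A' * A + dim * I)   # symmetric positive definite by construction
    B = rand(dim, dim)
    t = @elapsed begin
        for _ in 1:n
            F = cholesky(M)
            F \ B
        end
    end
    return t
end

original = BLAS.get_num_threads()      # remember the current global setting
time_small_solves(10, 36)              # warm-up run so compilation is not timed

t_default = time_small_solves(10_000, 36)
BLAS.set_num_threads(1)
t_single = time_small_solves(10_000, 36)
BLAS.set_num_threads(original)         # restore the global setting

println("default (", original, " threads): ", t_default, " s")
println("single thread: ", t_single, " s")
```

If the single-threaded run is clearly faster, the slowdown is likely threading overhead on small matrices rather than anything specific to dynamics!.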
Hello! I am wondering if this is still an issue? Here are my results on an Apple MacBook Air (M1, 2020):
--- results-1-thread.txt 2023-05-06 12:49:34
+++ results-4-threads.txt 2023-05-06 12:49:22
@@ -1,105 +1,105 @@
dynamics_bias!:
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
- Range (min … max): 3.054 μs … 8.408 μs ┊ GC (min … max): 0.00% … 0.00%
- Time (median): 3.129 μs ┊ GC (median): 0.00%
- Time (mean ± σ): 3.145 μs ± 107.831 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
+ Range (min … max): 3.054 μs … 8.871 μs ┊ GC (min … max): 0.00% … 0.00%
+ Time (median): 3.112 μs ┊ GC (median): 0.00%
+ Time (mean ± σ): 3.124 μs ± 120.414 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
Memory estimate: 0 bytes, allocs estimate: 0.
dynamics!:
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
- Range (min … max): 10.492 μs … 262.429 μs ┊ GC (min … max): 0.00% … 0.00%
- Time (median): 10.608 μs ┊ GC (median): 0.00%
- Time (mean ± σ): 10.710 μs ± 2.535 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
+ Range (min … max): 10.508 μs … 26.962 μs ┊ GC (min … max): 0.00% … 0.00%
+ Time (median): 10.600 μs ┊ GC (median): 0.00%
+ Time (mean ± σ): 10.612 μs ± 181.451 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
Memory estimate: 0 bytes, allocs estimate: 0.
momentum:
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
- Range (min … max): 1.771 μs … 4.638 μs ┊ GC (min … max): 0.00% … 0.00%
- Time (median): 1.821 μs ┊ GC (median): 0.00%
- Time (mean ± σ): 1.823 μs ± 57.144 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
+ Range (min … max): 1.767 μs … 4.458 μs ┊ GC (min … max): 0.00% … 0.00%
+ Time (median): 1.850 μs ┊ GC (median): 0.00%
+ Time (mean ± σ): 1.850 μs ± 60.572 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
Memory estimate: 0 bytes, allocs estimate: 0.
mass_matrix!:
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
- Range (min … max): 3.800 μs … 12.929 μs ┊ GC (min … max): 0.00% … 0.00%
- Time (median): 3.946 μs ┊ GC (median): 0.00%
- Time (mean ± σ): 3.971 μs ± 210.354 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
+ Range (min … max): 3.808 μs … 6.621 μs ┊ GC (min … max): 0.00% … 0.00%
+ Time (median): 3.900 μs ┊ GC (median): 0.00%
+ Time (mean ± σ): 3.933 μs ± 132.276 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
Memory estimate: 0 bytes, allocs estimate: 0.
momentum_matrix!:
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
- Range (min … max): 2.112 μs … 11.562 μs ┊ GC (min … max): 0.00% … 0.00%
- Time (median): 2.188 μs ┊ GC (median): 0.00%
- Time (mean ± σ): 2.190 μs ± 114.719 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
+ Range (min … max): 2.104 μs … 4.821 μs ┊ GC (min … max): 0.00% … 0.00%
+ Time (median): 2.167 μs ┊ GC (median): 0.00%
+ Time (mean ± σ): 2.170 μs ± 59.499 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
Memory estimate: 0 bytes, allocs estimate: 0.
center_of_mass:
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
- Range (min … max): 1.171 μs … 4.283 μs ┊ GC (min … max): 0.00% … 0.00%
- Time (median): 1.221 μs ┊ GC (median): 0.00%
- Time (mean ± σ): 1.223 μs ± 65.128 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
+ Range (min … max): 1.175 μs … 4.196 μs ┊ GC (min … max): 0.00% … 0.00%
+ Time (median): 1.225 μs ┊ GC (median): 0.00%
+ Time (mean ± σ): 1.228 μs ± 57.789 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
Memory estimate: 0 bytes, allocs estimate: 0.
geometric_jacobian!:
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
- Range (min … max): 1.600 μs … 4.229 μs ┊ GC (min … max): 0.00% … 0.00%
- Time (median): 1.650 μs ┊ GC (median): 0.00%
- Time (mean ± σ): 1.656 μs ± 99.179 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
+ Range (min … max): 1.583 μs … 3.050 μs ┊ GC (min … max): 0.00% … 0.00%
+ Time (median): 1.637 μs ┊ GC (median): 0.00%
+ Time (mean ± σ): 1.641 μs ± 37.095 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
Memory estimate: 0 bytes, allocs estimate: 0.
constraint_bias!:
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
- Range (min … max): 5.871 μs … 14.075 μs ┊ GC (min … max): 0.00% … 0.00%
- Time (median): 5.967 μs ┊ GC (median): 0.00%
- Time (mean ± σ): 6.060 μs ± 253.963 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
+ Range (min … max): 5.921 μs … 762.712 μs ┊ GC (min … max): 0.00% … 98.98%
+ Time (median): 6.017 μs ┊ GC (median): 0.00%
+ Time (mean ± σ): 6.172 μs ± 7.569 μs ┊ GC (mean ± σ): 1.22% ± 0.99%
Memory estimate: 496 bytes, allocs estimate: 1.
gravitational_potential_energy:
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
- Range (min … max): 1.183 μs … 2.342 μs ┊ GC (min … max): 0.00% … 0.00%
- Time (median): 1.242 μs ┊ GC (median): 0.00%
- Time (mean ± σ): 1.241 μs ± 22.244 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
+ Range (min … max): 1.175 μs … 3.083 μs ┊ GC (min … max): 0.00% … 0.00%
+ Time (median): 1.238 μs ┊ GC (median): 0.00%
+ Time (mean ± σ): 1.240 μs ± 56.878 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
Memory estimate: 0 bytes, allocs estimate: 0.
inverse_dynamics!:
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
- Range (min … max): 2.683 μs … 13.863 μs ┊ GC (min … max): 0.00% … 0.00%
- Time (median): 2.771 μs ┊ GC (median): 0.00%
- Time (mean ± σ): 2.788 μs ± 162.449 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
+ Range (min … max): 2.683 μs … 5.279 μs ┊ GC (min … max): 0.00% … 0.00%
+ Time (median): 2.746 μs ┊ GC (median): 0.00%
+ Time (mean ± σ): 2.754 μs ± 96.378 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
Memory estimate: 0 bytes, allocs estimate: 0.
simulate tree:
BenchmarkTools.Trial: 9 samples with 10 evaluations.
- Range (min … max): 49.602 ms … 50.468 ms ┊ GC (min … max): 0.00% … 0.00%
- Time (median): 49.889 ms ┊ GC (median): 0.00%
- Time (mean ± σ): 49.976 ms ± 284.056 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
+ Range (min … max): 49.606 ms … 52.434 ms ┊ GC (min … max): 0.00% … 0.00%
+ Time (median): 50.114 ms ┊ GC (median): 0.00%
+ Time (mean ± σ): 50.338 ms ± 820.355 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
Memory estimate: 0 bytes, allocs estimate: 0.
constraint_jacobian!:
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
- Range (min … max): 14.696 μs … 832.163 μs ┊ GC (min … max): 0.00% … 98.02%
- Time (median): 15.104 μs ┊ GC (median): 0.00%
- Time (mean ± σ): 15.251 μs ± 8.177 μs ┊ GC (mean ± σ): 0.53% ± 0.98%
+ Range (min … max): 14.762 μs … 8.446 ms ┊ GC (min … max): 0.00% … 99.78%
+ Time (median): 15.196 μs ┊ GC (median): 0.00%
+ Time (mean ± σ): 17.100 μs ± 88.417 μs ┊ GC (mean ± σ): 4.93% ± 1.00%
Memory estimate: 496 bytes, allocs estimate: 1.
mass_matrix! and geometric_jacobian!:
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
- Range (min … max): 4.071 μs … 7.154 μs ┊ GC (min … max): 0.00% … 0.00%
- Time (median): 4.188 μs ┊ GC (median): 0.00%
- Time (mean ± σ): 4.207 μs ± 96.458 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
+ Range (min … max): 4.050 μs … 17.783 μs ┊ GC (min … max): 0.00% … 0.00%
+ Time (median): 4.225 μs ┊ GC (median): 0.00%
+ Time (mean ± σ): 4.273 μs ± 343.992 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
Memory estimate: 0 bytes, allocs estimate: 0.
momentum_rate_bias:
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
- Range (min … max): 2.554 μs … 5.183 μs ┊ GC (min … max): 0.00% … 0.00%
- Time (median): 2.617 μs ┊ GC (median): 0.00%
- Time (mean ± σ): 2.624 μs ± 105.863 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
+ Range (min … max): 2.575 μs … 5.446 μs ┊ GC (min … max): 0.00% … 0.00%
+ Time (median): 2.654 μs ┊ GC (median): 0.00%
+ Time (mean ± σ): 2.670 μs ± 101.424 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
Memory estimate: 0 bytes, allocs estimate: 0.
kinetic_energy:
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
- Range (min … max): 1.758 μs … 3.658 μs ┊ GC (min … max): 0.00% … 0.00%
- Time (median): 1.817 μs ┊ GC (median): 0.00%
- Time (mean ± σ): 1.818 μs ± 44.967 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
+ Range (min … max): 1.771 μs … 4.075 μs ┊ GC (min … max): 0.00% … 0.00%
+ Time (median): 1.854 μs ┊ GC (median): 0.00%
+ Time (mean ± σ): 1.855 μs ± 64.257 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
Memory estimate: 0 bytes, allocs estimate: 0.
Here's the output of
julia> versioninfo()
Julia Version 1.8.5
Commit 17cfb8e65e* (2023-01-08 06:45 UTC)
Platform Info:
OS: macOS (arm64-apple-darwin22.1.0)
CPU: 8 × Apple M1
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-13.0.1 (ORCJIT, apple-m1)
Threads: 1 on 4 virtual cores
And here is how I ran the benchmarks:
I ran this once with the contents of |
On my old (but trustworthy) Dell XPS 13 9343 (2015) using Julia 1.8.5, I also cannot observe any noticeable differences in performance:
--- results-single-thread.txt 2023-05-06 21:07:21.302449867 +0100
+++ results-multi-thread.txt 2023-05-06 21:07:26.526504507 +0100
@@ -1,59 +1,59 @@
dynamics_bias!:
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
- Time (mean ± σ): 7.248 μs ± 454.342 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
+ Time (mean ± σ): 7.252 μs ± 295.422 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
dynamics!:
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
- Time (mean ± σ): 24.448 μs ± 1.159 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
+ Time (mean ± σ): 24.113 μs ± 1.609 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
momentum:
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
- Time (mean ± σ): 4.582 μs ± 435.294 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
+ Time (mean ± σ): 4.694 μs ± 276.963 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
mass_matrix!:
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
- Time (mean ± σ): 12.483 μs ± 553.176 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
+ Time (mean ± σ): 11.935 μs ± 523.177 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
momentum_matrix!:
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
- Time (mean ± σ): 5.256 μs ± 250.141 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
+ Time (mean ± σ): 5.262 μs ± 517.004 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
center_of_mass:
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
- Time (mean ± σ): 3.007 μs ± 280.680 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
+ Time (mean ± σ): 3.013 μs ± 222.686 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
geometric_jacobian!:
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
- Time (mean ± σ): 3.950 μs ± 198.062 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
+ Time (mean ± σ): 4.064 μs ± 493.502 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
constraint_bias!:
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
- Time (mean ± σ): 14.426 μs ± 17.260 μs ┊ GC (mean ± σ): 1.19% ± 0.99%
+ Time (mean ± σ): 14.329 μs ± 16.452 μs ┊ GC (mean ± σ): 1.14% ± 0.99%
gravitational_potential_energy:
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
- Time (mean ± σ): 3.001 μs ± 273.534 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
+ Time (mean ± σ): 3.047 μs ± 243.480 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
inverse_dynamics!:
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
- Time (mean ± σ): 6.627 μs ± 410.959 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
+ Time (mean ± σ): 6.737 μs ± 501.527 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
simulate tree:
BenchmarkTools.Trial: 4 samples with 10 evaluations.
- Time (mean ± σ): 116.726 ms ± 766.823 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
+ Time (mean ± σ): 114.619 ms ± 1.122 ms ┊ GC (mean ± σ): 0.00% ± 0.00%
constraint_jacobian!:
-BenchmarkTools.Trial: 9410 samples with 10 evaluations.
- Time (mean ± σ): 50.597 μs ± 19.226 μs ┊ GC (mean ± σ): 0.39% ± 1.00%
+BenchmarkTools.Trial: 9236 samples with 10 evaluations.
+ Time (mean ± σ): 51.511 μs ± 25.033 μs ┊ GC (mean ± σ): 0.00% ± 0.00%
mass_matrix! and geometric_jacobian!:
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
- Time (mean ± σ): 12.391 μs ± 452.979 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
+ Time (mean ± σ): 12.118 μs ± 715.884 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
momentum_rate_bias:
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
- Time (mean ± σ): 6.308 μs ± 521.010 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
+ Time (mean ± σ): 6.290 μs ± 357.568 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
kinetic_energy:
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
- Time (mean ± σ): 4.585 μs ± 357.360 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
+ Time (mean ± σ): 4.571 μs ± 251.657 ns ┊ GC (mean ± σ): 0.00% ± 0.00%
Here's the output of
julia> versioninfo()
Julia Version 1.8.5
Commit 17cfb8e65ea (2023-01-08 06:45 UTC)
Platform Info:
OS: Linux (x86_64-linux-gnu)
CPU: 4 × Intel(R) Core(TM) i7-5500U CPU @ 2.40GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-13.0.1 (ORCJIT, broadwell)
Threads: 1 on 4 virtual cores |
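To put "no noticeable differences" into numbers, here is a small sketch that computes the relative change of the mean times for a few methods. The values are copied from the Dell XPS 13 diff above (the "-" lines are single-threaded, the "+" lines multi-threaded). The name relative_change is just an illustrative helper.

```julia
# Mean times in μs from the diff above: (single-threaded, multi-threaded).
means = Dict(
    "dynamics_bias!"       => (7.248, 7.252),
    "dynamics!"            => (24.448, 24.113),
    "mass_matrix!"         => (12.483, 11.935),
    "constraint_jacobian!" => (50.597, 51.511),
)

# Relative change of the multi-threaded mean with respect to the
# single-threaded mean, in percent.
relative_change(single, multi) = 100 * (multi - single) / single

for (name, (single, multi)) in sort(collect(means))
    println(rpad(name, 22), round(relative_change(single, multi), digits = 2), " %")
end
```

All four changes come out below 5% in magnitude, which is of the same order as the standard deviations reported for these benchmarks.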
I did some further investigation and I've found that there was a massive leap in performance introduced by Julia version 1.8. The plots below should demonstrate my findings a bit better. The benchmark data and the script to generate the plot above can be found in this GitHub Gist. |
Interesting. I'd be excited to see things get that much faster from a newer Julia version, but I wonder if something else is going on here. As an example, the pre-1.8 results for
That's much more in line with what you saw in 1.8 and later, and it's more like what I would have expected to begin with. |
You're right! There's something else going on... I guess I got a bit too carried away with the idea of having a big improvement from a major Julia version upgrade. Let's focus on just using Julia 1.7.3, otherwise there are too many changing environment factors. I've been trying to identify where the performance is taking a hit, and I've found that, using Julia v1.7.3, the degradation takes place somewhere between RBD.jl versions 2.3.0 and 2.3.1, although it may not necessarily be a problem with RBD.jl itself (it may be one of its dependencies).
So, here is what I did step-by-step on my Dell XPS 13:
~/git/RigidBodyDynamics.jl/perf # Navigate into the `perf` folder of RBD.jl
julia-1.7.3 --project -O3 --check-bounds=no # Start a Julia 1.7.3 REPL
] add ..#b3a632064f1e2b731c5bb532201dcdb7a3567ea5 # Explicitly add RigidBodyDynamics v2.3.0
] up # Update all the dependencies
CTRL + D # Stop the Julia REPL session for the changes to deps versions to take effect
julia-1.7.3 --project -O3 --check-bounds=no # Start a Julia 1.7.3 REPL
include("runbenchmarks.jl") # Run the benchmarks
Then I did the same for RBD.jl 2.3.1:
~/git/RigidBodyDynamics.jl/perf # Navigate into the `perf` folder of RBD.jl
julia-1.7.3 --project -O3 --check-bounds=no # Start a Julia 1.7.3 REPL
] add ..#b7388e086d6cd1b8d38002d692de6d7acd0570b7 # Explicitly add RigidBodyDynamics v2.3.1
] up # Update all the dependencies
CTRL + D # Stop the Julia REPL session for the changes to deps versions to take effect
julia-1.7.3 --project -O3 --check-bounds=no # Start a Julia 1.7.3 REPL
include("runbenchmarks.jl") # Run the benchmarks
The dependencies' version changes look like this:
Here are the results for those two different versions:
Table I. Time (mean) per method, for RBD.jl 2.3.0 vs 2.3.1.
Julia 1.7.3 with RBD.jl 2.3.0 produced the values which I think make the most sense (given Twan's previous benchmarks and Robin's comment above), so I am going to take those as my "ground truth" from now on. Next, I focused on investigating changes to dependency versions, while keeping Julia at v1.7.3 and RBD.jl at 2.3.1. The table below shows the results of those benchmarks. RBD.jl 2.3.1 with its default dependencies (1st results column) is the version where the performance has taken a hit, so I tried downgrading some of its dependencies based on the image I shared above. I tried downgrading Rotations.jl (2nd results column), and then StaticArrays.jl as well (3rd results column).
Table II. Time (mean) per method, for RBD.jl 2.3.1 with different dependency versions.
Looking at the results in Table II, it seems the main performance hit is from Rotations.jl And that's it for now... 🙁 This is where I am at with my current investigation. I will try to dig a bit further tomorrow. |
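The thread does not show the exact commands used to downgrade the dependencies. One way to do it from a script rather than the Pkg REPL is sketched below, using the Pkg API. The helper name pin_versions! and the version string are placeholders for illustration, not the versions used in the tables. The call that modifies the project is left commented out because it needs network access and changes the active environment.

```julia
using Pkg

# Hypothetical helper: install the requested versions in the active project,
# then pin those packages so a later `up` does not move them.
function pin_versions!(specs::Vector{Pkg.PackageSpec})
    Pkg.add(specs)
    Pkg.pin([s.name for s in specs])   # pins at the versions just installed
    return nothing
end

# Placeholder version; substitute the version actually under test.
specs = [Pkg.PackageSpec(name = "Rotations", version = "1.2")]

# pin_versions!(specs)   # uncomment to modify the active project
# Pkg.status()           # shows which versions ended up installed
```

Pinning matters here because the steps above run `] up`, which would otherwise move an unpinned dependency back to its latest compatible version.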
It is now clear to me that the performance hit was introduced with Rotations.jl Another dependency that changes when upgrading from Rotations.jl At this point, I went through the commits that make up Rotations.jl release Using some intuition I tried to revert the changes to the
Table III. Time (mean) per method, for RBD.jl 2.3.1 and StaticArrays.jl 1.5.24, without and with a patched Rotations.jl 1.3.0.
Here's how to reproduce the results in the 1st timings column:
And here is how to reproduce the results in the 2nd timings column:
As for the remaining columns, I collected them in a similar fashion but using Julia 1.8.5. As you can see, there is no difference in performance between single and multi-threaded BLAS in Julia 1.8.5 (in contrast to what we see for Julia 1.7.3) ─ so maybe we can close this issue? As for the difference in performance due to Rotations.jl, it seems that on Julia 1.8.5 and with the latest version of RBD.jl (on master), we don't observe the close-to-1-ms times, so the patch that I've applied in this discovery has probably already been applied in one form or another in recent versions of that package. So, having both those things sorted, the only outstanding issue I can see is with StaticArrays.jl, which on my Dell XPS 13 has increased the mean time of dynamics! in this benchmark from ~19 μs to ~24 μs. |
Wow, this is excellent. Thanks for the detailed investigation! I'm a little confused about what you said here:
I wonder if that's just because the master version of RBD.jl is holding Rotations.jl back to some version prior to the bad 1.3.0 version? In any case, I think we should:
|
Found this out this week. Baseline on current master:
(a bit slower than http://www.juliarobotics.org/RigidBodyDynamics.jl/latest/benchmarks.html for some reason, by the way)
After using LinearAlgebra; BLAS.set_num_threads(1):
So multiple BLAS threads aren't doing us any favors (with the default OpenBLAS anyway), at least for fewer-than-Atlas degrees of freedom.
Not sure how to act on this since this is a global setting. At least BLAS.set_num_threads is fast now (about 4 ns, no allocations), so it could be called before and after dynamics_solve!. But is that thread safe? And how do we get the number of BLAS threads before calling BLAS.set_num_threads(1) so that it can be reset to the original value? Maybe it should just be a performance tip?
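On the question of reading the current thread count: Julia 1.6 and later provide BLAS.get_num_threads(), so a save-and-restore wrapper can be sketched as below. This is only a sketch, assuming Julia 1.7 or later (where the call returns an integer); with_blas_threads is a hypothetical helper name, not part of LinearAlgebra or RigidBodyDynamics. It does not answer the thread-safety question: the setting stays global, so other tasks calling BLAS at the same time would also see the temporary value.

```julia
using LinearAlgebra

# Hypothetical helper: run `f` with `n` BLAS threads, then restore the
# previous global setting even if `f` throws.
function with_blas_threads(f, n::Integer)
    previous = BLAS.get_num_threads()
    BLAS.set_num_threads(n)
    try
        return f()
    finally
        BLAS.set_num_threads(previous)
    end
end

# Usage: any BLAS-backed work can go inside the do-block.
A = rand(50, 50)
product = with_blas_threads(1) do
    A * A
end
```

Since the setting is global and the open questions above remain, documenting BLAS.set_num_threads(1) as a performance tip may still be the simpler option.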