-
Notifications
You must be signed in to change notification settings - Fork 104
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
adapt auxiliary math functions for Float32
#2048
Conversation
Review checklistThis checklist is meant to assist creators of PRs (to let them know what reviewers will typically look for) and reviewers (to guide them in a structured review process). Items do not need to be checked explicitly for a PR to be eligible for merging. Purpose and scope
Code quality
Documentation
Testing
Performance
Verification
Created with ❤️ by the Trixi.jl community. |
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #2048 +/- ##
=======================================
Coverage 96.30% 96.30%
=======================================
Files 467 467
Lines 37256 37263 +7
=======================================
+ Hits 35879 35886 +7
Misses 1377 1377
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
julia> redirect_stdio(stdout = devnull, stderr = devnull) do
trixi_include("examples/p4est_2d_dgsem/elixir_euler_shockcapturing_ec_float32.jl", tspan=(0.0f0, 1.0f0), surface_flux = flux_central, volume_flux = flux_central, volume_flux_dg = flux_central, volume_flux_fv = flux_central)
end
julia> function doit!(du_ode, u_ode, semi, n)
for _ in 1:n
Trixi.rhs!(du_ode, u_ode, semi, zero(eltype(u_ode)));
end
end
doit! (generic function with 1 method)
julia> u_ode = copy(sol.u[end]); du_ode = similar(u_ode); u = Trixi.wrap_array(u_ode, semi); du = Trixi.wrap_array(du_ode, semi);
julia> doit!(du_ode, u_ode, semi, 1) # compile
julia> @perfmon "FLOPS_DP" doit!(du_ode, u_ode, semi, 10^3);
Group: FLOPS_DP
┌──────────────────────────────────────────┬───────────┐
│ Event │ Thread 1 │
├──────────────────────────────────────────┼───────────┤
│ INSTR_RETIRED_ANY │ 7.66774e9 │
│ CPU_CLK_UNHALTED_CORE │ 1.9519e9 │
│ CPU_CLK_UNHALTED_REF │ 1.30181e9 │
│ TOPDOWN_SLOTS │ 9.7594e9 │
│ FP_ARITH_INST_RETIRED_128B_PACKED_DOUBLE │ 0.0 │
│ FP_ARITH_INST_RETIRED_SCALAR_DOUBLE │ 0.0 │
│ FP_ARITH_INST_RETIRED_256B_PACKED_DOUBLE │ 0.0 │
│ FP_ARITH_INST_RETIRED_512B_PACKED_DOUBLE │ 0.0 │
└──────────────────────────────────────────┴───────────┘
┌─────────────────────────┬──────────┐
│ Metric │ Thread 1 │
├─────────────────────────┼──────────┤
│ Runtime (RDTSC) [s] │ 0.593699 │
│ Runtime unhalted [s] │ 0.88932 │
│ Clock [MHz] │ 3290.88 │
│ CPI │ 0.25456 │
│ DP [MFLOP/s] │ 0.0 │
│ AVX DP [MFLOP/s] │ 0.0 │
│ AVX512 DP [MFLOP/s] │ 0.0 │
│ Packed [MUOPS/s] │ 0.0 │
│ Scalar [MUOPS/s] │ 0.0 │
│ Vectorization ratio [%] │ NaN │
└─────────────────────────┴──────────┘ |
We get julia> redirect_stdio(stdout = devnull, stderr = devnull) do
trixi_include("examples/p4est_2d_dgsem/elixir_euler_shockcapturing_ec_float32.jl", tspan=(0.0f0, 1.0f0))
end
julia> @perfmon "FLOPS_DP" doit!(du_ode, u_ode, semi, 10^3);
Group: FLOPS_DP
┌──────────────────────────────────────────┬────────────┐
│ Event │ Thread 1 │
├──────────────────────────────────────────┼────────────┤
│ INSTR_RETIRED_ANY │ 1.10631e10 │
│ CPU_CLK_UNHALTED_CORE │ 4.44565e9 │
│ CPU_CLK_UNHALTED_REF │ 2.94592e9 │
│ TOPDOWN_SLOTS │ 2.22278e10 │
│ FP_ARITH_INST_RETIRED_128B_PACKED_DOUBLE │ 0.0 │
│ FP_ARITH_INST_RETIRED_SCALAR_DOUBLE │ 2.71716e8 │
│ FP_ARITH_INST_RETIRED_256B_PACKED_DOUBLE │ 0.0 │
│ FP_ARITH_INST_RETIRED_512B_PACKED_DOUBLE │ 0.0 │
└──────────────────────────────────────────┴────────────┘
┌─────────────────────────┬──────────┐
│ Metric │ Thread 1 │
├─────────────────────────┼──────────┤
│ Runtime (RDTSC) [s] │ 1.34572 │
│ Runtime unhalted [s] │ 2.02552 │
│ Clock [MHz] │ 3312.18 │
│ CPI │ 0.401847 │
│ DP [MFLOP/s] │ 201.911 │
│ AVX DP [MFLOP/s] │ 0.0 │
│ AVX512 DP [MFLOP/s] │ 0.0 │
│ Packed [MUOPS/s] │ 0.0 │
│ Scalar [MUOPS/s] │ 201.911 │
│ Vectorization ratio [%] │ 0.0 │
└─────────────────────────┴──────────┘ However, GPU computing will use specialized implementations as far as I understand, so this should not be a problem. As far as I can tell, the shell> git diff
diff --git a/src/auxiliary/math.jl b/src/auxiliary/math.jl
index fa816da9a..81b1c46d1 100644
--- a/src/auxiliary/math.jl
+++ b/src/auxiliary/math.jl
@@ -176,15 +176,15 @@ Given ε = 1.0e-4, we use the following algorithm.
@inline function ln_mean(x::RealT, y::RealT) where {RealT <: Real}
epsilon_f2 = convert(RealT, 1.0e-4)
f2 = (x * (x - 2 * y) + y * y) / (x * (x + 2 * y) + y * y) # f2 = f^2
- if f2 < epsilon_f2
+ # if f2 < epsilon_f2
return (x + y) / @evalpoly(f2,
2,
convert(RealT, 2 / 3),
convert(RealT, 2 / 5),
convert(RealT, 2 / 7))
- else
- return (y - x) / log(y / x)
- end
+ # else
+ # return (y - x) / log(y / x)
+ # end
end
"""
@@ -202,15 +202,15 @@ multiplication.
@inline function inv_ln_mean(x::RealT, y::RealT) where {RealT <: Real}
epsilon_f2 = convert(RealT, 1.0e-4)
f2 = (x * (x - 2 * y) + y * y) / (x * (x + 2 * y) + y * y) # f2 = f^2
- if f2 < epsilon_f2
+ # if f2 < epsilon_f2
return @evalpoly(f2,
2,
convert(RealT, 2 / 3),
convert(RealT, 2 / 5),
convert(RealT, 2 / 7)) / (x + y)
- else
- return log(y / x) / (y - x)
- end
+ # else
+ # return log(y / x) / (y - x)
+ # end
end I get julia> @perfmon "FLOPS_DP" doit!(du_ode, u_ode, semi, 10^3);
Group: FLOPS_DP
┌──────────────────────────────────────────┬────────────┐
│ Event │ Thread 1 │
├──────────────────────────────────────────┼────────────┤
│ INSTR_RETIRED_ANY │ 9.45534e9 │
│ CPU_CLK_UNHALTED_CORE │ 3.54369e9 │
│ CPU_CLK_UNHALTED_REF │ 2.36548e9 │
│ TOPDOWN_SLOTS │ 1.77174e10 │
│ FP_ARITH_INST_RETIRED_128B_PACKED_DOUBLE │ 0.0 │
│ FP_ARITH_INST_RETIRED_SCALAR_DOUBLE │ 0.0 │
│ FP_ARITH_INST_RETIRED_256B_PACKED_DOUBLE │ 0.0 │
│ FP_ARITH_INST_RETIRED_512B_PACKED_DOUBLE │ 0.0 │
└──────────────────────────────────────────┴────────────┘
┌─────────────────────────┬──────────┐
│ Metric │ Thread 1 │
├─────────────────────────┼──────────┤
│ Runtime (RDTSC) [s] │ 1.07998 │
│ Runtime unhalted [s] │ 1.61457 │
│ Clock [MHz] │ 3288.04 │
│ CPI │ 0.374782 │
│ DP [MFLOP/s] │ 0.0 │
│ AVX DP [MFLOP/s] │ 0.0 │
│ AVX512 DP [MFLOP/s] │ 0.0 │
│ Packed [MUOPS/s] │ 0.0 │
│ Scalar [MUOPS/s] │ 0.0 │
│ Vectorization ratio [%] │ NaN │
└─────────────────────────┴──────────┘ |
Good!
I have some questions. Are you using the default setting, i.e., Trixi.jl/src/auxiliary/math.jl Lines 119 to 123 in 20ab97b
I'm not sure whether I understand correctly and I'm just curious @ranocha |
Based on my observations, I would say yes - but I don't have a link to the source code for LLVM. Here are some results: julia> using LIKWID
julia> @inline llvm_log(x::Float64) = ccall("llvm.log.f64", llvmcall, Float64, (Float64,), x)
llvm_log (generic function with 1 method)
julia> @inline llvm_log(x::Float32) = ccall("llvm.log.f32", llvmcall, Float32, (Float32,), x)
llvm_log (generic function with 2 methods)
julia> @inline llvm_log(x::Float16) = ccall("llvm.log.f16", llvmcall, Float16, (Float16,), x)
llvm_log (generic function with 3 methods)
julia> @inline llvm_log(x::Real) = x < zero(x) ? oftype(x, NaN) : Base.log(x)
llvm_log (generic function with 4 methods)
julia> x = abs.(1.0f3 * randn(Float32, 10^3));
julia> sum(Base.log, x)
6244.008f0
julia> sum(llvm_log, x)
6244.011f0
julia> @perfmon "FLOPS_DP" sum(Base.log, x);
Group: FLOPS_DP
┌──────────────────────────────────────────┬──────────┐
│ Event │ Thread 1 │
├──────────────────────────────────────────┼──────────┤
│ INSTR_RETIRED_ANY │ 72873.0 │
│ CPU_CLK_UNHALTED_CORE │ 57160.0 │
│ CPU_CLK_UNHALTED_REF │ 57288.0 │
│ TOPDOWN_SLOTS │ 285800.0 │
│ FP_ARITH_INST_RETIRED_128B_PACKED_DOUBLE │ 0.0 │
│ FP_ARITH_INST_RETIRED_SCALAR_DOUBLE │ 3000.0 │
│ FP_ARITH_INST_RETIRED_256B_PACKED_DOUBLE │ 0.0 │
│ FP_ARITH_INST_RETIRED_512B_PACKED_DOUBLE │ 0.0 │
└──────────────────────────────────────────┴──────────┘
┌─────────────────────────┬────────────┐
│ Metric │ Thread 1 │
├─────────────────────────┼────────────┤
│ Runtime (RDTSC) [s] │ 3.00598e-5 │
│ Runtime unhalted [s] │ 2.6043e-5 │
│ Clock [MHz] │ 2189.92 │
│ CPI │ 0.784378 │
│ DP [MFLOP/s] │ 99.8012 │
│ AVX DP [MFLOP/s] │ 0.0 │
│ AVX512 DP [MFLOP/s] │ 0.0 │
│ Packed [MUOPS/s] │ 0.0 │
│ Scalar [MUOPS/s] │ 99.8012 │
│ Vectorization ratio [%] │ 0.0 │
└─────────────────────────┴────────────┘
julia> @perfmon "FLOPS_DP" sum(llvm_log, x);
Group: FLOPS_DP
┌──────────────────────────────────────────┬──────────┐
│ Event │ Thread 1 │
├──────────────────────────────────────────┼──────────┤
│ INSTR_RETIRED_ANY │ 41451.0 │
│ CPU_CLK_UNHALTED_CORE │ 52758.0 │
│ CPU_CLK_UNHALTED_REF │ 52800.0 │
│ TOPDOWN_SLOTS │ 263790.0 │
│ FP_ARITH_INST_RETIRED_128B_PACKED_DOUBLE │ 0.0 │
│ FP_ARITH_INST_RETIRED_SCALAR_DOUBLE │ 12000.0 │
│ FP_ARITH_INST_RETIRED_256B_PACKED_DOUBLE │ 0.0 │
│ FP_ARITH_INST_RETIRED_512B_PACKED_DOUBLE │ 0.0 │
└──────────────────────────────────────────┴──────────┘
┌─────────────────────────┬────────────┐
│ Metric │ Thread 1 │
├─────────────────────────┼────────────┤
│ Runtime (RDTSC) [s] │ 4.77559e-5 │
│ Runtime unhalted [s] │ 2.40374e-5 │
│ Clock [MHz] │ 2193.08 │
│ CPI │ 1.27278 │
│ DP [MFLOP/s] │ 251.278 │
│ AVX DP [MFLOP/s] │ 0.0 │
│ AVX512 DP [MFLOP/s] │ 0.0 │
│ Packed [MUOPS/s] │ 0.0 │
│ Scalar [MUOPS/s] │ 251.278 │
│ Vectorization ratio [%] │ 0.0 │
└─────────────────────────┴────────────┘ |
https://cuda.juliagpu.org/stable/api/kernel/#Math could be a good starting point |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall this looks good. Mostly just formatting questions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
Checking more items of #591 as discussed in #2042 (comment):
examples/p4est_2d_dgsem/elixir_euler_shockcapturing_ec_float32.jl
)Notes
I used the pattern
to replace hardcoded
1e-12
.Issues
examples/p4est_2d_dgsem/elixir_euler_blast_wave_amr.jl
toFloat32
, I noticed quite a lot ofFloat64
computations coming fromcalc_boundary_flux!
. It runs out that theBoundaryConditionDirichlet(initial_condition)
was problematic, since theinitial_condition
usessincos(::Float32)
, which uses https://github.com/JuliaLang/julia/blob/86cba99f6f79bfc6b67e52f0575de211109b638c/base/special/trig.jl#L8-L10, i.e.,Float64
computations after the range reduction https://github.com/JuliaLang/julia/blob/86cba99f6f79bfc6b67e52f0575de211109b638c/base/special/rem_pio2.jl#L282-L293