
adapt auxiliary math functions for Float32 #2048

Merged: 11 commits into main from hr/ln_mean on Aug 22, 2024

Conversation

@ranocha (Member) commented Aug 21, 2024

Checking more items of #591, as discussed in #2042 (comment):

  • shock capturing (checked for examples/p4est_2d_dgsem/elixir_euler_shockcapturing_ec_float32.jl)

Notes

I used the pattern

# For `Float64`, this gives 1.8189894035458565e-12
# For `Float32`, this gives 1.1920929f-5
atol = max(100 * eps(RealT), eps(RealT)^convert(RealT, 0.75f0))

to replace the hardcoded 1e-12.
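
As a quick sanity check of the values quoted in the comment (a minimal sketch in plain Julia, independent of Trixi.jl):

# Evaluate the tolerance pattern for both precisions.
# For Float64, eps(RealT)^0.75 ≈ 1.8e-12 dominates the max;
# for Float32, 100 * eps(RealT) ≈ 1.2e-5 dominates.
for RealT in (Float64, Float32)
    atol = max(100 * eps(RealT), eps(RealT)^convert(RealT, 0.75f0))
    println(RealT, " => ", atol)
end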



Review checklist

This checklist is meant to assist creators of PRs (to let them know what reviewers will typically look for) and reviewers (to guide them in a structured review process). Items do not need to be checked explicitly for a PR to be eligible for merging.

Purpose and scope

  • The PR has a single goal that is clear from the PR title and/or description.
  • All code changes represent a single set of modifications that logically belong together.
  • No more than 500 lines of code are changed or there is no obvious way to split the PR into multiple PRs.

Code quality

  • The code can be understood easily.
  • Newly introduced names for variables etc. are self-descriptive and consistent with existing naming conventions.
  • There are no redundancies that can be removed by simple modularization/refactoring.
  • There are no leftover debug statements or commented code sections.
  • The code adheres to our conventions and style guide, and to the Julia guidelines.

Documentation

  • New functions and types are documented with a docstring or top-level comment.
  • Relevant publications are referenced in docstrings (see example for formatting).
  • Inline comments are used to document longer or unusual code sections.
  • Comments describe intent ("why?") and not just functionality ("what?").
  • If the PR introduces a significant change or new feature, it is documented in NEWS.md with its PR number.

Testing

  • The PR passes all tests.
  • New or modified lines of code are covered by tests.
  • New or modified tests run in less than 10 seconds.

Performance

  • There are no type instabilities or memory allocations in performance-critical parts.
  • If the PR intent is to improve performance, before/after time measurements are posted in the PR.

Verification

  • The correctness of the code was verified using appropriate tests.
  • If new equations/methods are added, a convergence test has been run and the results are posted in the PR.

Created with ❤️ by the Trixi.jl community.

@ranocha mentioned this pull request on Aug 21, 2024.
codecov bot commented Aug 21, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 96.30%. Comparing base (20ab97b) to head (fb0366f).
Report is 1 commit behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #2048   +/-   ##
=======================================
  Coverage   96.30%   96.30%           
=======================================
  Files         467      467           
  Lines       37256    37263    +7     
=======================================
+ Hits        35879    35886    +7     
  Misses       1377     1377           
Flag        Coverage Δ
unittests   96.30% <100.00%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown.


@ranocha (Member, Author) commented Aug 21, 2024

julia> redirect_stdio(stdout = devnull, stderr = devnull) do
           trixi_include("examples/p4est_2d_dgsem/elixir_euler_shockcapturing_ec_float32.jl",
                         tspan = (0.0f0, 1.0f0), surface_flux = flux_central,
                         volume_flux = flux_central, volume_flux_dg = flux_central,
                         volume_flux_fv = flux_central)
       end

julia> function doit!(du_ode, u_ode, semi, n)
           for _ in 1:n
               Trixi.rhs!(du_ode, u_ode, semi, zero(eltype(u_ode)));
           end
       end
doit! (generic function with 1 method)

julia> u_ode = copy(sol.u[end]); du_ode = similar(u_ode); u = Trixi.wrap_array(u_ode, semi); du = Trixi.wrap_array(du_ode, semi);

julia> doit!(du_ode, u_ode, semi, 1) # compile

julia> @perfmon "FLOPS_DP" doit!(du_ode, u_ode, semi, 10^3);

Group: FLOPS_DP
┌──────────────────────────────────────────┬───────────┐
│                                    Event │  Thread 1 │
├──────────────────────────────────────────┼───────────┤
│                        INSTR_RETIRED_ANY │ 7.66774e9 │
│                    CPU_CLK_UNHALTED_CORE │  1.9519e9 │
│                     CPU_CLK_UNHALTED_REF │ 1.30181e9 │
│                            TOPDOWN_SLOTS │  9.7594e9 │
│ FP_ARITH_INST_RETIRED_128B_PACKED_DOUBLE │       0.0 │
│      FP_ARITH_INST_RETIRED_SCALAR_DOUBLE │       0.0 │
│ FP_ARITH_INST_RETIRED_256B_PACKED_DOUBLE │       0.0 │
│ FP_ARITH_INST_RETIRED_512B_PACKED_DOUBLE │       0.0 │
└──────────────────────────────────────────┴───────────┘
┌─────────────────────────┬──────────┐
│                  Metric │ Thread 1 │
├─────────────────────────┼──────────┤
│     Runtime (RDTSC) [s] │ 0.593699 │
│    Runtime unhalted [s] │  0.88932 │
│             Clock [MHz] │  3290.88 │
│                     CPI │  0.25456 │
│            DP [MFLOP/s] │      0.0 │
│        AVX DP [MFLOP/s] │      0.0 │
│     AVX512 DP [MFLOP/s] │      0.0 │
│        Packed [MUOPS/s] │      0.0 │
│        Scalar [MUOPS/s] │      0.0 │
│ Vectorization ratio [%] │      NaN │
└─────────────────────────┴──────────┘

@ranocha (Member, Author) commented Aug 21, 2024

We get Float64 computations internally from log, e.g. in Base: https://github.com/JuliaLang/julia/blob/54142b7a8255d2231af8182eae52417eb6c72327/base/special/log.jl#L215-L255

julia> redirect_stdio(stdout = devnull, stderr = devnull) do
           trixi_include("examples/p4est_2d_dgsem/elixir_euler_shockcapturing_ec_float32.jl",
                         tspan = (0.0f0, 1.0f0))
       end

julia> @perfmon "FLOPS_DP" doit!(du_ode, u_ode, semi, 10^3);

Group: FLOPS_DP
┌──────────────────────────────────────────┬────────────┐
│                                    Event │   Thread 1 │
├──────────────────────────────────────────┼────────────┤
│                        INSTR_RETIRED_ANY │ 1.10631e10 │
│                    CPU_CLK_UNHALTED_CORE │  4.44565e9 │
│                     CPU_CLK_UNHALTED_REF │  2.94592e9 │
│                            TOPDOWN_SLOTS │ 2.22278e10 │
│ FP_ARITH_INST_RETIRED_128B_PACKED_DOUBLE │        0.0 │
│      FP_ARITH_INST_RETIRED_SCALAR_DOUBLE │  2.71716e8 │
│ FP_ARITH_INST_RETIRED_256B_PACKED_DOUBLE │        0.0 │
│ FP_ARITH_INST_RETIRED_512B_PACKED_DOUBLE │        0.0 │
└──────────────────────────────────────────┴────────────┘
┌─────────────────────────┬──────────┐
│                  Metric │ Thread 1 │
├─────────────────────────┼──────────┤
│     Runtime (RDTSC) [s] │  1.34572 │
│    Runtime unhalted [s] │  2.02552 │
│             Clock [MHz] │  3312.18 │
│                     CPI │ 0.401847 │
│            DP [MFLOP/s] │  201.911 │
│        AVX DP [MFLOP/s] │      0.0 │
│     AVX512 DP [MFLOP/s] │      0.0 │
│        Packed [MUOPS/s] │      0.0 │
│        Scalar [MUOPS/s] │  201.911 │
│ Vectorization ratio [%] │      0.0 │
└─────────────────────────┴──────────┘
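
One way to double-check the promotion is to inspect the generated code (a sketch; the exact output depends on the Julia version):

julia> code_llvm(log, (Float32,); debuginfo = :none)

On recent Julia versions, the printed IR promotes the Float32 argument to double (fpext float ... to double) and performs the core computation in double precision, consistent with the scalar-double counts above.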

However, GPU computing will use specialized implementations as far as I understand, so this should not be a problem. As far as I can tell, the log is the only reason we get the Float64 computations here. With the diff

shell> git diff
diff --git a/src/auxiliary/math.jl b/src/auxiliary/math.jl
index fa816da9a..81b1c46d1 100644
--- a/src/auxiliary/math.jl
+++ b/src/auxiliary/math.jl
@@ -176,15 +176,15 @@ Given ε = 1.0e-4, we use the following algorithm.
 @inline function ln_mean(x::RealT, y::RealT) where {RealT <: Real}
     epsilon_f2 = convert(RealT, 1.0e-4)
     f2 = (x * (x - 2 * y) + y * y) / (x * (x + 2 * y) + y * y) # f2 = f^2
-    if f2 < epsilon_f2
+    # if f2 < epsilon_f2
         return (x + y) / @evalpoly(f2,
                          2,
                          convert(RealT, 2 / 3),
                          convert(RealT, 2 / 5),
                          convert(RealT, 2 / 7))
-    else
-        return (y - x) / log(y / x)
-    end
+    # else
+    #     return (y - x) / log(y / x)
+    # end
 end
 
 """
@@ -202,15 +202,15 @@ multiplication.
 @inline function inv_ln_mean(x::RealT, y::RealT) where {RealT <: Real}
     epsilon_f2 = convert(RealT, 1.0e-4)
     f2 = (x * (x - 2 * y) + y * y) / (x * (x + 2 * y) + y * y) # f2 = f^2
-    if f2 < epsilon_f2
+    # if f2 < epsilon_f2
         return @evalpoly(f2,
                          2,
                          convert(RealT, 2 / 3),
                          convert(RealT, 2 / 5),
                          convert(RealT, 2 / 7)) / (x + y)
-    else
-        return log(y / x) / (y - x)
-    end
+    # else
+    #     return log(y / x) / (y - x)
+    # end
 end

I get

julia> @perfmon "FLOPS_DP" doit!(du_ode, u_ode, semi, 10^3);

Group: FLOPS_DP
┌──────────────────────────────────────────┬────────────┐
│                                    Event │   Thread 1 │
├──────────────────────────────────────────┼────────────┤
│                        INSTR_RETIRED_ANY │  9.45534e9 │
│                    CPU_CLK_UNHALTED_CORE │  3.54369e9 │
│                     CPU_CLK_UNHALTED_REF │  2.36548e9 │
│                            TOPDOWN_SLOTS │ 1.77174e10 │
│ FP_ARITH_INST_RETIRED_128B_PACKED_DOUBLE │        0.0 │
│      FP_ARITH_INST_RETIRED_SCALAR_DOUBLE │        0.0 │
│ FP_ARITH_INST_RETIRED_256B_PACKED_DOUBLE │        0.0 │
│ FP_ARITH_INST_RETIRED_512B_PACKED_DOUBLE │        0.0 │
└──────────────────────────────────────────┴────────────┘
┌─────────────────────────┬──────────┐
│                  Metric │ Thread 1 │
├─────────────────────────┼──────────┤
│     Runtime (RDTSC) [s] │  1.07998 │
│    Runtime unhalted [s] │  1.61457 │
│             Clock [MHz] │  3288.04 │
│                     CPI │ 0.374782 │
│            DP [MFLOP/s] │      0.0 │
│        AVX DP [MFLOP/s] │      0.0 │
│     AVX512 DP [MFLOP/s] │      0.0 │
│        Packed [MUOPS/s] │      0.0 │
│        Scalar [MUOPS/s] │      0.0 │
│ Vectorization ratio [%] │      NaN │
└─────────────────────────┴──────────┘
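
For context, the polynomial branch kept in this experiment comes from the series expansion behind ln_mean (a brief recap of the derivation documented in src/auxiliary/math.jl): with

    f = (y - x) / (x + y),   f^2 = (x * (x - 2 * y) + y^2) / (x * (x + 2 * y) + y^2),

one has log(y/x) = 2 * atanh(f) = 2f * (1 + f^2/3 + f^4/5 + f^6/7 + ...), and therefore

    (y - x) / log(y/x) = (x + y) / (2 + (2/3) f^2 + (2/5) f^4 + (2/7) f^6 + O(f^8)),

which is exactly the polynomial evaluated by @evalpoly in the code. Commenting out the log branch thus only degrades accuracy where f^2 >= 1.0e-4, which is acceptable for this FLOP-counting experiment.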

@huiyuxie (Member) commented Aug 22, 2024

Good!

However, GPU computing will use specialized implementations as far as I understand, so this should not be a problem

  • Let me check: yes, it is overridden by @device_override, which calls the extern __nv_log
  • The broken tests can be turned back into regular tests once this is merged

We get Float64 computations internally from log in Base

I have some questions. Are you using the default setting, i.e., set_log_type("log_Trixi_NaN"), or the custom setting, i.e., set_log_type("log_Base")? If you are using the custom setting, that makes sense. But if you are using the default setting,

@inline log(x::Real) = x < zero(x) ? oftype(x, NaN) : Base.log(x)
@inline log(x::Float64) = ccall("llvm.log.f64", llvmcall, Float64, (Float64,), x)
@inline log(x::Float32) = ccall("llvm.log.f32", llvmcall, Float32, (Float32,), x)
@inline log(x::Float16) = ccall("llvm.log.f16", llvmcall, Float16, (Float16,), x)

does it mean that the LLVM implementation also uses double precision internally when the inputs are single precision?

I'm not sure whether I understand this correctly; I'm just curious. @ranocha

@ranocha (Member, Author) commented Aug 22, 2024

does it mean that the LLVM implementation also uses double precision internally when the inputs are single precision?

Based on my observations, I would say yes, but I don't have a link to the relevant LLVM source code. Here are some results:

julia> using LIKWID

julia> @inline llvm_log(x::Float64) = ccall("llvm.log.f64", llvmcall, Float64, (Float64,), x)
llvm_log (generic function with 1 method)

julia> @inline llvm_log(x::Float32) = ccall("llvm.log.f32", llvmcall, Float32, (Float32,), x)
llvm_log (generic function with 2 methods)

julia> @inline llvm_log(x::Float16) = ccall("llvm.log.f16", llvmcall, Float16, (Float16,), x)
llvm_log (generic function with 3 methods)

julia> @inline llvm_log(x::Real) = x < zero(x) ? oftype(x, NaN) : Base.log(x)
llvm_log (generic function with 4 methods)

julia> x = abs.(1.0f3 * randn(Float32, 10^3));

julia> sum(Base.log, x)
6244.008f0

julia> sum(llvm_log, x)
6244.011f0

julia> @perfmon "FLOPS_DP" sum(Base.log, x);

Group: FLOPS_DP
┌──────────────────────────────────────────┬──────────┐
│                                    Event │ Thread 1 │
├──────────────────────────────────────────┼──────────┤
│                        INSTR_RETIRED_ANY │  72873.0 │
│                    CPU_CLK_UNHALTED_CORE │  57160.0 │
│                     CPU_CLK_UNHALTED_REF │  57288.0 │
│                            TOPDOWN_SLOTS │ 285800.0 │
│ FP_ARITH_INST_RETIRED_128B_PACKED_DOUBLE │      0.0 │
│      FP_ARITH_INST_RETIRED_SCALAR_DOUBLE │   3000.0 │
│ FP_ARITH_INST_RETIRED_256B_PACKED_DOUBLE │      0.0 │
│ FP_ARITH_INST_RETIRED_512B_PACKED_DOUBLE │      0.0 │
└──────────────────────────────────────────┴──────────┘
┌─────────────────────────┬────────────┐
│                  Metric │   Thread 1 │
├─────────────────────────┼────────────┤
│     Runtime (RDTSC) [s] │ 3.00598e-5 │
│    Runtime unhalted [s] │  2.6043e-5 │
│             Clock [MHz] │    2189.92 │
│                     CPI │   0.784378 │
│            DP [MFLOP/s] │    99.8012 │
│        AVX DP [MFLOP/s] │        0.0 │
│     AVX512 DP [MFLOP/s] │        0.0 │
│        Packed [MUOPS/s] │        0.0 │
│        Scalar [MUOPS/s] │    99.8012 │
│ Vectorization ratio [%] │        0.0 │
└─────────────────────────┴────────────┘

julia> @perfmon "FLOPS_DP" sum(llvm_log, x);

Group: FLOPS_DP
┌──────────────────────────────────────────┬──────────┐
│                                    Event │ Thread 1 │
├──────────────────────────────────────────┼──────────┤
│                        INSTR_RETIRED_ANY │  41451.0 │
│                    CPU_CLK_UNHALTED_CORE │  52758.0 │
│                     CPU_CLK_UNHALTED_REF │  52800.0 │
│                            TOPDOWN_SLOTS │ 263790.0 │
│ FP_ARITH_INST_RETIRED_128B_PACKED_DOUBLE │      0.0 │
│      FP_ARITH_INST_RETIRED_SCALAR_DOUBLE │  12000.0 │
│ FP_ARITH_INST_RETIRED_256B_PACKED_DOUBLE │      0.0 │
│ FP_ARITH_INST_RETIRED_512B_PACKED_DOUBLE │      0.0 │
└──────────────────────────────────────────┴──────────┘
┌─────────────────────────┬────────────┐
│                  Metric │   Thread 1 │
├─────────────────────────┼────────────┤
│     Runtime (RDTSC) [s] │ 4.77559e-5 │
│    Runtime unhalted [s] │ 2.40374e-5 │
│             Clock [MHz] │    2193.08 │
│                     CPI │    1.27278 │
│            DP [MFLOP/s] │    251.278 │
│        AVX DP [MFLOP/s] │        0.0 │
│     AVX512 DP [MFLOP/s] │        0.0 │
│        Packed [MUOPS/s] │        0.0 │
│        Scalar [MUOPS/s] │    251.278 │
│ Vectorization ratio [%] │        0.0 │
└─────────────────────────┴────────────┘
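
Reading the two tables together: for the 10^3 inputs, sum(Base.log, x) retires 3000 scalar double-precision instructions (3 per call), while sum(llvm_log, x) retires 12000 (12 per call), so both the Base implementation and the LLVM intrinsic evidently do at least part of their work in double precision.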

@ranocha (Member, Author) commented Aug 22, 2024

Good!

However, GPU computing will use specialized implementations as far as I understand, so this should not be a problem

  • Let me check
  • The broken tests can be fixed back to tests once this is merged

https://cuda.juliagpu.org/stable/api/kernel/#Math could be a good starting point
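
For illustration, the device-side overrides mentioned above look roughly like the following in CUDA.jl (a sketch from memory of CUDA.jl's internal math intrinsics, not a user-facing API; the authoritative definitions live in the CUDA.jl sources and may differ in detail):

# Sketch: how CUDA.jl redirects Base.log to libdevice on the GPU.
# `@device_override` is CUDA.jl-internal machinery, shown here only to
# illustrate why the CPU Float64 fallback is not hit in device code.
@device_override Base.log(x::Float64) =
    ccall("extern __nv_log", llvmcall, Cdouble, (Cdouble,), x)
@device_override Base.log(x::Float32) =
    ccall("extern __nv_logf", llvmcall, Cfloat, (Cfloat,), x)

With overrides like these, a Float32 log inside a CUDA kernel stays in single precision via __nv_logf instead of taking the Float64 path discussed above.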

@andrewwinters5000 (Member) left a comment

Overall this looks good. Mostly just formatting questions.

Review threads (all resolved): test/test_p4est_2d.jl (1), test/test_type.jl (7)
@andrewwinters5000 (Member) left a comment

LGTM!

@ranocha ranocha merged commit 8d7b20a into main Aug 22, 2024
38 checks passed
@ranocha ranocha deleted the hr/ln_mean branch August 22, 2024 12:12