adapt auxiliary math functions for `Float32` #2048

ranocha · 2024-08-21T08:13:50Z

Checking more items of #591 as discussed in #2042 (comment):

shock capturing (checked for examples/p4est_2d_dgsem/elixir_euler_shockcapturing_ec_float32.jl)

Notes

I used the pattern

# For `Float64`, this gives 1.8189894035458565e-12
# For `Float32`, this gives 1.1920929f-5
atol = max(100 * eps(RealT), eps(RealT)^convert(RealT, 0.75f0))

to replace hardcoded 1e-12.

Issues

When converting examples/p4est_2d_dgsem/elixir_euler_blast_wave_amr.jl to Float32, I noticed quite a lot of Float64 computations coming from calc_boundary_flux!. It runs out that the BoundaryConditionDirichlet(initial_condition) was problematic, since the initial_condition uses sincos(::Float32), which uses https://github.com/JuliaLang/julia/blob/86cba99f6f79bfc6b67e52f0575de211109b638c/base/special/trig.jl#L8-L10, i.e., Float64 computations after the range reduction https://github.com/JuliaLang/julia/blob/86cba99f6f79bfc6b67e52f0575de211109b638c/base/special/rem_pio2.jl#L282-L293

github-actions · 2024-08-21T08:14:03Z

Review checklist

This checklist is meant to assist creators of PRs (to let them know what reviewers will typically look for) and reviewers (to guide them in a structured review process). Items do not need to be checked explicitly for a PR to be eligible for merging.

Purpose and scope

The PR has a single goal that is clear from the PR title and/or description.
All code changes represent a single set of modifications that logically belong together.
No more than 500 lines of code are changed or there is no obvious way to split the PR into multiple PRs.

Code quality

The code can be understood easily.
Newly introduced names for variables etc. are self-descriptive and consistent with existing naming conventions.
There are no redundancies that can be removed by simple modularization/refactoring.
There are no leftover debug statements or commented code sections.
The code adheres to our conventions and style guide, and to the Julia guidelines.

Documentation

New functions and types are documented with a docstring or top-level comment.
Relevant publications are referenced in docstrings (see example for formatting).
Inline comments are used to document longer or unusual code sections.
Comments describe intent ("why?") and not just functionality ("what?").
If the PR introduces a significant change or new feature, it is documented in NEWS.md with its PR number.

Testing

The PR passes all tests.
New or modified lines of code are covered by tests.
New or modified tests run in less then 10 seconds.

Performance

There are no type instabilities or memory allocations in performance-critical parts.
If the PR intent is to improve performance, before/after time measurements are posted in the PR.

Verification

The correctness of the code was verified using appropriate tests.
If new equations/methods are added, a convergence test has been run and the results
are posted in the PR.

Created with ❤️ by the Trixi.jl community.

src/auxiliary/math.jl

codecov · 2024-08-21T08:51:13Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 96.30%. Comparing base (20ab97b) to head (fb0366f).
Report is 1 commits behind head on main.

Additional details and impacted files

@@           Coverage Diff           @@
##             main    #2048   +/-   ##
=======================================
  Coverage   96.30%   96.30%           
=======================================
  Files         467      467           
  Lines       37256    37263    +7     
=======================================
+ Hits        35879    35886    +7     
  Misses       1377     1377

Flag	Coverage Δ
unittests	`96.30% <100.00%> (+<0.01%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

ranocha · 2024-08-21T14:35:14Z

julia> redirect_stdio(stdout = devnull, stderr = devnull) do
           trixi_include("examples/p4est_2d_dgsem/elixir_euler_shockcapturing_ec_float32.jl", tspan=(0.0f0, 1.0f0), surface_flux = flux_central, volume_flux = flux_central, volume_flux_dg = flux_central, volume_flux_fv = flux_central)
       end

julia> function doit!(du_ode, u_ode, semi, n)
           for _ in 1:n
               Trixi.rhs!(du_ode, u_ode, semi, zero(eltype(u_ode)));
           end
       end
doit! (generic function with 1 method)

julia> u_ode = copy(sol.u[end]); du_ode = similar(u_ode); u = Trixi.wrap_array(u_ode, semi); du = Trixi.wrap_array(du_ode, semi);

julia> doit!(du_ode, u_ode, semi, 1) # compile

julia> @perfmon "FLOPS_DP" doit!(du_ode, u_ode, semi, 10^3);

Group: FLOPS_DP
┌──────────────────────────────────────────┬───────────┐
│                                    Event │  Thread 1 │
├──────────────────────────────────────────┼───────────┤
│                        INSTR_RETIRED_ANY │ 7.66774e9 │
│                    CPU_CLK_UNHALTED_CORE │  1.9519e9 │
│                     CPU_CLK_UNHALTED_REF │ 1.30181e9 │
│                            TOPDOWN_SLOTS │  9.7594e9 │
│ FP_ARITH_INST_RETIRED_128B_PACKED_DOUBLE │       0.0 │
│      FP_ARITH_INST_RETIRED_SCALAR_DOUBLE │       0.0 │
│ FP_ARITH_INST_RETIRED_256B_PACKED_DOUBLE │       0.0 │
│ FP_ARITH_INST_RETIRED_512B_PACKED_DOUBLE │       0.0 │
└──────────────────────────────────────────┴───────────┘
┌─────────────────────────┬──────────┐
│                  Metric │ Thread 1 │
├─────────────────────────┼──────────┤
│     Runtime (RDTSC) [s] │ 0.593699 │
│    Runtime unhalted [s] │  0.88932 │
│             Clock [MHz] │  3290.88 │
│                     CPI │  0.25456 │
│            DP [MFLOP/s] │      0.0 │
│        AVX DP [MFLOP/s] │      0.0 │
│     AVX512 DP [MFLOP/s] │      0.0 │
│        Packed [MUOPS/s] │      0.0 │
│        Scalar [MUOPS/s] │      0.0 │
│ Vectorization ratio [%] │      NaN │
└─────────────────────────┴──────────┘

ranocha · 2024-08-21T14:40:18Z

We get Float64 computations internally from log, e.g. in Base: https://github.com/JuliaLang/julia/blob/54142b7a8255d2231af8182eae52417eb6c72327/base/special/log.jl#L215-L255

julia> redirect_stdio(stdout = devnull, stderr = devnull) do
           trixi_include("examples/p4est_2d_dgsem/elixir_euler_shockcapturing_ec_float32.jl", tspan=(0.0f0, 1.0f0))
       end

julia> @perfmon "FLOPS_DP" doit!(du_ode, u_ode, semi, 10^3);

Group: FLOPS_DP
┌──────────────────────────────────────────┬────────────┐
│                                    Event │   Thread 1 │
├──────────────────────────────────────────┼────────────┤
│                        INSTR_RETIRED_ANY │ 1.10631e10 │
│                    CPU_CLK_UNHALTED_CORE │  4.44565e9 │
│                     CPU_CLK_UNHALTED_REF │  2.94592e9 │
│                            TOPDOWN_SLOTS │ 2.22278e10 │
│ FP_ARITH_INST_RETIRED_128B_PACKED_DOUBLE │        0.0 │
│      FP_ARITH_INST_RETIRED_SCALAR_DOUBLE │  2.71716e8 │
│ FP_ARITH_INST_RETIRED_256B_PACKED_DOUBLE │        0.0 │
│ FP_ARITH_INST_RETIRED_512B_PACKED_DOUBLE │        0.0 │
└──────────────────────────────────────────┴────────────┘
┌─────────────────────────┬──────────┐
│                  Metric │ Thread 1 │
├─────────────────────────┼──────────┤
│     Runtime (RDTSC) [s] │  1.34572 │
│    Runtime unhalted [s] │  2.02552 │
│             Clock [MHz] │  3312.18 │
│                     CPI │ 0.401847 │
│            DP [MFLOP/s] │  201.911 │
│        AVX DP [MFLOP/s] │      0.0 │
│     AVX512 DP [MFLOP/s] │      0.0 │
│        Packed [MUOPS/s] │      0.0 │
│        Scalar [MUOPS/s] │  201.911 │
│ Vectorization ratio [%] │      0.0 │
└─────────────────────────┴──────────┘

However, GPU computing will use specialized implementations as far as I understand, so this should not be a problem. As far as I can tell, the log is the only reason we get the Float64 computations here - with the diff

shell> git diff
diff --git a/src/auxiliary/math.jl b/src/auxiliary/math.jl
index fa816da9a..81b1c46d1 100644
--- a/src/auxiliary/math.jl
+++ b/src/auxiliary/math.jl
@@ -176,15 +176,15 @@ Given ε = 1.0e-4, we use the following algorithm.
 @inline function ln_mean(x::RealT, y::RealT) where {RealT <: Real}
     epsilon_f2 = convert(RealT, 1.0e-4)
     f2 = (x * (x - 2 * y) + y * y) / (x * (x + 2 * y) + y * y) # f2 = f^2
-    if f2 < epsilon_f2
+    # if f2 < epsilon_f2
         return (x + y) / @evalpoly(f2,
                          2,
                          convert(RealT, 2 / 3),
                          convert(RealT, 2 / 5),
                          convert(RealT, 2 / 7))
-    else
-        return (y - x) / log(y / x)
-    end
+    # else
+    #     return (y - x) / log(y / x)
+    # end
 end
 
 """
@@ -202,15 +202,15 @@ multiplication.
 @inline function inv_ln_mean(x::RealT, y::RealT) where {RealT <: Real}
     epsilon_f2 = convert(RealT, 1.0e-4)
     f2 = (x * (x - 2 * y) + y * y) / (x * (x + 2 * y) + y * y) # f2 = f^2
-    if f2 < epsilon_f2
+    # if f2 < epsilon_f2
         return @evalpoly(f2,
                          2,
                          convert(RealT, 2 / 3),
                          convert(RealT, 2 / 5),
                          convert(RealT, 2 / 7)) / (x + y)
-    else
-        return log(y / x) / (y - x)
-    end
+    # else
+    #     return log(y / x) / (y - x)
+    # end
 end

I get

julia> @perfmon "FLOPS_DP" doit!(du_ode, u_ode, semi, 10^3);

Group: FLOPS_DP
┌──────────────────────────────────────────┬────────────┐
│                                    Event │   Thread 1 │
├──────────────────────────────────────────┼────────────┤
│                        INSTR_RETIRED_ANY │  9.45534e9 │
│                    CPU_CLK_UNHALTED_CORE │  3.54369e9 │
│                     CPU_CLK_UNHALTED_REF │  2.36548e9 │
│                            TOPDOWN_SLOTS │ 1.77174e10 │
│ FP_ARITH_INST_RETIRED_128B_PACKED_DOUBLE │        0.0 │
│      FP_ARITH_INST_RETIRED_SCALAR_DOUBLE │        0.0 │
│ FP_ARITH_INST_RETIRED_256B_PACKED_DOUBLE │        0.0 │
│ FP_ARITH_INST_RETIRED_512B_PACKED_DOUBLE │        0.0 │
└──────────────────────────────────────────┴────────────┘
┌─────────────────────────┬──────────┐
│                  Metric │ Thread 1 │
├─────────────────────────┼──────────┤
│     Runtime (RDTSC) [s] │  1.07998 │
│    Runtime unhalted [s] │  1.61457 │
│             Clock [MHz] │  3288.04 │
│                     CPI │ 0.374782 │
│            DP [MFLOP/s] │      0.0 │
│        AVX DP [MFLOP/s] │      0.0 │
│     AVX512 DP [MFLOP/s] │      0.0 │
│        Packed [MUOPS/s] │      0.0 │
│        Scalar [MUOPS/s] │      0.0 │
│ Vectorization ratio [%] │      NaN │
└─────────────────────────┴──────────┘

huiyuxie · 2024-08-22T02:23:21Z

Good!

However, GPU computing will use specialized implementations as far as I understand, so this should not be a problem

Let me check - yes it is overriden by @device_override calling to extern __nv_log
The broken tests can be fixed back to tests once this is merged

We get Float64 computations internally from log in Base

I have some questions. Are you using the default setting, i.e., set_log_type("log_Trixi_NaN"), or the custom setting, i.e., set_log_type("log_Base")? If you are under the custom setting, it quite makes sense. But if you are under the default setting,

Trixi.jl/src/auxiliary/math.jl

Lines 119 to 123 in 20ab97b

    
           @inline log(x::Real) = x < zero(x) ? oftype(x, NaN) : Base.log(x) 
        
           @inline log(x::Float64) = ccall("llvm.log.f64", llvmcall, Float64, (Float64,), x) 
        
           @inline log(x::Float32) = ccall("llvm.log.f32", llvmcall, Float32, (Float32,), x) 
        
           @inline log(x::Float16) = ccall("llvm.log.f16", llvmcall, Float16, (Float16,), x)

does it mean that the LLVM project also use double precision intrinsically when the inputs are single precision?

I'm not sure whether I understand correctly and I'm just curious @ranocha

ranocha · 2024-08-22T05:54:36Z

does it mean that the LLVM project also use double precision intrinsically when the inputs are single precision?

Based on my observations, I would say yes - but I don't have a link to the source code for LLVM. Here are some results:

julia> using LIKWID

julia> @inline llvm_log(x::Float64) = ccall("llvm.log.f64", llvmcall, Float64, (Float64,), x)
llvm_log (generic function with 1 method)

julia> @inline llvm_log(x::Float32) = ccall("llvm.log.f32", llvmcall, Float32, (Float32,), x)
llvm_log (generic function with 2 methods)

julia> @inline llvm_log(x::Float16) = ccall("llvm.log.f16", llvmcall, Float16, (Float16,), x)
llvm_log (generic function with 3 methods)

julia> @inline llvm_log(x::Real) = x < zero(x) ? oftype(x, NaN) : Base.log(x)
llvm_log (generic function with 4 methods)

julia> x = abs.(1.0f3 * randn(Float32, 10^3));

julia> sum(Base.log, x)
6244.008f0

julia> sum(llvm_log, x)
6244.011f0

julia> @perfmon "FLOPS_DP" sum(Base.log, x);

Group: FLOPS_DP
┌──────────────────────────────────────────┬──────────┐
│                                    Event │ Thread 1 │
├──────────────────────────────────────────┼──────────┤
│                        INSTR_RETIRED_ANY │  72873.0 │
│                    CPU_CLK_UNHALTED_CORE │  57160.0 │
│                     CPU_CLK_UNHALTED_REF │  57288.0 │
│                            TOPDOWN_SLOTS │ 285800.0 │
│ FP_ARITH_INST_RETIRED_128B_PACKED_DOUBLE │      0.0 │
│      FP_ARITH_INST_RETIRED_SCALAR_DOUBLE │   3000.0 │
│ FP_ARITH_INST_RETIRED_256B_PACKED_DOUBLE │      0.0 │
│ FP_ARITH_INST_RETIRED_512B_PACKED_DOUBLE │      0.0 │
└──────────────────────────────────────────┴──────────┘
┌─────────────────────────┬────────────┐
│                  Metric │   Thread 1 │
├─────────────────────────┼────────────┤
│     Runtime (RDTSC) [s] │ 3.00598e-5 │
│    Runtime unhalted [s] │  2.6043e-5 │
│             Clock [MHz] │    2189.92 │
│                     CPI │   0.784378 │
│            DP [MFLOP/s] │    99.8012 │
│        AVX DP [MFLOP/s] │        0.0 │
│     AVX512 DP [MFLOP/s] │        0.0 │
│        Packed [MUOPS/s] │        0.0 │
│        Scalar [MUOPS/s] │    99.8012 │
│ Vectorization ratio [%] │        0.0 │
└─────────────────────────┴────────────┘

julia> @perfmon "FLOPS_DP" sum(llvm_log, x);

Group: FLOPS_DP
┌──────────────────────────────────────────┬──────────┐
│                                    Event │ Thread 1 │
├──────────────────────────────────────────┼──────────┤
│                        INSTR_RETIRED_ANY │  41451.0 │
│                    CPU_CLK_UNHALTED_CORE │  52758.0 │
│                     CPU_CLK_UNHALTED_REF │  52800.0 │
│                            TOPDOWN_SLOTS │ 263790.0 │
│ FP_ARITH_INST_RETIRED_128B_PACKED_DOUBLE │      0.0 │
│      FP_ARITH_INST_RETIRED_SCALAR_DOUBLE │  12000.0 │
│ FP_ARITH_INST_RETIRED_256B_PACKED_DOUBLE │      0.0 │
│ FP_ARITH_INST_RETIRED_512B_PACKED_DOUBLE │      0.0 │
└──────────────────────────────────────────┴──────────┘
┌─────────────────────────┬────────────┐
│                  Metric │   Thread 1 │
├─────────────────────────┼────────────┤
│     Runtime (RDTSC) [s] │ 4.77559e-5 │
│    Runtime unhalted [s] │ 2.40374e-5 │
│             Clock [MHz] │    2193.08 │
│                     CPI │    1.27278 │
│            DP [MFLOP/s] │    251.278 │
│        AVX DP [MFLOP/s] │        0.0 │
│     AVX512 DP [MFLOP/s] │        0.0 │
│        Packed [MUOPS/s] │        0.0 │
│        Scalar [MUOPS/s] │    251.278 │
│ Vectorization ratio [%] │        0.0 │
└─────────────────────────┴────────────┘

ranocha · 2024-08-22T06:01:32Z

Good!

However, GPU computing will use specialized implementations as far as I understand, so this should not be a problem

Let me check

The broken tests can be fixed back to tests once this is merged

https://cuda.juliagpu.org/stable/api/kernel/#Math could be a good starting point

andrewwinters5000

Overall this looks good. Mostly just formatting questions.

test/test_p4est_2d.jl

test/test_type.jl

andrewwinters5000

LGTM!

ranocha added 2 commits August 21, 2024 10:06

adapt auxiliary math functions

845dc90

adapt tests

6c5a8a2

ranocha mentioned this pull request Aug 21, 2024

Add tests using other real types #591

Open

24 tasks

ranocha commented Aug 21, 2024

View reviewed changes

src/auxiliary/math.jl Show resolved Hide resolved

ranocha added 6 commits August 21, 2024 12:12

more tests and fix shock capturing volume integral

baa044a

fix typo in tests

ad04ca8

p4est_2d_dgsem/elixir_euler_shockcapturing_ec_float32.jl

0c1ea8d

format

2567fb5

fix stolarsky_mean test

f152c86

use Float32 for test tolerances

e3a187a

format

f0b6cb0

ranocha requested a review from andrewwinters5000 August 21, 2024 14:44

increase tolerance for CI

88038dc

format

fb0366f

andrewwinters5000 requested changes Aug 22, 2024

View reviewed changes

ranocha requested a review from andrewwinters5000 August 22, 2024 08:45

andrewwinters5000 approved these changes Aug 22, 2024

View reviewed changes

ranocha merged commit 8d7b20a into main Aug 22, 2024
38 checks passed

ranocha deleted the hr/ln_mean branch August 22, 2024 12:12

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

adapt auxiliary math functions for `Float32` #2048

adapt auxiliary math functions for `Float32` #2048

ranocha commented Aug 21, 2024 •

edited

Loading

github-actions bot commented Aug 21, 2024

codecov bot commented Aug 21, 2024 •

edited

Loading

ranocha commented Aug 21, 2024

ranocha commented Aug 21, 2024 •

edited

Loading

huiyuxie commented Aug 22, 2024 •

edited

Loading

ranocha commented Aug 22, 2024

ranocha commented Aug 22, 2024

andrewwinters5000 left a comment

andrewwinters5000 left a comment

adapt auxiliary math functions for Float32 #2048

adapt auxiliary math functions for Float32 #2048

Conversation

ranocha commented Aug 21, 2024 • edited Loading

Notes

Issues

github-actions bot commented Aug 21, 2024

Review checklist

Purpose and scope

Code quality

Documentation

Testing

Performance

Verification

codecov bot commented Aug 21, 2024 • edited Loading

Codecov Report

ranocha commented Aug 21, 2024

ranocha commented Aug 21, 2024 • edited Loading

huiyuxie commented Aug 22, 2024 • edited Loading

ranocha commented Aug 22, 2024

ranocha commented Aug 22, 2024

andrewwinters5000 left a comment

Choose a reason for hiding this comment

andrewwinters5000 left a comment

Choose a reason for hiding this comment

adapt auxiliary math functions for `Float32` #2048

adapt auxiliary math functions for `Float32` #2048

ranocha commented Aug 21, 2024 •

edited

Loading

codecov bot commented Aug 21, 2024 •

edited

Loading

ranocha commented Aug 21, 2024 •

edited

Loading

huiyuxie commented Aug 22, 2024 •

edited

Loading