
computations using only Float32 and no Float64 #2042

Merged
merged 21 commits into from
Aug 21, 2024

Conversation

ranocha
Member

@ranocha ranocha commented Aug 15, 2024

Due to the great work of @huiyuxie, we're well on our way to supporting computations purely in Float32. However, checking that we indeed use only Float32 and no Float64 everywhere is not trivial, as far as I know. This may also be interesting for @patrickersing.

The best option that comes to my mind is LIKWID.jl. After installing it (Linux only!) and applying the additional patch

--- a/src/auxiliary/auxiliary.jl
+++ b/src/auxiliary/auxiliary.jl
@@ -21,7 +21,7 @@ runtime of all measurements added so far via `take!(counter)`, resetting the
 """
 mutable struct PerformanceCounter
     ncalls_since_readout::Int
-    runtime::Float64
+    runtime::Float32
 end

I get the following results:

julia> using LIKWID, Revise; using Trixi
Precompiling Trixi

julia> redirect_stdout(devnull) do 
           trixi_include("examples/structured_2d_dgsem/elixir_advection_float32.jl")
       end

julia> u_ode = copy(sol.u[end]); du_ode = similar(u_ode);

julia> function doit!(du_ode, u_ode, semi, n)
           for _ in 1:n
               Trixi.rhs!(du_ode, u_ode, semi, zero(eltype(u_ode)));
           end
       end

julia> doit!(du_ode, u_ode, semi, 1) # compile

julia> @perfmon "FLOPS_DP" doit!(du_ode, u_ode, semi, 10^4);

Group: FLOPS_DP
┌──────────────────────────────────────────┬───────────┐
│                                    Event │  Thread 1 │
├──────────────────────────────────────────┼───────────┤
│                        INSTR_RETIRED_ANY │ 1.0181e10 │
│                    CPU_CLK_UNHALTED_CORE │ 3.17159e9 │
│                     CPU_CLK_UNHALTED_REF │  2.5141e9 │
│ FP_ARITH_INST_RETIRED_128B_PACKED_DOUBLE │       0.0 │
│      FP_ARITH_INST_RETIRED_SCALAR_DOUBLE │       0.0 │
│ FP_ARITH_INST_RETIRED_256B_PACKED_DOUBLE │       0.0 │
└──────────────────────────────────────────┴───────────┘
┌──────────────────────┬──────────┐
│               Metric │ Thread 1 │
├──────────────────────┼──────────┤
│  Runtime (RDTSC) [s] │  0.71929 │
│ Runtime unhalted [s] │ 0.858109 │
│          Clock [MHz] │  4662.61 │
│                  CPI │  0.31152 │
│         DP [MFLOP/s] │      0.0 │
│     AVX DP [MFLOP/s] │      0.0 │
│     Packed [MUOPS/s] │      0.0 │
│     Scalar [MUOPS/s] │      0.0 │
│  Vectorization ratio │      NaN │
└──────────────────────┴──────────┘

This confirms that our Trixi.rhs! evaluation indeed does not use any Float64 computations in this case 🥳

For completeness, I also get

julia> @perfmon "FLOPS_SP" doit!(du_ode, u_ode, semi, 10^4);

Group: FLOPS_SP
┌──────────────────────────────────────────┬───────────┐
│                                    Event │  Thread 1 │
├──────────────────────────────────────────┼───────────┤
│                        INSTR_RETIRED_ANY │ 1.0181e10 │
│                    CPU_CLK_UNHALTED_CORE │ 3.19003e9 │
│                     CPU_CLK_UNHALTED_REF │ 2.52266e9 │
│ FP_ARITH_INST_RETIRED_128B_PACKED_SINGLE │       0.0 │
│      FP_ARITH_INST_RETIRED_SCALAR_SINGLE │ 1.41313e9 │
│ FP_ARITH_INST_RETIRED_256B_PACKED_SINGLE │       0.0 │
└──────────────────────────────────────────┴───────────┘
┌──────────────────────┬──────────┐
│               Metric │ Thread 1 │
├──────────────────────┼──────────┤
│  Runtime (RDTSC) [s] │ 0.719212 │
│ Runtime unhalted [s] │ 0.863098 │
│          Clock [MHz] │   4673.8 │
│                  CPI │ 0.313331 │
│         SP [MFLOP/s] │  1964.83 │
│     AVX SP [MFLOP/s] │      0.0 │
│     Packed [MUOPS/s] │      0.0 │
│     Scalar [MUOPS/s] │  1964.83 │
│  Vectorization ratio │      0.0 │
└──────────────────────┴──────────┘

Solving without callbacks is also fine (using the time step size roughly given by the stepsize_callback):

julia> @perfmon "FLOPS_DP" solve(ode, CarpenterKennedy2N54(williamson_condition = false),
                                 dt = 5.0f-2, save_everystep = false);

Group: FLOPS_DP
┌──────────────────────────────────────────┬───────────┐
│                                    Event │  Thread 1 │
├──────────────────────────────────────────┼───────────┤
│                        INSTR_RETIRED_ANY │ 1.03995e8 │
│                    CPU_CLK_UNHALTED_CORE │ 3.28811e7 │
│                     CPU_CLK_UNHALTED_REF │ 2.59127e7 │
│ FP_ARITH_INST_RETIRED_128B_PACKED_DOUBLE │       0.0 │
│      FP_ARITH_INST_RETIRED_SCALAR_DOUBLE │       2.0 │
│ FP_ARITH_INST_RETIRED_256B_PACKED_DOUBLE │       0.0 │
└──────────────────────────────────────────┴───────────┘
┌──────────────────────┬─────────────┐
│               Metric │    Thread 1 │
├──────────────────────┼─────────────┤
│  Runtime (RDTSC) [s] │  0.00754828 │
│ Runtime unhalted [s] │  0.00889635 │
│          Clock [MHz] │     4689.96 │
│                  CPI │    0.316179 │
│         DP [MFLOP/s] │ 0.000264961 │
│     AVX DP [MFLOP/s] │         0.0 │
│     Packed [MUOPS/s] │         0.0 │
│     Scalar [MUOPS/s] │ 0.000264961 │
│  Vectorization ratio │         0.0 │
└──────────────────────┴─────────────┘

julia> @perfmon "FLOPS_SP" solve(ode, CarpenterKennedy2N54(williamson_condition = false),
                                 dt = 5.0f-2, save_everystep = false);

Group: FLOPS_SP
┌──────────────────────────────────────────┬───────────┐
│                                    Event │  Thread 1 │
├──────────────────────────────────────────┼───────────┤
│                        INSTR_RETIRED_ANY │ 1.03995e8 │
│                    CPU_CLK_UNHALTED_CORE │ 3.27587e7 │
│                     CPU_CLK_UNHALTED_REF │ 2.58112e7 │
│ FP_ARITH_INST_RETIRED_128B_PACKED_SINGLE │    6608.0 │
│      FP_ARITH_INST_RETIRED_SCALAR_SINGLE │ 1.42731e7 │
│ FP_ARITH_INST_RETIRED_256B_PACKED_SINGLE │  270336.0 │
└──────────────────────────────────────────┴───────────┘
┌──────────────────────┬────────────┐
│               Metric │   Thread 1 │
├──────────────────────┼────────────┤
│  Runtime (RDTSC) [s] │ 0.00739226 │
│ Runtime unhalted [s] │ 0.00886322 │
│          Clock [MHz] │    4690.87 │
│                  CPI │   0.315001 │
│         SP [MFLOP/s] │    2226.95 │
│     AVX SP [MFLOP/s] │    292.561 │
│     Packed [MUOPS/s] │     37.464 │
│     Scalar [MUOPS/s] │    1930.82 │
│  Vectorization ratio │    1.90339 │
└──────────────────────┴────────────┘

The StepsizeCallback is also fine:

julia> @perfmon "FLOPS_DP" solve(ode, CarpenterKennedy2N54(williamson_condition = false),
                                 dt = 5.0f-2, save_everystep = false, callback = stepsize_callback);

Group: FLOPS_DP
┌──────────────────────────────────────────┬───────────┐
│                                    Event │  Thread 1 │
├──────────────────────────────────────────┼───────────┤
│                        INSTR_RETIRED_ANY │ 9.60974e7 │
│                    CPU_CLK_UNHALTED_CORE │ 3.03774e7 │
│                     CPU_CLK_UNHALTED_REF │ 2.33938e7 │
│ FP_ARITH_INST_RETIRED_128B_PACKED_DOUBLE │       0.0 │
│      FP_ARITH_INST_RETIRED_SCALAR_DOUBLE │       0.0 │
│ FP_ARITH_INST_RETIRED_256B_PACKED_DOUBLE │       0.0 │
└──────────────────────────────────────────┴───────────┘
┌──────────────────────┬────────────┐
│               Metric │   Thread 1 │
├──────────────────────┼────────────┤
│  Runtime (RDTSC) [s] │ 0.00644704 │
│ Runtime unhalted [s] │ 0.00821915 │
│          Clock [MHz] │    4799.25 │
│                  CPI │   0.316111 │
│         DP [MFLOP/s] │        0.0 │
│     AVX DP [MFLOP/s] │        0.0 │
│     Packed [MUOPS/s] │        0.0 │
│     Scalar [MUOPS/s] │        0.0 │
│  Vectorization ratio │        NaN │
└──────────────────────┴────────────┘

However, the SummaryCallback uses a tiny amount of Float64 computations:

julia> @perfmon "FLOPS_DP" solve(ode, CarpenterKennedy2N54(williamson_condition = false),
                                 dt = 5.0f-2, save_everystep = false, callback = summary_callback);
[...]
Group: FLOPS_DP
┌──────────────────────────────────────────┬───────────┐
│                                    Event │  Thread 1 │
├──────────────────────────────────────────┼───────────┤
│                        INSTR_RETIRED_ANY │ 1.05099e8 │
│                    CPU_CLK_UNHALTED_CORE │ 3.42389e7 │
│                     CPU_CLK_UNHALTED_REF │ 2.69335e7 │
│ FP_ARITH_INST_RETIRED_128B_PACKED_DOUBLE │       0.0 │
│      FP_ARITH_INST_RETIRED_SCALAR_DOUBLE │       4.0 │
│ FP_ARITH_INST_RETIRED_256B_PACKED_DOUBLE │       0.0 │
└──────────────────────────────────────────┴───────────┘
┌──────────────────────┬─────────────┐
│               Metric │    Thread 1 │
├──────────────────────┼─────────────┤
│  Runtime (RDTSC) [s] │  0.00779737 │
│ Runtime unhalted [s] │  0.00926393 │
│          Clock [MHz] │      4698.4 │
│                  CPI │    0.325776 │
│         DP [MFLOP/s] │ 0.000512994 │
│     AVX DP [MFLOP/s] │         0.0 │
│     Packed [MUOPS/s] │         0.0 │
│     Scalar [MUOPS/s] │ 0.000512994 │
│  Vectorization ratio │         0.0 │
└──────────────────────┴─────────────┘

The AnalysisCallback uses quite a few Float64 computations:

julia> @perfmon "FLOPS_DP" solve(ode, CarpenterKennedy2N54(williamson_condition = false),
                                 dt = 5.0f-2, save_everystep = false, callback = analysis_callback);
[...]
┌──────────────────────────────────────────┬───────────┐
│                                    Event │  Thread 1 │
├──────────────────────────────────────────┼───────────┤
│                        INSTR_RETIRED_ANY │ 1.10713e8 │
│                    CPU_CLK_UNHALTED_CORE │ 3.59729e7 │
│                     CPU_CLK_UNHALTED_REF │ 2.83354e7 │
│ FP_ARITH_INST_RETIRED_128B_PACKED_DOUBLE │      11.0 │
│      FP_ARITH_INST_RETIRED_SCALAR_DOUBLE │  385295.0 │
│ FP_ARITH_INST_RETIRED_256B_PACKED_DOUBLE │       0.0 │
└──────────────────────────────────────────┴───────────┘
┌──────────────────────┬────────────┐
│               Metric │   Thread 1 │
├──────────────────────┼────────────┤
│  Runtime (RDTSC) [s] │  0.0085065 │
│ Runtime unhalted [s] │ 0.00973311 │
│          Clock [MHz] │    4692.13 │
│                  CPI │   0.324922 │
│         DP [MFLOP/s] │    45.2968 │
│     AVX DP [MFLOP/s] │        0.0 │
│     Packed [MUOPS/s] │ 0.00129313 │
│     Scalar [MUOPS/s] │    45.2942 │
│  Vectorization ratio │ 0.00285487 │
└──────────────────────┴────────────┘

The SaveSolutionCallback uses a tiny amount of Float64 computations, too:

julia> @perfmon "FLOPS_DP" solve(ode, CarpenterKennedy2N54(williamson_condition = false),
                                 dt = 5.0f-2, save_everystep = false, callback = save_solution);

Group: FLOPS_DP
┌──────────────────────────────────────────┬───────────┐
│                                    Event │  Thread 1 │
├──────────────────────────────────────────┼───────────┤
│                        INSTR_RETIRED_ANY │ 1.05923e8 │
│                    CPU_CLK_UNHALTED_CORE │ 3.77849e7 │
│                     CPU_CLK_UNHALTED_REF │ 2.97858e7 │
│ FP_ARITH_INST_RETIRED_128B_PACKED_DOUBLE │       0.0 │
│      FP_ARITH_INST_RETIRED_SCALAR_DOUBLE │       8.0 │
│ FP_ARITH_INST_RETIRED_256B_PACKED_DOUBLE │       0.0 │
└──────────────────────────────────────────┴───────────┘
┌──────────────────────┬─────────────┐
│               Metric │    Thread 1 │
├──────────────────────┼─────────────┤
│  Runtime (RDTSC) [s] │  0.00988881 │
│ Runtime unhalted [s] │   0.0102234 │
│          Clock [MHz] │      4688.5 │
│                  CPI │    0.356721 │
│         DP [MFLOP/s] │ 0.000808995 │
│     AVX DP [MFLOP/s] │         0.0 │
│     Packed [MUOPS/s] │         0.0 │
│     Scalar [MUOPS/s] │ 0.000808995 │
│  Vectorization ratio │         0.0 │
└──────────────────────┴─────────────┘

TODO for this PR

Open questions for discussion

  • Shall we have the option to use only Float32 for stuff like the PerformanceCounter?
  • What about callbacks?
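To make the first question concrete, here is a hypothetical sketch (not part of this PR) of a counter that is parametric in its real type, with Float64 kept as the default:

```julia
# Hypothetical sketch: a PerformanceCounter parametrized on its real type,
# so Float32-only runs can avoid the hardcoded Float64 runtime field.
mutable struct PerformanceCounter{RealT <: Real}
    ncalls_since_readout::Int
    runtime::RealT
end

# The default constructor keeps the current Float64 behavior.
PerformanceCounter() = PerformanceCounter{Float64}(0, 0.0)
```

This would avoid the manual patch above while remaining backwards compatible.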

Open questions that will not be resolved here

  • Installing LIKWID is quite tedious. I don't think it's an option with our current CI system. Thus, the nice tests can only be done manually right now.
  • How many examples shall we provide?
  • What shall we test?

Contributor

Review checklist

This checklist is meant to assist creators of PRs (to let them know what reviewers will typically look for) and reviewers (to guide them in a structured review process). Items do not need to be checked explicitly for a PR to be eligible for merging.

Purpose and scope

  • The PR has a single goal that is clear from the PR title and/or description.
  • All code changes represent a single set of modifications that logically belong together.
  • No more than 500 lines of code are changed or there is no obvious way to split the PR into multiple PRs.

Code quality

  • The code can be understood easily.
  • Newly introduced names for variables etc. are self-descriptive and consistent with existing naming conventions.
  • There are no redundancies that can be removed by simple modularization/refactoring.
  • There are no leftover debug statements or commented code sections.
  • The code adheres to our conventions and style guide, and to the Julia guidelines.

Documentation

  • New functions and types are documented with a docstring or top-level comment.
  • Relevant publications are referenced in docstrings (see example for formatting).
  • Inline comments are used to document longer or unusual code sections.
  • Comments describe intent ("why?") and not just functionality ("what?").
  • If the PR introduces a significant change or new feature, it is documented in NEWS.md with its PR number.

Testing

  • The PR passes all tests.
  • New or modified lines of code are covered by tests.
  • New or modified tests run in less than 10 seconds.

Performance

  • There are no type instabilities or memory allocations in performance-critical parts.
  • If the PR intent is to improve performance, before/after time measurements are posted in the PR.

Verification

  • The correctness of the code was verified using appropriate tests.
  • If new equations/methods are added, a convergence test has been run and the results
    are posted in the PR.

Created with ❤️ by the Trixi.jl community.


codecov bot commented Aug 15, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 96.30%. Comparing base (afd8060) to head (6dee45a).
Report is 1 commit behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #2042   +/-   ##
=======================================
  Coverage   96.30%   96.30%           
=======================================
  Files         466      467    +1     
  Lines       37227    37256   +29     
=======================================
+ Hits        35850    35879   +29     
  Misses       1377     1377           
Flag Coverage Δ
unittests 96.30% <100.00%> (+<0.01%) ⬆️



@huiyuxie
Member

Thanks!!

I'm not sure whether using Float32 for meshes and callbacks is meaningful here, but it's important to me, as it would be good to ensure that things can be accelerated on the GPU as much as possible. I will enhance them in the future regardless of whether it's needed here or not.

I have a question about the original purpose of adding Float32 support to this repository. I can see that adding higher-precision numbers (like Float128) helps achieve more precise results, but in the case of Float32, there doesn't seem to be much speed gain on CPU. I can't see any benefit other than providing support for GPU execution. @ranocha

@ranocha
Member Author

ranocha commented Aug 16, 2024

I have a question about the original purpose of adding Float32 support to this repository. I can see that adding higher-precision numbers (like Float128) helps achieve more precise results, but in the case of Float32, there doesn't seem to be much speed gain on CPU. I can't see any benefit other than providing support for GPU execution.

That's indeed the main motivation.

@ranocha
Member Author

ranocha commented Aug 16, 2024

julia> redirect_stdout(devnull) do 
           trixi_include("examples/unstructured_2d_dgsem/elixir_shallowwater_ec_float32.jl")
       end
[ Info: You just called `trixi_include`. Julia may now compile the code, please be patient.

julia> u_ode = copy(sol.u[end]); du_ode = similar(u_ode);

julia> @perfmon "FLOPS_DP" Trixi.rhs!(du_ode, u_ode, semi, 0.0f0);

Group: FLOPS_DP
┌──────────────────────────────────────────┬──────────┐
│                                    Event │ Thread 1 │
├──────────────────────────────────────────┼──────────┤
│                        INSTR_RETIRED_ANY │ 868636.0 │
│                    CPU_CLK_UNHALTED_CORE │ 373225.0 │
│                     CPU_CLK_UNHALTED_REF │ 287364.0 │
│ FP_ARITH_INST_RETIRED_128B_PACKED_DOUBLE │      0.0 │
│      FP_ARITH_INST_RETIRED_SCALAR_DOUBLE │      0.0 │
│ FP_ARITH_INST_RETIRED_256B_PACKED_DOUBLE │      0.0 │
└──────────────────────────────────────────┴──────────┘
┌──────────────────────┬─────────────┐
│               Metric │    Thread 1 │
├──────────────────────┼─────────────┤
│  Runtime (RDTSC) [s] │  9.39318e-5 │
│ Runtime unhalted [s] │ 0.000100979 │
│          Clock [MHz] │     4800.41 │
│                  CPI │    0.429668 │
│         DP [MFLOP/s] │         0.0 │
│     AVX DP [MFLOP/s] │         0.0 │
│     Packed [MUOPS/s] │         0.0 │
│     Scalar [MUOPS/s] │         0.0 │
│  Vectorization ratio │         NaN │
└──────────────────────┴─────────────┘

@ranocha
Member Author

ranocha commented Aug 16, 2024

julia> trixi_include("examples/p4est_3d_dgsem/elixir_euler_free_stream_boundaries_float32.jl")

julia> u_ode = copy(sol.u[end]); du_ode = similar(u_ode);

julia> @perfmon "FLOPS_DP" Trixi.rhs!(du_ode, u_ode, semi, 0.0f0);

Group: FLOPS_DP
┌──────────────────────────────────────────┬───────────┐
│                                    Event │  Thread 1 │
├──────────────────────────────────────────┼───────────┤
│                        INSTR_RETIRED_ANY │ 4.45956e6 │
│                    CPU_CLK_UNHALTED_CORE │ 1.91896e6 │
│                     CPU_CLK_UNHALTED_REF │  1.5092e6 │
│ FP_ARITH_INST_RETIRED_128B_PACKED_DOUBLE │       0.0 │
│      FP_ARITH_INST_RETIRED_SCALAR_DOUBLE │       0.0 │
│ FP_ARITH_INST_RETIRED_256B_PACKED_DOUBLE │       0.0 │
└──────────────────────────────────────────┴───────────┘
┌──────────────────────┬─────────────┐
│               Metric │    Thread 1 │
├──────────────────────┼─────────────┤
│  Runtime (RDTSC) [s] │ 0.000478556 │
│ Runtime unhalted [s] │ 0.000519195 │
│          Clock [MHz] │     4699.53 │
│                  CPI │    0.430302 │
│         DP [MFLOP/s] │         0.0 │
│     AVX DP [MFLOP/s] │         0.0 │
│     Packed [MUOPS/s] │         0.0 │
│     Scalar [MUOPS/s] │         0.0 │
│  Vectorization ratio │         NaN │
└──────────────────────┴─────────────┘

@ranocha ranocha changed the title WIP: computations using only Float32 and no Float64 computations using only Float32 and no Float64 Aug 16, 2024
@ranocha
Member Author

ranocha commented Aug 16, 2024

FDSBP upwind works, too:

julia> trixi_include("examples/unstructured_2d_fdsbp/elixir_euler_free_stream_upwind_float32.jl", tspan = (0.0f0, 0.1f0))

julia> u_ode = copy(sol.u[end]); du_ode = similar(u_ode);

julia> @perfmon "FLOPS_DP" Trixi.rhs!(du_ode, u_ode, semi, 0.0f0);

Group: FLOPS_DP
┌──────────────────────────────────────────┬───────────┐
│                                    Event │  Thread 1 │
├──────────────────────────────────────────┼───────────┤
│                        INSTR_RETIRED_ANY │ 2.59028e7 │
│                    CPU_CLK_UNHALTED_CORE │ 1.44735e7 │
│                     CPU_CLK_UNHALTED_REF │  1.1153e7 │
│ FP_ARITH_INST_RETIRED_128B_PACKED_DOUBLE │       0.0 │
│      FP_ARITH_INST_RETIRED_SCALAR_DOUBLE │       0.0 │
│ FP_ARITH_INST_RETIRED_256B_PACKED_DOUBLE │       0.0 │
└──────────────────────────────────────────┴───────────┘
┌──────────────────────┬────────────┐
│               Metric │   Thread 1 │
├──────────────────────┼────────────┤
│  Runtime (RDTSC) [s] │  0.0033605 │
│ Runtime unhalted [s] │ 0.00391595 │
│          Clock [MHz] │    4796.45 │
│                  CPI │   0.558763 │
│         DP [MFLOP/s] │        0.0 │
│     AVX DP [MFLOP/s] │        0.0 │
│     Packed [MUOPS/s] │        0.0 │
│     Scalar [MUOPS/s] │        0.0 │
│  Vectorization ratio │        NaN │
└──────────────────────┴────────────┘

@ranocha ranocha marked this pull request as ready for review August 16, 2024 13:20
@huiyuxie
Member

Solving without callbacks is also fine

julia> @perfmon "FLOPS_DP" solve(ode, CarpenterKennedy2N54(williamson_condition = false),
                                 dt = 5.0f-2, save_everystep = false);

But the result shows that Float64 operations actually occurred during the process. Could you please check again and see where they come from?

Also, you are all referring to type stability during the call of solve, but I also care about type stability during mesh initialization (and callback initialization, if applicable). For example, there is much room for speedup in StructuredMesh if you look into https://github.com/trixi-framework/Trixi.jl/blob/main/src/meshes/structured_mesh.jl (even a small speedup is meaningful for me). Could you please update the description here and xref to clarify that it only refers to type stability during the solving process? This is different from general type stability, as we also have some hardcoded Float64 for StructuredMesh, like here:

# Interpolate linearly between left and right value where s should be between -1 and 1
function linear_interpolate(s, left_value, right_value)
    0.5 * ((1 - s) * left_value + (1 + s) * right_value)
end
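For illustration, a sketch of how such a literal can be made type-friendly via the 0.5f0 convention: a Float32 literal still promotes to Float64 with Float64 inputs, but keeps Float32 inputs in Float32.

```julia
# Sketch (following the 0.5 -> 0.5f0 convention): the Float32 literal keeps
# Float32 inputs in Float32, while Float64 inputs still promote the result
# to Float64.
function linear_interpolate(s, left_value, right_value)
    return 0.5f0 * ((1 - s) * left_value + (1 + s) * right_value)
end
```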

What I plan to do is to perform mesh initialization on the GPU and keep the data on the GPU to avoid the overhead of data transfer between the CPU and GPU. I can handle the type stability during initialization myself. @ranocha

@ranocha
Member Author

ranocha commented Aug 17, 2024

But the result shows that Float64 operations actually occurred during the process. Could you please check again and see where they come from?

I need to get access to the system again, but it may be that I was running the code without the PerformanceCounter fix (since I accidentally used git commit -a and then needed to revert it again, maybe I forgot something 🤷). But it's just a tiny amount:

julia> 0.000264961 * 0.00754828 # MFLOP/s * time in seconds
1.99999981708e-6

So that's like 2 FLOPs.
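For reference, the conversion used here can be written as a tiny helper (flop_count is a hypothetical name, not a Trixi function):

```julia
# MFLOP/s times runtime in seconds gives MFLOP; multiply by 1e6 to get the
# total FLOP count of the measured region.
flop_count(mflop_per_s, runtime_s) = round(Int, mflop_per_s * 1.0e6 * runtime_s)
```

Applied to the numbers from the tables above, this recovers the 2 FLOPs of the run without callbacks.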

@ranocha
Member Author

ranocha commented Aug 17, 2024

Also, you are all referring to type stability during the call of solve,

It's even less than that - what I have measured here is whether there are Float64 computations in rhs! or solve. There could still be type instabilities somewhere, but that's not accounted for; it is only tested implicitly via our allocation tests.
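A minimal illustration of this distinction (hypothetical functions, not from Trixi): both functions below are type-stable, but only the first performs Float64 arithmetic internally, which the FLOPS_DP hardware counters would catch while allocation tests would not.

```julia
# Both functions are type-stable (each always returns Float32), so allocation
# tests pass for either. But `f` multiplies in Float64 internally before
# converting back, while `g` stays in Float32 throughout.
f(x::Float32) = Float32(2.0 * x)   # hidden Float64 multiply
g(x::Float32) = 2.0f0 * x          # pure Float32 arithmetic
```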

@ranocha
Member Author

ranocha commented Aug 17, 2024

I also care about type stability during mesh initialization (and callback initialization if applicable).

That's a good point 👍

For example, there are much space for speedup for StructuredMesh if you look into https://github.com/trixi-framework/Trixi.jl/blob/main/src/meshes/structured_mesh.jl (even small speedup is also meaningful for me). Could you please update the description here and xref to clarify that it only refers to type stability during the solving process?

Done

@ranocha
Member Author

ranocha commented Aug 17, 2024

What I plan to do is to perform mesh initialization on the GPU and keep the data on the GPU to avoid the overhead of data transfer between the CPU and GPU. I can handle the type stability during initialization myself.

Do you plan to do this in your repo or also with PRs to this repo, e.g., to fix the problem you pointed at above?

@huiyuxie
Member

So that's like 2 FLOPs.

That's fair. Then for SummaryCallback we get around 4 FLOPs.

julia> 0.000512994 * 0.00779737
4.00000402578e-6

And for SaveSolutionCallback we get around 8 FLOPs.

julia> 0.000808995 * 0.00988881
7.99999784595e-6

It does not seem reasonable to me to conclude that SummaryCallback and SaveSolutionCallback are not fine while saying that solving without callbacks is fine, as I cannot see a big difference between 2, 4, and 8 FLOPs.

Installing LIKWID is quite tedious. I don't think it's an option with our current CI system. Thus, the nice tests can only be done manually right now.

I'm more than curious about this question. Do you have any good ideas (or have you seen approaches in other repos) for designing thorough tests for this, apart from the memory allocation tests you are currently using?

Do you plan to do this in your repo or also with PRs to this repo, e.g., to fix the problem you pointed at above?

Certainly. But I'm not sure whether to implement it in my repo or in this repo. Do you see any benefit in making Float32 work for mesh initialization in your repo? If so, I can start by trying the improvement in your repo and then see how it could be better implemented in mine.

@ranocha

@ranocha
Member Author

ranocha commented Aug 20, 2024

Installing LIKWID is quite tedious. I don't think it's an option with our current CI system. Thus, the nice tests can only be done manually right now.

I'm more than curious about this question. Do you have any good ideas (or see from other repos) for designing thorough tests for it, except for the memory allocation tests you are currently using?

No

Do you plan to do this in your repo or also with PRs to this repo, e.g., to fix the problem you pointed at above?

Certainly. But I'm not sure whether to implement it in my repo or in this repo. Could you see any benefits in making Float32 work for mesh initialization in your repo? If so, I can start by trying the improvement in your repo, and then see how it could be better implemented in my repo.

Well, if it's just fixing the 0.5 coefficients to be 0.5f0, it would be nice to have it in the main repo to make it available for everybody.

@huiyuxie
Member

It's not only about 0.5 and 0.5f0. Some meshes are hardcoded to Float64 (as you already mentioned). Are you planning to fix the entire issue yourself, or just to focus on the 0.5 vs. 0.5f0 aspect? I think tests are also needed, just like for the equations.

@ranocha
Member Author

ranocha commented Aug 21, 2024

This PR is big enough as it is. I will not work on the other mesh types immediately. Adding tests for complete RHS evaluations etc. is quite difficult, as I have described above - even when looking only at something like flux_differencing_kernel!, I don't see any simple way to verify that we use only a certain type of floating point numbers with our current CI setup.

Member

@andrewwinters5000 andrewwinters5000 left a comment


Everything looks good as far as the changes go. Just a few questions and one suggestion to simplify the new elixir.

One question looking toward the future regarding, e.g., ln_mean: for a literal that cannot be exactly represented in Float32, would we change a line like

        return (x + y) / @evalpoly(f2, 2, 2/3, 2/5, 2/7)

to be

        RealT = eltype(x)
        return (x + y) / @evalpoly(f2, 2, convert(RealT, 2/3), convert(RealT, 2/5), convert(RealT, 2/7))

test/test_p4est_3d.jl (review thread resolved)
test/test_unstructured_2d.jl (review thread resolved)
Member

@andrewwinters5000 andrewwinters5000 left a comment


LGTM!

@ranocha
Member Author

ranocha commented Aug 21, 2024

One question looking toward the future regarding, e.g., ln_mean: for a literal that cannot be exactly represented in Float32, would we change a line like

        return (x + y) / @evalpoly(f2, 2, 2/3, 2/5, 2/7)

to be

        RealT = eltype(x)
        return (x + y) / @evalpoly(f2, 2, convert(RealT, 2/3), convert(RealT, 2/5), convert(RealT, 2/7))

That's a good question. I am not sure how exactly to handle the ln_mean. There are several options:

  1. convert the constants to RealT as you suggested above (with RealT = typeof(x)). However, this means we will also use dual numbers when running AD, which might not be what we actually want. Nevertheless, this is what we use right now in most cases.
  2. Extract the underlying basic real type - not only here but also in other places. This could be something along the lines of typeof(ForwardDiff.value(real(x))) - which should get a better name if we want to use this pattern.
  3. Add new methods dispatching on the types of x, y. This would also give us the flexibility to change the underlying approximation in case we find something more efficient but similarly accurate for Float32. However, we have to decide how to handle AD in this case since we will get a problem when running AD in Float32 if we just dispatch on ::Float32.

Currently, I would likely opt for variant 1, but maybe with a slight change, since I am not sure whether it's better to use convert(RealT, 2 / 3) or convert(RealT, 2) / 3. A first test does not suggest a difference in Julia 1.10:

julia> function foo1(x, y)
           s = x + y
           return (x + y) / @evalpoly(s, 2, 2 / 3, 2 / 5, 2 / 7)
       end
foo1 (generic function with 1 method)

julia> function foo2(x, y)
           s = x + y
           RealT = typeof(s)
           return (x + y) / @evalpoly(s, 2, convert(RealT, 2 / 3), 
                                            convert(RealT, 2 / 5), 
                                            convert(RealT, 2 / 7))
       end
foo2 (generic function with 1 method)

julia> function foo3(x, y)
           s = x + y
           RealT = typeof(s)
           return (x + y) / @evalpoly(s, 2, convert(RealT, 2) / 3, 
                                            convert(RealT, 2) / 5, 
                                            convert(RealT, 2) / 7)
       end
foo3 (generic function with 1 method)

julia> foo1(1.0, 1.0)
0.2770448548812665

julia> foo2(1.0, 1.0)
0.2770448548812665

julia> foo3(1.0, 1.0)
0.2770448548812665

julia> foo1(1.0f0, 1.0f0)
0.2770448548812665

julia> foo2(1.0f0, 1.0f0)
0.27704483f0

julia> foo3(1.0f0, 1.0f0)
0.27704483f0

julia> @code_llvm foo1(1.0, 1.0)
;  @ REPL[9]:1 within `foo1`
define double @julia_foo1_637(double %0, double %1) #0 {
top:
;  @ REPL[9]:2 within `foo1`
; ┌ @ float.jl:409 within `+`
   %2 = fadd double %0, %1
; └
;  @ REPL[9]:3 within `foo1`
; ┌ @ math.jl:186 within `evalpoly`
; │┌ @ math.jl:187 within `macro expansion`
; ││┌ @ float.jl:414 within `muladd`
     %3 = fmul contract double %2, 0x3FD2492492492492
     %4 = fadd contract double %3, 4.000000e-01
     %5 = fmul contract double %2, %4
     %6 = fadd contract double %5, 0x3FE5555555555555
; │││ @ promotion.jl:459 within `muladd` @ float.jl:414
     %7 = fmul contract double %2, %6
     %8 = fadd contract double %7, 2.000000e+00
; └└└
; ┌ @ float.jl:412 within `/`
   %9 = fdiv double %2, %8
; └
  ret double %9
}

julia> @code_llvm foo2(1.0, 1.0)
;  @ REPL[11]:1 within `foo2`
define double @julia_foo2_648(double %0, double %1) #0 {
top:
;  @ REPL[11]:2 within `foo2`
; ┌ @ float.jl:409 within `+`
   %2 = fadd double %0, %1
; └
;  @ REPL[11]:4 within `foo2`
; ┌ @ math.jl:186 within `evalpoly`
; │┌ @ math.jl:187 within `macro expansion`
; ││┌ @ float.jl:414 within `muladd`
     %3 = fmul contract double %2, 0x3FD2492492492492
     %4 = fadd contract double %3, 4.000000e-01
     %5 = fmul contract double %2, %4
     %6 = fadd contract double %5, 0x3FE5555555555555
; │││ @ promotion.jl:459 within `muladd` @ float.jl:414
     %7 = fmul contract double %2, %6
     %8 = fadd contract double %7, 2.000000e+00
; └└└
; ┌ @ float.jl:412 within `/`
   %9 = fdiv double %2, %8
; └
  ret double %9
}

julia> @code_llvm foo3(1.0, 1.0)
;  @ REPL[12]:1 within `foo3`
define double @julia_foo3_650(double %0, double %1) #0 {
top:
;  @ REPL[12]:2 within `foo3`
; ┌ @ float.jl:409 within `+`
   %2 = fadd double %0, %1
; └
;  @ REPL[12]:4 within `foo3`
; ┌ @ math.jl:186 within `evalpoly`
; │┌ @ math.jl:187 within `macro expansion`
; ││┌ @ float.jl:414 within `muladd`
     %3 = fmul contract double %2, 0x3FD2492492492492
     %4 = fadd contract double %3, 4.000000e-01
     %5 = fmul contract double %2, %4
     %6 = fadd contract double %5, 0x3FE5555555555555
; │││ @ promotion.jl:459 within `muladd` @ float.jl:414
     %7 = fmul contract double %2, %6
     %8 = fadd contract double %7, 2.000000e+00
; └└└
; ┌ @ float.jl:412 within `/`
   %9 = fdiv double %2, %8
; └
  ret double %9
}

julia> @code_llvm foo1(1.0f0, 1.0f0)
;  @ REPL[9]:1 within `foo1`
define double @julia_foo1_652(float %0, float %1) #0 {
top:
;  @ REPL[9]:2 within `foo1`
; ┌ @ float.jl:409 within `+`
   %2 = fadd float %0, %1
; └
;  @ REPL[9]:3 within `foo1`
; ┌ @ math.jl:186 within `evalpoly`
; │┌ @ math.jl:187 within `macro expansion`
; ││┌ @ promotion.jl:459 within `muladd`
; │││┌ @ promotion.jl:399 within `promote`
; ││││┌ @ promotion.jl:377 within `_promote`
; │││││┌ @ number.jl:7 within `convert`
; ││││││┌ @ float.jl:261 within `Float64`
         %3 = fpext float %2 to double
; │││└└└└
; │││ @ promotion.jl:459 within `muladd` @ float.jl:414
     %4 = fmul contract double %3, 0x3FD2492492492492
     %5 = fadd contract double %4, 4.000000e-01
     %6 = fmul contract double %5, %3
     %7 = fadd contract double %6, 0x3FE5555555555555
     %8 = fmul contract double %7, %3
     %9 = fadd contract double %8, 2.000000e+00
; └└└
; ┌ @ promotion.jl:425 within `/` @ float.jl:412
   %10 = fdiv double %3, %9
; └
  ret double %10
}

julia> @code_llvm foo2(1.0f0, 1.0f0)
;  @ REPL[11]:1 within `foo2`
define float @julia_foo2_654(float %0, float %1) #0 {
top:
;  @ REPL[11]:2 within `foo2`
; ┌ @ float.jl:409 within `+`
   %2 = fadd float %0, %1
; └
;  @ REPL[11]:4 within `foo2`
; ┌ @ math.jl:186 within `evalpoly`
; │┌ @ math.jl:187 within `macro expansion`
; ││┌ @ float.jl:414 within `muladd`
     %3 = fmul contract float %2, 0x3FD24924A0000000
     %4 = fadd contract float %3, 0x3FD99999A0000000
     %5 = fmul contract float %2, %4
     %6 = fadd contract float %5, 0x3FE5555560000000
; │││ @ promotion.jl:459 within `muladd` @ float.jl:414
     %7 = fmul contract float %2, %6
     %8 = fadd contract float %7, 2.000000e+00
; └└└
; ┌ @ float.jl:412 within `/`
   %9 = fdiv float %2, %8
; └
  ret float %9
}

julia> @code_llvm foo3(1.0f0, 1.0f0)
;  @ REPL[12]:1 within `foo3`
define float @julia_foo3_656(float %0, float %1) #0 {
top:
;  @ REPL[12]:2 within `foo3`
; ┌ @ float.jl:409 within `+`
   %2 = fadd float %0, %1
; └
;  @ REPL[12]:4 within `foo3`
; ┌ @ math.jl:186 within `evalpoly`
; │┌ @ math.jl:187 within `macro expansion`
; ││┌ @ float.jl:414 within `muladd`
     %3 = fmul contract float %2, 0x3FD24924A0000000
     %4 = fadd contract float %3, 0x3FD99999A0000000
     %5 = fmul contract float %2, %4
     %6 = fadd contract float %5, 0x3FE5555560000000
; │││ @ promotion.jl:459 within `muladd` @ float.jl:414
     %7 = fmul contract float %2, %6
     %8 = fadd contract float %7, 2.000000e+00
; └└└
; ┌ @ float.jl:412 within `/`
   %9 = fdiv float %2, %8
; └
  ret float %9
}
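To summarize the LLVM output above: foo1 silently promotes Float32 inputs to Float64 because the literal coefficients 2 / 3 etc. are Float64 constants, while the convert-based variants stay entirely in Float32. This can also be checked programmatically; a small sketch (using the evalpoly function instead of the macro, with the same bodies as in the transcript):

```julia
# Sketch mirroring foo1/foo3 from the transcript above
# (evalpoly function instead of the @evalpoly macro).
function foo1(x, y)
    s = x + y
    return (x + y) / evalpoly(s, (2, 2 / 3, 2 / 5, 2 / 7))
end

function foo3(x, y)
    s = x + y
    RealT = typeof(s)
    return (x + y) / evalpoly(s, (2, convert(RealT, 2) / 3,
                                     convert(RealT, 2) / 5,
                                     convert(RealT, 2) / 7))
end

# foo1 promotes to Float64 via the Float64 coefficients ...
@assert foo1(1.0f0, 1.0f0) isa Float64
# ... while foo3 keeps the whole computation in Float32
@assert foo3(1.0f0, 1.0f0) isa Float32
```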

andrewwinters5000 (Member) commented Aug 21, 2024

> Currently, I would likely opt for variant 1 but maybe with a slight change since I am not sure whether it's better to use convert(RealT, 2 / 3) or convert(RealT, 2) / 3. A first test does not suggest a difference in Julia 1.10:

I would also lean toward option 1, as it touches the least code and (hopefully) does not cause too many headaches with AD. Just to double-check that I understand the type promotion:

  • Computing convert(RealT, 2 / 3): the Int / Int division yields a Float64 value, which is then converted to RealT.
  • Computing convert(RealT, 2) / 3: the Int is converted to RealT first, and the division RealT / Int then promotes to the "most accurate" type, i.e., RealT.

ranocha (Member, Author) commented Aug 21, 2024

> > Currently, I would likely opt for variant 1 but maybe with a slight change since I am not sure whether it's better to use convert(RealT, 2 / 3) or convert(RealT, 2) / 3. A first test does not suggest a difference in Julia 1.10:
>
> I would also lean toward option 1, as it touches the least code and (hopefully) does not cause too many headaches with AD. Just to double-check that I understand the type promotion:
>
>   • Computing convert(RealT, 2 / 3): the Int / Int division yields a Float64 value, which is then converted to RealT.
>   • Computing convert(RealT, 2) / 3: the Int is converted to RealT first, and the division RealT / Int then promotes to the "most accurate" type, i.e., RealT.

Yes, exactly 👍
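The equivalence of the two conversion orders for these particular coefficients can be spot-checked directly; a quick sketch, assuming RealT = Float32:

```julia
RealT = Float32
# Variant 1: the Int / Int division happens in Float64 first,
# and the result is then rounded to Float32
a = convert(RealT, 2 / 3)
# Variant 2: the Int is converted to Float32 first,
# so the division itself happens in Float32
b = convert(RealT, 2) / 3
@assert a isa Float32 && b isa Float32
@assert a == b  # no observable difference for this coefficient
```

In principle the two variants could differ by one ulp due to double rounding in variant 1, but for the coefficients used here both yield the same Float32 value, consistent with foo2 and foo3 returning identical results in the transcript above.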

@ranocha ranocha merged commit 20ab97b into main Aug 21, 2024
38 checks passed
@ranocha ranocha deleted the hr/float32 branch August 21, 2024 07:51