
computations using only Float32 and no Float64 #2042

Merged
merged 21 commits into from
Aug 21, 2024

Conversation

ranocha
Member

@ranocha ranocha commented Aug 15, 2024

Due to the great work of @huiyuxie, we're well on our way to supporting computations purely in Float32. However, checking that we indeed use only Float32 and no Float64 everywhere is not trivial, as far as I know. This may also be interesting for @patrickersing.

The best option that comes to my mind is LIKWID.jl. After installing it (Linux only!) and applying the additional patch

--- a/src/auxiliary/auxiliary.jl
+++ b/src/auxiliary/auxiliary.jl
@@ -21,7 +21,7 @@ runtime of all measurements added so far via `take!(counter)`, resetting the
 """
 mutable struct PerformanceCounter
     ncalls_since_readout::Int
-    runtime::Float64
+    runtime::Float32
 end

I get the following results:

julia> using LIKWID, Revise; using Trixi
Precompiling Trixi

julia> redirect_stdout(devnull) do 
           trixi_include("examples/structured_2d_dgsem/elixir_advection_float32.jl")
       end

julia> u_ode = copy(sol.u[end]); du_ode = similar(u_ode);

julia> function doit!(du_ode, u_ode, semi, n)
           for _ in 1:n
               Trixi.rhs!(du_ode, u_ode, semi, zero(eltype(u_ode)));
           end
       end

julia> doit!(du_ode, u_ode, semi, 1) # compile

julia> @perfmon "FLOPS_DP" doit!(du_ode, u_ode, semi, 10^4);

Group: FLOPS_DP
┌──────────────────────────────────────────┬───────────┐
│                                    Event │  Thread 1 │
├──────────────────────────────────────────┼───────────┤
│                        INSTR_RETIRED_ANY │ 1.0181e10 │
│                    CPU_CLK_UNHALTED_CORE │ 3.17159e9 │
│                     CPU_CLK_UNHALTED_REF │  2.5141e9 │
│ FP_ARITH_INST_RETIRED_128B_PACKED_DOUBLE │       0.0 │
│      FP_ARITH_INST_RETIRED_SCALAR_DOUBLE │       0.0 │
│ FP_ARITH_INST_RETIRED_256B_PACKED_DOUBLE │       0.0 │
└──────────────────────────────────────────┴───────────┘
┌──────────────────────┬──────────┐
│               Metric │ Thread 1 │
├──────────────────────┼──────────┤
│  Runtime (RDTSC) [s] │  0.71929 │
│ Runtime unhalted [s] │ 0.858109 │
│          Clock [MHz] │  4662.61 │
│                  CPI │  0.31152 │
│         DP [MFLOP/s] │      0.0 │
│     AVX DP [MFLOP/s] │      0.0 │
│     Packed [MUOPS/s] │      0.0 │
│     Scalar [MUOPS/s] │      0.0 │
│  Vectorization ratio │      NaN │
└──────────────────────┴──────────┘

This confirms that our Trixi.rhs! evaluation indeed does not use any Float64 computations in this case 🥳

For completeness, I also get

julia> @perfmon "FLOPS_SP" doit!(du_ode, u_ode, semi, 10^4);

Group: FLOPS_SP
┌──────────────────────────────────────────┬───────────┐
│                                    Event │  Thread 1 │
├──────────────────────────────────────────┼───────────┤
│                        INSTR_RETIRED_ANY │ 1.0181e10 │
│                    CPU_CLK_UNHALTED_CORE │ 3.19003e9 │
│                     CPU_CLK_UNHALTED_REF │ 2.52266e9 │
│ FP_ARITH_INST_RETIRED_128B_PACKED_SINGLE │       0.0 │
│      FP_ARITH_INST_RETIRED_SCALAR_SINGLE │ 1.41313e9 │
│ FP_ARITH_INST_RETIRED_256B_PACKED_SINGLE │       0.0 │
└──────────────────────────────────────────┴───────────┘
┌──────────────────────┬──────────┐
│               Metric │ Thread 1 │
├──────────────────────┼──────────┤
│  Runtime (RDTSC) [s] │ 0.719212 │
│ Runtime unhalted [s] │ 0.863098 │
│          Clock [MHz] │   4673.8 │
│                  CPI │ 0.313331 │
│         SP [MFLOP/s] │  1964.83 │
│     AVX SP [MFLOP/s] │      0.0 │
│     Packed [MUOPS/s] │      0.0 │
│     Scalar [MUOPS/s] │  1964.83 │
│  Vectorization ratio │      0.0 │
└──────────────────────┴──────────┘

Solving without callbacks is also fine (using the time step size roughly given by the stepsize_callback):

julia> @perfmon "FLOPS_DP" solve(ode, CarpenterKennedy2N54(williamson_condition = false),
                                 dt = 5.0f-2, save_everystep = false);

Group: FLOPS_DP
┌──────────────────────────────────────────┬───────────┐
│                                    Event │  Thread 1 │
├──────────────────────────────────────────┼───────────┤
│                        INSTR_RETIRED_ANY │ 1.03995e8 │
│                    CPU_CLK_UNHALTED_CORE │ 3.28811e7 │
│                     CPU_CLK_UNHALTED_REF │ 2.59127e7 │
│ FP_ARITH_INST_RETIRED_128B_PACKED_DOUBLE │       0.0 │
│      FP_ARITH_INST_RETIRED_SCALAR_DOUBLE │       2.0 │
│ FP_ARITH_INST_RETIRED_256B_PACKED_DOUBLE │       0.0 │
└──────────────────────────────────────────┴───────────┘
┌──────────────────────┬─────────────┐
│               Metric │    Thread 1 │
├──────────────────────┼─────────────┤
│  Runtime (RDTSC) [s] │  0.00754828 │
│ Runtime unhalted [s] │  0.00889635 │
│          Clock [MHz] │     4689.96 │
│                  CPI │    0.316179 │
│         DP [MFLOP/s] │ 0.000264961 │
│     AVX DP [MFLOP/s] │         0.0 │
│     Packed [MUOPS/s] │         0.0 │
│     Scalar [MUOPS/s] │ 0.000264961 │
│  Vectorization ratio │         0.0 │
└──────────────────────┴─────────────┘

julia> @perfmon "FLOPS_SP" solve(ode, CarpenterKennedy2N54(williamson_condition = false),
                                 dt = 5.0f-2, save_everystep = false);

Group: FLOPS_SP
┌──────────────────────────────────────────┬───────────┐
│                                    Event │  Thread 1 │
├──────────────────────────────────────────┼───────────┤
│                        INSTR_RETIRED_ANY │ 1.03995e8 │
│                    CPU_CLK_UNHALTED_CORE │ 3.27587e7 │
│                     CPU_CLK_UNHALTED_REF │ 2.58112e7 │
│ FP_ARITH_INST_RETIRED_128B_PACKED_SINGLE │    6608.0 │
│      FP_ARITH_INST_RETIRED_SCALAR_SINGLE │ 1.42731e7 │
│ FP_ARITH_INST_RETIRED_256B_PACKED_SINGLE │  270336.0 │
└──────────────────────────────────────────┴───────────┘
┌──────────────────────┬────────────┐
│               Metric │   Thread 1 │
├──────────────────────┼────────────┤
│  Runtime (RDTSC) [s] │ 0.00739226 │
│ Runtime unhalted [s] │ 0.00886322 │
│          Clock [MHz] │    4690.87 │
│                  CPI │   0.315001 │
│         SP [MFLOP/s] │    2226.95 │
│     AVX SP [MFLOP/s] │    292.561 │
│     Packed [MUOPS/s] │     37.464 │
│     Scalar [MUOPS/s] │    1930.82 │
│  Vectorization ratio │    1.90339 │
└──────────────────────┴────────────┘

The StepsizeCallback is also fine:

julia> @perfmon "FLOPS_DP" solve(ode, CarpenterKennedy2N54(williamson_condition = false),
                                 dt = 5.0f-2, save_everystep = false, callback = stepsize_callback);

Group: FLOPS_DP
┌──────────────────────────────────────────┬───────────┐
│                                    Event │  Thread 1 │
├──────────────────────────────────────────┼───────────┤
│                        INSTR_RETIRED_ANY │ 9.60974e7 │
│                    CPU_CLK_UNHALTED_CORE │ 3.03774e7 │
│                     CPU_CLK_UNHALTED_REF │ 2.33938e7 │
│ FP_ARITH_INST_RETIRED_128B_PACKED_DOUBLE │       0.0 │
│      FP_ARITH_INST_RETIRED_SCALAR_DOUBLE │       0.0 │
│ FP_ARITH_INST_RETIRED_256B_PACKED_DOUBLE │       0.0 │
└──────────────────────────────────────────┴───────────┘
┌──────────────────────┬────────────┐
│               Metric │   Thread 1 │
├──────────────────────┼────────────┤
│  Runtime (RDTSC) [s] │ 0.00644704 │
│ Runtime unhalted [s] │ 0.00821915 │
│          Clock [MHz] │    4799.25 │
│                  CPI │   0.316111 │
│         DP [MFLOP/s] │        0.0 │
│     AVX DP [MFLOP/s] │        0.0 │
│     Packed [MUOPS/s] │        0.0 │
│     Scalar [MUOPS/s] │        0.0 │
│  Vectorization ratio │        NaN │
└──────────────────────┴────────────┘

However, the SummaryCallback uses a tiny amount of Float64 computations:

julia> @perfmon "FLOPS_DP" solve(ode, CarpenterKennedy2N54(williamson_condition = false),
                                 dt = 5.0f-2, save_everystep = false, callback = summary_callback);
[...]
Group: FLOPS_DP
┌──────────────────────────────────────────┬───────────┐
│                                    Event │  Thread 1 │
├──────────────────────────────────────────┼───────────┤
│                        INSTR_RETIRED_ANY │ 1.05099e8 │
│                    CPU_CLK_UNHALTED_CORE │ 3.42389e7 │
│                     CPU_CLK_UNHALTED_REF │ 2.69335e7 │
│ FP_ARITH_INST_RETIRED_128B_PACKED_DOUBLE │       0.0 │
│      FP_ARITH_INST_RETIRED_SCALAR_DOUBLE │       4.0 │
│ FP_ARITH_INST_RETIRED_256B_PACKED_DOUBLE │       0.0 │
└──────────────────────────────────────────┴───────────┘
┌──────────────────────┬─────────────┐
│               Metric │    Thread 1 │
├──────────────────────┼─────────────┤
│  Runtime (RDTSC) [s] │  0.00779737 │
│ Runtime unhalted [s] │  0.00926393 │
│          Clock [MHz] │      4698.4 │
│                  CPI │    0.325776 │
│         DP [MFLOP/s] │ 0.000512994 │
│     AVX DP [MFLOP/s] │         0.0 │
│     Packed [MUOPS/s] │         0.0 │
│     Scalar [MUOPS/s] │ 0.000512994 │
│  Vectorization ratio │         0.0 │
└──────────────────────┴─────────────┘

The AnalysisCallback uses quite a few Float64 computations:

julia> @perfmon "FLOPS_DP" solve(ode, CarpenterKennedy2N54(williamson_condition = false),
                                 dt = 5.0f-2, save_everystep = false, callback = analysis_callback);
[...]
┌──────────────────────────────────────────┬───────────┐
│                                    Event │  Thread 1 │
├──────────────────────────────────────────┼───────────┤
│                        INSTR_RETIRED_ANY │ 1.10713e8 │
│                    CPU_CLK_UNHALTED_CORE │ 3.59729e7 │
│                     CPU_CLK_UNHALTED_REF │ 2.83354e7 │
│ FP_ARITH_INST_RETIRED_128B_PACKED_DOUBLE │      11.0 │
│      FP_ARITH_INST_RETIRED_SCALAR_DOUBLE │  385295.0 │
│ FP_ARITH_INST_RETIRED_256B_PACKED_DOUBLE │       0.0 │
└──────────────────────────────────────────┴───────────┘
┌──────────────────────┬────────────┐
│               Metric │   Thread 1 │
├──────────────────────┼────────────┤
│  Runtime (RDTSC) [s] │  0.0085065 │
│ Runtime unhalted [s] │ 0.00973311 │
│          Clock [MHz] │    4692.13 │
│                  CPI │   0.324922 │
│         DP [MFLOP/s] │    45.2968 │
│     AVX DP [MFLOP/s] │        0.0 │
│     Packed [MUOPS/s] │ 0.00129313 │
│     Scalar [MUOPS/s] │    45.2942 │
│  Vectorization ratio │ 0.00285487 │
└──────────────────────┴────────────┘

The SaveSolutionCallback uses a tiny amount of Float64 computations, too:

julia> @perfmon "FLOPS_DP" solve(ode, CarpenterKennedy2N54(williamson_condition = false),
                                 dt = 5.0f-2, save_everystep = false, callback = save_solution);

Group: FLOPS_DP
┌──────────────────────────────────────────┬───────────┐
│                                    Event │  Thread 1 │
├──────────────────────────────────────────┼───────────┤
│                        INSTR_RETIRED_ANY │ 1.05923e8 │
│                    CPU_CLK_UNHALTED_CORE │ 3.77849e7 │
│                     CPU_CLK_UNHALTED_REF │ 2.97858e7 │
│ FP_ARITH_INST_RETIRED_128B_PACKED_DOUBLE │       0.0 │
│      FP_ARITH_INST_RETIRED_SCALAR_DOUBLE │       8.0 │
│ FP_ARITH_INST_RETIRED_256B_PACKED_DOUBLE │       0.0 │
└──────────────────────────────────────────┴───────────┘
┌──────────────────────┬─────────────┐
│               Metric │    Thread 1 │
├──────────────────────┼─────────────┤
│  Runtime (RDTSC) [s] │  0.00988881 │
│ Runtime unhalted [s] │   0.0102234 │
│          Clock [MHz] │      4688.5 │
│                  CPI │    0.356721 │
│         DP [MFLOP/s] │ 0.000808995 │
│     AVX DP [MFLOP/s] │         0.0 │
│     Packed [MUOPS/s] │         0.0 │
│     Scalar [MUOPS/s] │ 0.000808995 │
│  Vectorization ratio │         0.0 │
└──────────────────────┴─────────────┘

TODO for this PR

Open questions for discussion

  • Shall we have the option to use only Float32 for stuff like the PerformanceCounter?
  • What about callbacks?
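To make the first question concrete, here is a hypothetical sketch (not part of this PR) of a counter that is parametric in its real type, with Float64 kept as the default:

```julia
# Hypothetical sketch: a PerformanceCounter parametrized on its real type,
# so Float32-only runs can avoid the hardcoded Float64 runtime field.
mutable struct PerformanceCounter{RealT <: Real}
    ncalls_since_readout::Int
    runtime::RealT
end

# The default constructor keeps the current Float64 behavior.
PerformanceCounter() = PerformanceCounter{Float64}(0, 0.0)
```

This would avoid the manual patch above while remaining backwards compatible.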

Open questions that will not be resolved here

  • Installing LIKWID is quite tedious. I don't think it's an option with our current CI system. Thus, the nice tests can only be done manually right now.
  • How many examples shall we provide?
  • What shall we test?

Contributor

Review checklist

This checklist is meant to assist creators of PRs (to let them know what reviewers will typically look for) and reviewers (to guide them in a structured review process). Items do not need to be checked explicitly for a PR to be eligible for merging.

Purpose and scope

  • The PR has a single goal that is clear from the PR title and/or description.
  • All code changes represent a single set of modifications that logically belong together.
  • No more than 500 lines of code are changed or there is no obvious way to split the PR into multiple PRs.

Code quality

  • The code can be understood easily.
  • Newly introduced names for variables etc. are self-descriptive and consistent with existing naming conventions.
  • There are no redundancies that can be removed by simple modularization/refactoring.
  • There are no leftover debug statements or commented code sections.
  • The code adheres to our conventions and style guide, and to the Julia guidelines.

Documentation

  • New functions and types are documented with a docstring or top-level comment.
  • Relevant publications are referenced in docstrings (see example for formatting).
  • Inline comments are used to document longer or unusual code sections.
  • Comments describe intent ("why?") and not just functionality ("what?").
  • If the PR introduces a significant change or new feature, it is documented in NEWS.md with its PR number.

Testing

  • The PR passes all tests.
  • New or modified lines of code are covered by tests.
  • New or modified tests run in less than 10 seconds.

Performance

  • There are no type instabilities or memory allocations in performance-critical parts.
  • If the PR intent is to improve performance, before/after time measurements are posted in the PR.

Verification

  • The correctness of the code was verified using appropriate tests.
  • If new equations/methods are added, a convergence test has been run and the results
    are posted in the PR.

Created with ❤️ by the Trixi.jl community.


codecov bot commented Aug 15, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 96.30%. Comparing base (afd8060) to head (6dee45a).
Report is 1 commit behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #2042   +/-   ##
=======================================
  Coverage   96.30%   96.30%           
=======================================
  Files         466      467    +1     
  Lines       37227    37256   +29     
=======================================
+ Hits        35850    35879   +29     
  Misses       1377     1377           
Flag Coverage Δ
unittests 96.30% <100.00%> (+<0.01%) ⬆️



@huiyuxie
Member

Thanks!!

I'm not sure whether using Float32 for meshes and callbacks is meaningful here, but it's important to me, as it would be good to ensure that things can be accelerated on the GPU as much as possible. I will enhance them in the future regardless of whether it's needed here or not.

I have a question about the original purpose of adding Float32 support to this repository. I can see that adding higher-precision numbers (like Float128) helps achieve more precise results, but in the case of Float32, there doesn't seem to be much speed gain on CPU. I can't see any benefit other than providing support for GPU execution. @ranocha

@ranocha
Member Author

ranocha commented Aug 16, 2024

I have a question about the original purpose of adding Float32 support to this repository. I can see that adding higher-precision numbers (like Float128) helps achieve more precise results, but in the case of Float32, there doesn't seem to be much speed gain on CPU. I can't see any benefit other than providing support for GPU execution.

That's indeed the main motivation.

@ranocha
Member Author

ranocha commented Aug 16, 2024

julia> redirect_stdout(devnull) do 
           trixi_include("examples/unstructured_2d_dgsem/elixir_shallowwater_ec_float32.jl")
       end
[ Info: You just called `trixi_include`. Julia may now compile the code, please be patient.

julia> u_ode = copy(sol.u[end]); du_ode = similar(u_ode);

julia> @perfmon "FLOPS_DP" Trixi.rhs!(du_ode, u_ode, semi, 0.0f0);

Group: FLOPS_DP
┌──────────────────────────────────────────┬──────────┐
│                                    Event │ Thread 1 │
├──────────────────────────────────────────┼──────────┤
│                        INSTR_RETIRED_ANY │ 868636.0 │
│                    CPU_CLK_UNHALTED_CORE │ 373225.0 │
│                     CPU_CLK_UNHALTED_REF │ 287364.0 │
│ FP_ARITH_INST_RETIRED_128B_PACKED_DOUBLE │      0.0 │
│      FP_ARITH_INST_RETIRED_SCALAR_DOUBLE │      0.0 │
│ FP_ARITH_INST_RETIRED_256B_PACKED_DOUBLE │      0.0 │
└──────────────────────────────────────────┴──────────┘
┌──────────────────────┬─────────────┐
│               Metric │    Thread 1 │
├──────────────────────┼─────────────┤
│  Runtime (RDTSC) [s] │  9.39318e-5 │
│ Runtime unhalted [s] │ 0.000100979 │
│          Clock [MHz] │     4800.41 │
│                  CPI │    0.429668 │
│         DP [MFLOP/s] │         0.0 │
│     AVX DP [MFLOP/s] │         0.0 │
│     Packed [MUOPS/s] │         0.0 │
│     Scalar [MUOPS/s] │         0.0 │
│  Vectorization ratio │         NaN │
└──────────────────────┴─────────────┘

@ranocha
Member Author

ranocha commented Aug 16, 2024

julia> trixi_include("examples/p4est_3d_dgsem/elixir_euler_free_stream_boundaries_float32.jl")

julia> u_ode = copy(sol.u[end]); du_ode = similar(u_ode);

julia> @perfmon "FLOPS_DP" Trixi.rhs!(du_ode, u_ode, semi, 0.0f0);

Group: FLOPS_DP
┌──────────────────────────────────────────┬───────────┐
│                                    Event │  Thread 1 │
├──────────────────────────────────────────┼───────────┤
│                        INSTR_RETIRED_ANY │ 4.45956e6 │
│                    CPU_CLK_UNHALTED_CORE │ 1.91896e6 │
│                     CPU_CLK_UNHALTED_REF │  1.5092e6 │
│ FP_ARITH_INST_RETIRED_128B_PACKED_DOUBLE │       0.0 │
│      FP_ARITH_INST_RETIRED_SCALAR_DOUBLE │       0.0 │
│ FP_ARITH_INST_RETIRED_256B_PACKED_DOUBLE │       0.0 │
└──────────────────────────────────────────┴───────────┘
┌──────────────────────┬─────────────┐
│               Metric │    Thread 1 │
├──────────────────────┼─────────────┤
│  Runtime (RDTSC) [s] │ 0.000478556 │
│ Runtime unhalted [s] │ 0.000519195 │
│          Clock [MHz] │     4699.53 │
│                  CPI │    0.430302 │
│         DP [MFLOP/s] │         0.0 │
│     AVX DP [MFLOP/s] │         0.0 │
│     Packed [MUOPS/s] │         0.0 │
│     Scalar [MUOPS/s] │         0.0 │
│  Vectorization ratio │         NaN │
└──────────────────────┴─────────────┘

@ranocha ranocha changed the title WIP: computations using only Float32 and no Float64 computations using only Float32 and no Float64 Aug 16, 2024
@ranocha
Member Author

ranocha commented Aug 16, 2024

FDSBP upwind works, too:

julia> trixi_include("examples/unstructured_2d_fdsbp/elixir_euler_free_stream_upwind_float32.jl", tspan = (0.0f0, 0.1f0))

julia> u_ode = copy(sol.u[end]); du_ode = similar(u_ode);

julia> @perfmon "FLOPS_DP" Trixi.rhs!(du_ode, u_ode, semi, 0.0f0);

Group: FLOPS_DP
┌──────────────────────────────────────────┬───────────┐
│                                    Event │  Thread 1 │
├──────────────────────────────────────────┼───────────┤
│                        INSTR_RETIRED_ANY │ 2.59028e7 │
│                    CPU_CLK_UNHALTED_CORE │ 1.44735e7 │
│                     CPU_CLK_UNHALTED_REF │  1.1153e7 │
│ FP_ARITH_INST_RETIRED_128B_PACKED_DOUBLE │       0.0 │
│      FP_ARITH_INST_RETIRED_SCALAR_DOUBLE │       0.0 │
│ FP_ARITH_INST_RETIRED_256B_PACKED_DOUBLE │       0.0 │
└──────────────────────────────────────────┴───────────┘
┌──────────────────────┬────────────┐
│               Metric │   Thread 1 │
├──────────────────────┼────────────┤
│  Runtime (RDTSC) [s] │  0.0033605 │
│ Runtime unhalted [s] │ 0.00391595 │
│          Clock [MHz] │    4796.45 │
│                  CPI │   0.558763 │
│         DP [MFLOP/s] │        0.0 │
│     AVX DP [MFLOP/s] │        0.0 │
│     Packed [MUOPS/s] │        0.0 │
│     Scalar [MUOPS/s] │        0.0 │
│  Vectorization ratio │        NaN │
└──────────────────────┴────────────┘

@ranocha ranocha marked this pull request as ready for review August 16, 2024 13:20
@huiyuxie
Member

Solving without callbacks is also fine

julia> @perfmon "FLOPS_DP" solve(ode, CarpenterKennedy2N54(williamson_condition = false),
                                 dt = 5.0f-2, save_everystep = false);

But the result shows that Float64 operations actually occurred during the process. Could you please check again and see where they come from?

Also, you are all referring to type stability during the call of solve, but I also care about type stability during mesh initialization (and callback initialization, if applicable). For example, there is much room for speedup in StructuredMesh if you look into https://github.com/trixi-framework/Trixi.jl/blob/main/src/meshes/structured_mesh.jl (even a small speedup is meaningful for me). Could you please update the description here and xref to clarify that it only refers to type stability during the solving process? This is different from general type stability, as we also have some hardcoded Float64 for StructuredMesh, like here:

# Interpolate linearly between left and right value where s should be between -1 and 1
function linear_interpolate(s, left_value, right_value)
    0.5 * ((1 - s) * left_value + (1 + s) * right_value)
end
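For illustration, a sketch of how such a literal can be made type-friendly via the 0.5f0 convention: a Float32 literal still promotes to Float64 with Float64 inputs, but keeps Float32 inputs in Float32.

```julia
# Sketch (following the 0.5 -> 0.5f0 convention): the Float32 literal keeps
# Float32 inputs in Float32, while Float64 inputs still promote the result
# to Float64.
function linear_interpolate(s, left_value, right_value)
    return 0.5f0 * ((1 - s) * left_value + (1 + s) * right_value)
end
```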

What I plan to do is to perform mesh initialization on the GPU and keep the data on the GPU to avoid the overhead of data transfer between the CPU and GPU. I can handle the type stability during initialization myself. @ranocha

@ranocha
Member Author

ranocha commented Aug 17, 2024

But the result shows that Float64 operations actually occurred during the process. Could you please check again and see where they come from?

I need to get access to the system again, but it may be that I was running the code without the PerformanceCounter fix (since I accidentally used git commit -a and then needed to revert it again, maybe I forgot something 🤷). But it's just a tiny amount:

julia> 0.000264961 * 0.00754828 # MFLOP/s * time in seconds
1.99999981708e-6

So that's like 2 FLOPs.
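For reference, the conversion used here can be written as a tiny helper (flop_count is a hypothetical name, not a Trixi function):

```julia
# MFLOP/s times runtime in seconds gives MFLOP; multiply by 1e6 to get the
# total FLOP count of the measured region.
flop_count(mflop_per_s, runtime_s) = round(Int, mflop_per_s * 1.0e6 * runtime_s)
```

Applied to the numbers from the tables above, this recovers the 2 FLOPs of the run without callbacks.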

@ranocha
Member Author

ranocha commented Aug 17, 2024

Also, you are all referring to type stability during the call of solve,

It's even less than that - what I have measured here is whether there are Float64 computations in rhs! or solve. There could still be type instabilities somewhere, but that's not accounted for; it is only tested implicitly via our allocation tests.
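A minimal illustration of this distinction (hypothetical functions, not from Trixi): both functions below are type-stable, but only the first performs Float64 arithmetic internally, which the FLOPS_DP hardware counters would catch while allocation tests would not.

```julia
# Both functions are type-stable (each always returns Float32), so allocation
# tests pass for either. But `f` multiplies in Float64 internally before
# converting back, while `g` stays in Float32 throughout.
f(x::Float32) = Float32(2.0 * x)   # hidden Float64 multiply
g(x::Float32) = 2.0f0 * x          # pure Float32 arithmetic
```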

@ranocha
Member Author

ranocha commented Aug 17, 2024

I also care about type stability during mesh initialization (and callback initialization if applicable).

That's a good point 👍

For example, there are much space for speedup for StructuredMesh if you look into https://github.com/trixi-framework/Trixi.jl/blob/main/src/meshes/structured_mesh.jl (even small speedup is also meaningful for me). Could you please update the description here and xref to clarify that it only refers to type stability during the solving process?

Done

@ranocha
Member Author

ranocha commented Aug 17, 2024

What I plan to do is to perform mesh initialization on the GPU and keep the data on the GPU to avoid the overhead of data transfer between the CPU and GPU. I can handle the type stability during initialization myself.

Do you plan to do this in your repo or also with PRs to this repo, e.g., to fix the problem you pointed at above?

@huiyuxie
Member

So that's like 2 FLOPs.

That's fair. Then for SummaryCallback we get around 4 FLOPs.

julia> 0.000512994 * 0.00779737
4.00000402578e-6

And for SaveSolutionCallback we get around 8 FLOPs.

julia> 0.000808995 * 0.00988881
7.99999784595e-6

It does not seem reasonable to me to conclude that SummaryCallback and SaveSolutionCallback are not fine while saying that solving without callbacks is fine, as I cannot see a big difference between 2, 4, and 8 FLOPs.

Installing LIKWID is quite tedious. I don't think it's an option with our current CI system. Thus, the nice tests can only be done manually right now.

I'm more than curious about this question. Do you have any good ideas (or have you seen approaches in other repos) for designing thorough tests for this, apart from the memory allocation tests you are currently using?

Do you plan to do this in your repo or also with PRs to this repo, e.g., to fix the problem you pointed at above?

Certainly. But I'm not sure whether to implement it in my repo or in this repo. Do you see any benefit in making Float32 work for mesh initialization in your repo? If so, I can start by trying the improvement in your repo and then see how it could be better implemented in mine.

@ranocha

@ranocha
Member Author

ranocha commented Aug 20, 2024

Installing LIKWID is quite tedious. I don't think it's an option with our current CI system. Thus, the nice tests can only be done manually right now.

I'm more than curious about this question. Do you have any good ideas (or see from other repos) for designing thorough tests for it, except for the memory allocation tests you are currently using?

No

Do you plan to do this in your repo or also with PRs to this repo, e.g., to fix the problem you pointed at above?

Certainly. But I'm not sure whether to implement it in my repo or in this repo. Could you see any benefits in making Float32 work for mesh initialization in your repo? If so, I can start by trying the improvement in your repo, and then see how it could be better implemented in my repo.

Well, if it's just fixing the 0.5 coefficients to be 0.5f0, it would be nice to have it in the main repo to make it available for everybody.

@huiyuxie
Member

It's not only about 0.5 and 0.5f0. Some meshes are hardcoded to Float64 (as you already mentioned). Are you planning to fix the entire issue yourself, or just to focus on the 0.5 vs. 0.5f0 aspect? I think tests are also needed, just like for the equations.

@ranocha
Member Author

ranocha commented Aug 21, 2024

This PR is big enough as it is. I will not work on the other mesh types immediately. Adding tests for complete RHS evaluations etc. is quite difficult, as I have described above - even when looking only at something like flux_differencing_kernel!, I don't see any simple way to verify that we use only a certain type of floating point numbers with our current CI setup.

Member

@andrewwinters5000 andrewwinters5000 left a comment


Everything looks good as far as the changes go. Just a few questions and one suggestion to simplify the new elixir.

One question looking toward the future regarding, e.g., ln_mean: for a literal that cannot be exactly represented in Float32, would we change a line like

        return (x + y) / @evalpoly(f2, 2, 2/3, 2/5, 2/7)

to be

        RealT = eltype(x)
        return (x + y) / @evalpoly(f2, 2, convert(RealT, 2/3), convert(RealT, 2/5), convert(RealT, 2/7))

test/test_p4est_3d.jl (review thread resolved)
test/test_unstructured_2d.jl (review thread resolved)
Member

@andrewwinters5000 andrewwinters5000 left a comment


LGTM!

@ranocha
Member Author

ranocha commented Aug 21, 2024

One question looking toward the future regarding, e.g., ln_mean: for a literal that cannot be exactly represented in Float32, would we change a line like

        return (x + y) / @evalpoly(f2, 2, 2/3, 2/5, 2/7)

to be

        RealT = eltype(x)
        return (x + y) / @evalpoly(f2, 2, convert(RealT, 2/3), convert(RealT, 2/5), convert(RealT, 2/7))

That's a good question. I am not sure how exactly to handle the ln_mean. There are several options:

  1. convert the constants to RealT as you suggested above (with RealT = typeof(x)). However, this means we will also use dual numbers when running AD, which might not be what we actually want. Nevertheless, this is what we use right now in most cases.
  2. Extract the underlying basic real type - not only here but also in other places. This could be something along the lines of typeof(ForwardDiff.value(real(x))) - which should get a better name if we want to use this pattern.
  3. Add new methods dispatching on the types of x, y. This would also give us the flexibility to change the underlying approximation in case we find something more efficient but similarly accurate for Float32. However, we have to decide how to handle AD in this case since we will get a problem when running AD in Float32 if we just dispatch on ::Float32.

Currently, I would likely opt for variant 1, but maybe with a slight change, since I am not sure whether it's better to use convert(RealT, 2 / 3) or convert(RealT, 2) / 3. A first test does not suggest a difference in Julia 1.10:

julia> function foo1(x, y)
           s = x + y
           return (x + y) / @evalpoly(s, 2, 2 / 3, 2 / 5, 2 / 7)
       end
foo1 (generic function with 1 method)

julia> function foo2(x, y)
           s = x + y
           RealT = typeof(s)
           return (x + y) / @evalpoly(s, 2, convert(RealT, 2 / 3), 
                                            convert(RealT, 2 / 5), 
                                            convert(RealT, 2 / 7))
       end
foo2 (generic function with 1 method)

julia> function foo3(x, y)
           s = x + y
           RealT = typeof(s)
           return (x + y) / @evalpoly(s, 2, convert(RealT, 2) / 3, 
                                            convert(RealT, 2) / 5, 
                                            convert(RealT, 2) / 7)
       end
foo3 (generic function with 1 method)

julia> foo1(1.0, 1.0)
0.2770448548812665

julia> foo2(1.0, 1.0)
0.2770448548812665

julia> foo3(1.0, 1.0)
0.2770448548812665

julia> foo1(1.0f0, 1.0f0)
0.2770448548812665

julia> foo2(1.0f0, 1.0f0)
0.27704483f0

julia> foo3(1.0f0, 1.0f0)
0.27704483f0

julia> @code_llvm foo1(1.0, 1.0)
;  @ REPL[9]:1 within `foo1`
define double @julia_foo1_637(double %0, double %1) #0 {
top:
;  @ REPL[9]:2 within `foo1`
; ┌ @ float.jl:409 within `+`
   %2 = fadd double %0, %1
; └
;  @ REPL[9]:3 within `foo1`
; ┌ @ math.jl:186 within `evalpoly`
; │┌ @ math.jl:187 within `macro expansion`
; ││┌ @ float.jl:414 within `muladd`
     %3 = fmul contract double %2, 0x3FD2492492492492
     %4 = fadd contract double %3, 4.000000e-01
     %5 = fmul contract double %2, %4
     %6 = fadd contract double %5, 0x3FE5555555555555
; │││ @ promotion.jl:459 within `muladd` @ float.jl:414
     %7 = fmul contract double %2, %6
     %8 = fadd contract double %7, 2.000000e+00
; └└└
; ┌ @ float.jl:412 within `/`
   %9 = fdiv double %2, %8
; └
  ret double %9
}

julia> @code_llvm foo2(1.0, 1.0)
;  @ REPL[11]:1 within `foo2`
define double @julia_foo2_648(double %0, double %1) #0 {
top:
;  @ REPL[11]:2 within `foo2`
; ┌ @ float.jl:409 within `+`
   %2 = fadd double %0, %1
; └
;  @ REPL[11]:4 within `foo2`
; ┌ @ math.jl:186 within `evalpoly`
; │┌ @ math.jl:187 within `macro expansion`
; ││┌ @ float.jl:414 within `muladd`
     %3 = fmul contract double %2, 0x3FD2492492492492
     %4 = fadd contract double %3, 4.000000e-01
     %5 = fmul contract double %2, %4
     %6 = fadd contract double %5, 0x3FE5555555555555
; │││ @ promotion.jl:459 within `muladd` @ float.jl:414
     %7 = fmul contract double %2, %6
     %8 = fadd contract double %7, 2.000000e+00
; └└└
; ┌ @ float.jl:412 within `/`
   %9 = fdiv double %2, %8
; └
  ret double %9
}

julia> @code_llvm foo3(1.0, 1.0)
;  @ REPL[12]:1 within `foo3`
define double @julia_foo3_650(double %0, double %1) #0 {
top:
;  @ REPL[12]:2 within `foo3`
; ┌ @ float.jl:409 within `+`
   %2 = fadd double %0, %1
; └
;  @ REPL[12]:4 within `foo3`
; ┌ @ math.jl:186 within `evalpoly`
; │┌ @ math.jl:187 within `macro expansion`
; ││┌ @ float.jl:414 within `muladd`
     %3 = fmul contract double %2, 0x3FD2492492492492
     %4 = fadd contract double %3, 4.000000e-01
     %5 = fmul contract double %2, %4
     %6 = fadd contract double %5, 0x3FE5555555555555
; │││ @ promotion.jl:459 within `muladd` @ float.jl:414
     %7 = fmul contract double %2, %6
     %8 = fadd contract double %7, 2.000000e+00
; └└└
; ┌ @ float.jl:412 within `/`
   %9 = fdiv double %2, %8
; └
  ret double %9
}

julia> @code_llvm foo1(1.0f0, 1.0f0)
;  @ REPL[9]:1 within `foo1`
define double @julia_foo1_652(float %0, float %1) #0 {
top:
;  @ REPL[9]:2 within `foo1`
; ┌ @ float.jl:409 within `+`
   %2 = fadd float %0, %1
; └
;  @ REPL[9]:3 within `foo1`
; ┌ @ math.jl:186 within `evalpoly`
; │┌ @ math.jl:187 within `macro expansion`
; ││┌ @ promotion.jl:459 within `muladd`
; │││┌ @ promotion.jl:399 within `promote`
; ││││┌ @ promotion.jl:377 within `_promote`
; │││││┌ @ number.jl:7 within `convert`
; ││││││┌ @ float.jl:261 within `Float64`
         %3 = fpext float %2 to double
; │││└└└└
; │││ @ promotion.jl:459 within `muladd` @ float.jl:414
     %4 = fmul contract double %3, 0x3FD2492492492492
     %5 = fadd contract double %4, 4.000000e-01
     %6 = fmul contract double %5, %3
     %7 = fadd contract double %6, 0x3FE5555555555555
     %8 = fmul contract double %7, %3
     %9 = fadd contract double %8, 2.000000e+00
; └└└
; ┌ @ promotion.jl:425 within `/` @ float.jl:412
   %10 = fdiv double %3, %9
; └
  ret double %10
}

julia> @code_llvm foo2(1.0f0, 1.0f0)
;  @ REPL[11]:1 within `foo2`
define float @julia_foo2_654(float %0, float %1) #0 {
top:
;  @ REPL[11]:2 within `foo2`
; ┌ @ float.jl:409 within `+`
   %2 = fadd float %0, %1
; └
;  @ REPL[11]:4 within `foo2`
; ┌ @ math.jl:186 within `evalpoly`
; │┌ @ math.jl:187 within `macro expansion`
; ││┌ @ float.jl:414 within `muladd`
     %3 = fmul contract float %2, 0x3FD24924A0000000
     %4 = fadd contract float %3, 0x3FD99999A0000000
     %5 = fmul contract float %2, %4
     %6 = fadd contract float %5, 0x3FE5555560000000
; │││ @ promotion.jl:459 within `muladd` @ float.jl:414
     %7 = fmul contract float %2, %6
     %8 = fadd contract float %7, 2.000000e+00
; └└└
; ┌ @ float.jl:412 within `/`
   %9 = fdiv float %2, %8
; └
  ret float %9
}

julia> @code_llvm foo3(1.0f0, 1.0f0)
;  @ REPL[12]:1 within `foo3`
define float @julia_foo3_656(float %0, float %1) #0 {
top:
;  @ REPL[12]:2 within `foo3`
; ┌ @ float.jl:409 within `+`
   %2 = fadd float %0, %1
; └
;  @ REPL[12]:4 within `foo3`
; ┌ @ math.jl:186 within `evalpoly`
; │┌ @ math.jl:187 within `macro expansion`
; ││┌ @ float.jl:414 within `muladd`
     %3 = fmul contract float %2, 0x3FD24924A0000000
     %4 = fadd contract float %3, 0x3FD99999A0000000
     %5 = fmul contract float %2, %4
     %6 = fadd contract float %5, 0x3FE5555560000000
; │││ @ promotion.jl:459 within `muladd` @ float.jl:414
     %7 = fmul contract float %2, %6
     %8 = fadd contract float %7, 2.000000e+00
; └└└
; ┌ @ float.jl:412 within `/`
   %9 = fdiv float %2, %8
; └
  ret float %9
}
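To summarize the LLVM output above: foo1 silently promotes Float32 inputs to Float64 because the literal coefficients 2 / 3 etc. are Float64 constants, while the convert-based variants stay entirely in Float32. This can also be checked programmatically; a small sketch (using the evalpoly function instead of the macro, with the same bodies as in the transcript):

```julia
# Sketch mirroring foo1/foo3 from the transcript above
# (evalpoly function instead of the @evalpoly macro).
function foo1(x, y)
    s = x + y
    return (x + y) / evalpoly(s, (2, 2 / 3, 2 / 5, 2 / 7))
end

function foo3(x, y)
    s = x + y
    RealT = typeof(s)
    return (x + y) / evalpoly(s, (2, convert(RealT, 2) / 3,
                                     convert(RealT, 2) / 5,
                                     convert(RealT, 2) / 7))
end

# foo1 promotes to Float64 via the Float64 coefficients ...
@assert foo1(1.0f0, 1.0f0) isa Float64
# ... while foo3 keeps the whole computation in Float32
@assert foo3(1.0f0, 1.0f0) isa Float32
```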

andrewwinters5000 (Member) commented Aug 21, 2024

> Currently, I would likely opt for variant 1 but maybe with a slight change since I am not sure whether it's better to use convert(RealT, 2 / 3) or convert(RealT, 2) / 3. A first test does not suggest a difference in Julia 1.10:

I would also lean toward option 1, as it touches the least code and (hopefully) does not cause too many headaches with AD. Just to double-check that I understand the type promotion:

  • Computing convert(RealT, 2 / 3): the Int / Int division yields a Float64 value, which is then converted to RealT.
  • Computing convert(RealT, 2) / 3: the Int is converted to RealT first, and the division RealT / Int then promotes to the "most accurate" type, i.e., RealT.

ranocha (Member, Author) commented Aug 21, 2024

> > Currently, I would likely opt for variant 1 but maybe with a slight change since I am not sure whether it's better to use convert(RealT, 2 / 3) or convert(RealT, 2) / 3. A first test does not suggest a difference in Julia 1.10:
>
> I would also lean toward option 1, as it touches the least code and (hopefully) does not cause too many headaches with AD. Just to double-check that I understand the type promotion:
>
>   • Computing convert(RealT, 2 / 3): the Int / Int division yields a Float64 value, which is then converted to RealT.
>   • Computing convert(RealT, 2) / 3: the Int is converted to RealT first, and the division RealT / Int then promotes to the "most accurate" type, i.e., RealT.

Yes, exactly 👍
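The equivalence of the two conversion orders for these particular coefficients can be spot-checked directly; a quick sketch, assuming RealT = Float32:

```julia
RealT = Float32
# Variant 1: the Int / Int division happens in Float64 first,
# and the result is then rounded to Float32
a = convert(RealT, 2 / 3)
# Variant 2: the Int is converted to Float32 first,
# so the division itself happens in Float32
b = convert(RealT, 2) / 3
@assert a isa Float32 && b isa Float32
@assert a == b  # no observable difference for this coefficient
```

In principle the two variants could differ by one ulp due to double rounding in variant 1, but for the coefficients used here both yield the same Float32 value, consistent with foo2 and foo3 returning identical results in the transcript above.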

@ranocha ranocha merged commit 20ab97b into main Aug 21, 2024
38 checks passed
@ranocha ranocha deleted the hr/float32 branch August 21, 2024 07:51