Feature: Checkpointing for T8codeMesh #1980

Draft · wants to merge 52 commits into main
Conversation

@jmark commented Jun 13, 2024

This PR adds checkpointing for T8codeMesh: routines such as save_mesh and load_mesh are now supported.

Closes #2044
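
For illustration, restarting from a checkpoint with this functionality should look much like Trixi.jl's existing restart elixirs. The following is a minimal sketch, not code from this PR: the file name, callback interval, and end time are made up, and `semi` stands for a semidiscretization built on a T8codeMesh as in any elixir.

```julia
using Trixi

# Initial run: write restart files periodically via the standard callback.
save_restart = SaveRestartCallback(interval = 100)

# Restart run: rebuild the mesh and the ODE problem from a restart file.
restart_filename = joinpath("out", "restart_000100.h5")  # made-up file name
mesh = load_mesh(restart_filename)          # with this PR, also for T8codeMesh
tspan = (load_time(restart_filename), 2.0)  # resume at the saved time
# `semi` is assumed to be built from `mesh`, equations, and solver as usual.
ode = semidiscretize(semi, tspan, restart_filename)
```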

@jmark added the "enhancement" label Jun 13, 2024

Review checklist

This checklist is meant to assist creators of PRs (to let them know what reviewers will typically look for) and reviewers (to guide them in a structured review process). Items do not need to be checked explicitly for a PR to be eligible for merging.

Purpose and scope

  • The PR has a single goal that is clear from the PR title and/or description.
  • All code changes represent a single set of modifications that logically belong together.
  • No more than 500 lines of code are changed or there is no obvious way to split the PR into multiple PRs.

Code quality

  • The code can be understood easily.
  • Newly introduced names for variables etc. are self-descriptive and consistent with existing naming conventions.
  • There are no redundancies that can be removed by simple modularization/refactoring.
  • There are no leftover debug statements or commented code sections.
  • The code adheres to our conventions and style guide, and to the Julia guidelines.

Documentation

  • New functions and types are documented with a docstring or top-level comment.
  • Relevant publications are referenced in docstrings (see example for formatting).
  • Inline comments are used to document longer or unusual code sections.
  • Comments describe intent ("why?") and not just functionality ("what?").
  • If the PR introduces a significant change or new feature, it is documented in NEWS.md with its PR number.

Testing

  • The PR passes all tests.
  • New or modified lines of code are covered by tests.
  • New or modified tests run in less than 10 seconds.

Performance

  • There are no type instabilities or memory allocations in performance-critical parts.
  • If the PR intent is to improve performance, before/after time measurements are posted in the PR.

Verification

  • The correctness of the code was verified using appropriate tests.
  • If new equations/methods are added, a convergence test has been run and the results are posted in the PR.

Created with ❤️ by the Trixi.jl community.

codecov bot commented Jun 13, 2024

Codecov Report

Attention: Patch coverage is 2.96610% with 229 lines in your changes missing coverage. Please review.

Project coverage is 87.12%. Comparing base (91eac3c) to head (573133a).

Files                                    Patch %   Lines missing
src/meshes/t8code_mesh.jl                  0.00%   162 ⚠️
src/meshes/mesh_io.jl                      0.00%    58 ⚠️
src/auxiliary/t8code.jl                   58.33%     5 ⚠️
src/callbacks_step/save_restart_dg.jl      0.00%     3 ⚠️
src/callbacks_step/amr.jl                  0.00%     1 ⚠️

❗ The number of uploaded reports differs between BASE (91eac3c) and HEAD (573133a): HEAD has 4 fewer uploads than BASE (unittests flag: 25 vs. 21).
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1980      +/-   ##
==========================================
- Coverage   96.23%   87.12%   -9.10%     
==========================================
  Files         462      462              
  Lines       37075    37233     +158     
==========================================
- Hits        35676    32439    -3237     
- Misses       1399     4794    +3395     
Flag Coverage Δ
unittests 87.12% <2.97%> (-9.10%) ⬇️

Flags with carried forward coverage won't be shown.


@jmark marked this pull request as ready for review June 24, 2024
@benegee self-assigned this Jun 26, 2024
@benegee self-requested a review June 26, 2024
Review comment on src/meshes/mesh_io.jl (outdated, resolved)
@benegee left a comment

Great feature. Thanks a lot!

Review comments on src/meshes/t8code_mesh.jl (resolved; several outdated)
@jmark commented Jul 4, 2024

> Do you think the MPI failures are really unrelated?

No, I am not sure. I just know that we had stalling CI jobs before. And, looking through the recent failures, I think it was not always the same elixir.

I get the feeling that the MPI tests have become too big and take too long. We probably have to split them up, as with the serial tests.

@JoshuaLampert commented

> I get the feeling that the MPI tests have become too big and take too long. We probably have to split them up, as with the serial tests.

Yes, could be related to OOM issues, cf. #1471.

@jmark commented Jul 5, 2024

> I get the feeling that the MPI tests have become too big and take too long. We probably have to split them up, as with the serial tests.

> Yes, could be related to OOM issues, cf. #1471.

I was able to narrow it down: it has something to do with Julia 1.10.4. With Julia 1.10.2 it does not stall. Investigating ...

@JoshuaLampert commented

Are you able to reproduce the problem locally?

@jmark commented Jul 5, 2024

Yes! With Julia 1.10.2 the t8code MPI tests run successfully. However, with Julia 1.10.4 the MPI test for elixir_advection_restart.jl in 2D stalls for whatever reason. Running the elixir with MPI directly (not wrapped in a test set) does not stall. No idea right now what's going on ...
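
For reference, the direct MPI run (the case that does not stall) can be reproduced locally along the lines of the sketch below. The process count, project path, and elixir location are assumptions, not taken from the CI setup.

```julia
using MPI  # provides the mpiexec() helper used to launch the run

# Julia code each MPI rank executes: run the elixir via Trixi's include helper.
script = """
using Trixi
trixi_include(joinpath(examples_dir(), "t8code_2d_dgsem",
                       "elixir_advection_restart.jl"))
"""

# Launch the elixir directly under MPI, bypassing the test harness.
MPI.mpiexec() do cmd
    run(`$cmd -n 2 $(Base.julia_cmd()) --project=. -e $script`)
end
```

Running the same elixir through the MPI test suite instead of this direct launch is what stalls.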

@sloede commented Jul 5, 2024

> Yes! With Julia 1.10.2 the t8code MPI tests run successfully. However, with Julia 1.10.4 the MPI test for elixir_advection_restart.jl in 2D stalls for whatever reason. Running the elixir with MPI directly (not wrapped in a test set) does not stall. No idea right now what's going on ...

Are you sure it's related to the patch version bump? Are you using an identical Manifest.toml for both tests?

@jmark commented Jul 5, 2024

> Yes! With Julia 1.10.2 the t8code MPI tests run successfully. However, with Julia 1.10.4 the MPI test for elixir_advection_restart.jl in 2D stalls for whatever reason. Running the elixir with MPI directly (not wrapped in a test set) does not stall. No idea right now what's going on ...

> Are you sure it's related to the patch version bump? Are you using an identical Manifest.toml for both tests?

Yes! I am working from the exact same project folder, just pointing the Julia binary to either 1.10.2 or 1.10.4.

@JoshuaLampert commented

So it consistently stalls with Julia 1.10.4, but consistently works with Julia 1.10.2 in multiple runs? Did you monitor RAM usage during the simulation?

@jmark commented Jul 8, 2024

> So it consistently stalls with Julia 1.10.4, but consistently works with Julia 1.10.2 in multiple runs? Did you monitor RAM usage during the simulation?

Yes! RAM usage is not out of the ordinary.

@jmark marked this pull request as draft July 9, 2024
@jmark commented Jul 9, 2024

I think I found the bug causing the stalls in the MPI runs: a silent memory leak/segfault. I added the fixes in the last commit. Furthermore, I changed the t8code C interface slightly to simplify the code on the Trixi side. This PR has to wait for the next breaking t8code release, specifically for the merge of DLR-AMR/t8code#1115.

I'll try to push for a major t8code release by the end of next week.

Labels: enhancement (New feature or request)

Successfully merging this pull request may close: saving solution while using T8code meshes

5 participants