Add advanced checkpoint/restart capabilities to Hiop #686

Open
tepperly opened this issue May 31, 2024 · 0 comments · May be fixed by #693

@tepperly
Member

Many HPC applications need to implement a checkpoint/restart capability to address either:

  • maximum job time allocations (e.g., no jobs can run for more than 24 hours)
  • fault tolerance (e.g., individual parts of a supercomputer can fail, causing the job to terminate prematurely)

To address these concerns, typical HPC applications periodically write their internal state to persistent storage (checkpointing) and can later restart and resume progress from the most recent checkpoint file.
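
As a rough illustration of that write-then-resume pattern, here is a minimal sketch using only the C++ standard library. Everything in it is hypothetical: the SolverState struct, its fields, and the write_checkpoint()/read_checkpoint() helpers are invented for the example and do not reflect Hiop's internal state or the ::axom::sidre API. It also assumes the checkpoint is read back on the same machine type that wrote it, since it dumps raw bytes.

```cpp
#include <cstdint>
#include <cstdio>
#include <fstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical solver state, for illustration only (not Hiop's actual layout).
struct SolverState {
  std::uint64_t iteration = 0;  // iteration counter
  double mu = 0.0;              // e.g., a barrier parameter
  std::vector<double> x;        // current iterate
};

// Write the state to a temporary file, then rename it over the target, so an
// interrupted write does not clobber the previous checkpoint. (std::rename
// replaces an existing file on POSIX systems; other platforms may differ.)
bool write_checkpoint(const SolverState& s, const std::string& path)
{
  const std::string tmp = path + ".tmp";
  {
    std::ofstream out(tmp, std::ios::binary | std::ios::trunc);
    if(!out) return false;
    const std::uint64_t n = s.x.size();
    out.write(reinterpret_cast<const char*>(&s.iteration), sizeof(s.iteration));
    out.write(reinterpret_cast<const char*>(&s.mu), sizeof(s.mu));
    out.write(reinterpret_cast<const char*>(&n), sizeof(n));
    out.write(reinterpret_cast<const char*>(s.x.data()),
              static_cast<std::streamsize>(n * sizeof(double)));
    if(!out) return false;
  }  // the stream is closed here, before the rename
  return std::rename(tmp.c_str(), path.c_str()) == 0;
}

// Read a checkpoint into `s`. Returns false, leaving `s` untouched, if the
// file is missing or truncated.
bool read_checkpoint(const std::string& path, SolverState& s)
{
  std::ifstream in(path, std::ios::binary);
  if(!in) return false;
  SolverState tmp;
  std::uint64_t n = 0;
  in.read(reinterpret_cast<char*>(&tmp.iteration), sizeof(tmp.iteration));
  in.read(reinterpret_cast<char*>(&tmp.mu), sizeof(tmp.mu));
  in.read(reinterpret_cast<char*>(&n), sizeof(n));
  if(!in) return false;
  tmp.x.resize(n);
  in.read(reinterpret_cast<char*>(tmp.x.data()),
          static_cast<std::streamsize>(n * sizeof(double)));
  if(!in) return false;
  s = std::move(tmp);
  return true;
}
```

In a solver loop, one would typically call read_checkpoint() once at startup, falling back to a cold start if it returns false, and call write_checkpoint() every fixed number of iterations. A production implementation would also need to handle distributed (MPI) state, file-format versioning, and portability across platforms.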

Hiop has some ability to warm start through get_starting_point() or get_warm_start(). However, it would be better if Hiop could save more of its internal state so that a restart resumes closer to where the previous run stopped.

The ::axom::sidre package provides a flexible checkpoint/restart API and implementation. Multiple LLNL packages use ::axom::sidre. The repository is here. It's buildable with Spack.

@cnpetra changed the title from "Add advanced checkpont/restart capabilities to Hiop" to "Add advanced checkpoint/restart capabilities to Hiop" on Jun 3, 2024
@cnpetra linked a pull request on Sep 4, 2024 that will close this issue
4 tasks