Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Error handling on failed writes #904

Draft
wants to merge 10 commits into
base: dev
Choose a base branch
from
Draft

Conversation

lkotipal
Copy link
Contributor

@lkotipal lkotipal commented Feb 5, 2024

Requires fmihpc/vlsv#56 so failed writes actually return errors. Addresses #903.

  • Failed bulk writes are fatal. If the disk/quota is full, failed writes essentially cause data loss and unreliable output. Due to writing cadence we can't expect someone to be always on hand to clear disk space.
  • Try to clean up failed restart writes by removing the malformed file. Since restart writes are less common and much larger than bulk files, it's much easier to recover from a failed restart write. By removing the failed restart, the simulation can possibly run normally for at least some time, hopefully enough for someone to clear disk space and manually trigger a restart write.

@ykempf
Copy link
Contributor

ykempf commented Apr 23, 2024

Next time, call it "retrigger CI" or something, or just "CI", instead of "dummy". :)

vlasiator.cpp Outdated Show resolved Hide resolved
vlasiator.cpp Outdated Show resolved Hide resolved
@ykempf
Copy link
Contributor

ykempf commented Apr 23, 2024

So from our f2f2z discussion:

  • just get the file name out from writeRestart instead of plonking it into P:: as we don't seem to be needing it there atm, and reduces the risk of somehow nuking also the previous file
  • the MPI_Barrier is risky atm if not all tasks return a false.

@lkotipal lkotipal marked this pull request as draft April 23, 2024 10:37
@lkotipal
Copy link
Contributor Author

Addressed points from the discussion, but untested for now

@ykempf
Copy link
Contributor

ykempf commented Apr 23, 2024

Right so now I agree with the logic of the checks around the file removal. Let's see about testing as just discussed.

@lkotipal
Copy link
Contributor Author

lkotipal commented Apr 23, 2024

So for the test status on this -- the success path is checked by CI. Failure path is a bit tougher, since Vlasiator is too smart to try to write to paths without access. I did try hardcoding to /dev/full in iowrite, which does fail non-fatally as expected but doesn't really check all possible failure modes or the file deletion feature.

@markusbattarbee
Copy link
Contributor

Have you verified that the string containing the restart file name is correct, even in case of a successful write?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants