Recovery from out of disk space state? #605
Replies: 2 comments
-
Interesting. We do have out-of-disk-space recovery tests for the database in LXD here: https://github.com/lxc/lxd/blob/master/test/suites/database.sh#L86 This is different from a cluster out-of-space failover, though. That was the next thing on our list to handle in the Jepsen tests before we got diverted to work on the microk8s issues, so we'll definitely get back to it. In general we're not treating this one as top priority because:
For the first case, I know we considered automatically discarding bad segments when on a single-node cluster. Assuming we can somehow back up what we're removing, this should be safe. @freeekanayaka does that match your recollection too?
-
The single-node case should already be safe: if you run out of space, no corruption will occur, and as soon as disk space becomes available again, the node should get back to a healthy state. At least that's what the LXD test you link proves in the case of LXD. It would be nice to check with microk8s too. In the second case (HA), what I had in mind was to find some way for a node to report its out-of-disk situation to the leader. If a quorum of nodes ends up being out of space, at least you can report a meaningful message to the user. In terms of safety, I'm not sure we saw corruption in the HA case with the current code, but one would need to resume those Jepsen tests and take a closer look. It might well be that there are also cases of corruption (as Kostantinos seems to hint for the microk8s case).
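To make the reporting idea above concrete, here is a minimal Python sketch. It is not dqlite code and does not use any dqlite API: `DiskReport`, `local_report`, `out_of_space_nodes`, `quorum_out_of_space`, and the `min_free_bytes` threshold are all hypothetical names chosen for illustration. It assumes each node can measure free space on its data directory and send that number to the leader, which then decides whether a quorum of nodes is out of space.

```python
import shutil
from dataclasses import dataclass


@dataclass
class DiskReport:
    """Hypothetical message a node would send to the leader."""
    node_id: int
    free_bytes: int


def local_report(node_id: int, data_dir: str) -> DiskReport:
    """Measure free space on the filesystem that holds data_dir."""
    usage = shutil.disk_usage(data_dir)
    return DiskReport(node_id=node_id, free_bytes=usage.free)


def out_of_space_nodes(reports: list[DiskReport], min_free_bytes: int) -> list[int]:
    """Return the IDs of nodes whose free space is below the threshold."""
    return [r.node_id for r in reports if r.free_bytes < min_free_bytes]


def quorum_out_of_space(
    reports: list[DiskReport], cluster_size: int, min_free_bytes: int
) -> bool:
    """True if a majority of the cluster reports being out of space."""
    majority = cluster_size // 2 + 1
    return len(out_of_space_nodes(reports, min_free_bytes)) >= majority


if __name__ == "__main__":
    # Assumed threshold of 1 MiB; a real value would depend on segment size.
    threshold = 1024 * 1024
    reports = [
        DiskReport(node_id=1, free_bytes=0),
        DiskReport(node_id=2, free_bytes=4096),
        local_report(node_id=3, data_dir="."),
    ]
    if quorum_out_of_space(reports, cluster_size=3, min_free_bytes=threshold):
        full = out_of_space_nodes(reports, threshold)
        print(f"cluster unavailable: nodes {full} are out of disk space")
```

The point of the sketch is only the decision rule: the leader can turn otherwise silent unresponsiveness into an explicit error naming the affected nodes. How the reports would actually travel between nodes is left open here.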
-
It is not clear how to recover from a situation where the nodes run out of disk space.
There are cases where the nodes are seemingly operational but do not respond to any requests, and cases where the nodes do not start at all even after more disk space is added. We also have data corruption reports from users.
This error case can be easily reproduced with the dqlite-demo app pointed at a small partition.