Prompted by a potential death loop at nexus startup that we've encountered 2 or 3 times now, stemming from a combination of postgres backups and oasis node unavailability. See also pogreb (cache) troubleshooting. Partial Slack copy-paste follows.
The problematic situation is when
a) the pogreb store needs reindexing and
b) the sapphire or emerald nodes are unavailable.
Because of (b), nexus initialization keeps failing in a fatal way. (Nexus normally connects to the node lazily, but not for the runtimes, because there it uses the SDK in addition to the raw RPC connection.) But by the time that fatal error occurs, pogreb has already started the recovery process, including .bac creation. Then k8s keeps restarting nexus. At this point, if the runtime node became available, all would be good.
But it typically stays unavailable for a while, so the .bac files pile up, until eventually the pogreb recovery fails (because it cannot create backups) and nexus terminates before it even tries to connect to the runtime node. From that point on, every restart fails, so when the runtime node eventually comes back up, nexus never learns about it.
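The sequence above can be modeled with a toy simulation. This is only a sketch of the ordering problem, not nexus or pogreb code: `state`, `startup`, and especially `maxBackups` are made-up names, and `maxBackups` is merely a stand-in for whatever actually prevents pogreb from creating further backups.

```go
package main

import "fmt"

// state is a toy model of what persists on disk across restarts.
type state struct {
	backups    int // number of .bac files left behind so far
	maxBackups int // ASSUMPTION: stand-in for whatever blocks further backups
}

// startup models one startup attempt and reports how it ended.
func startup(s *state, nodeUp bool) string {
	// Step 1: store recovery runs first and needs to create a backup.
	if s.backups >= s.maxBackups {
		return "recovery failed" // the node is never contacted
	}
	s.backups++
	// Step 2: eager connection to the runtime node.
	if !nodeUp {
		return "node unavailable" // fatal; the backup from step 1 stays behind
	}
	return "ok"
}

func main() {
	s := &state{maxBackups: 3}
	for i := 1; i <= 5; i++ {
		nodeUp := i == 5 // the node comes back only for the fifth restart
		res := startup(s, nodeUp)
		fmt.Printf("restart %d: %s (backups=%d)\n", i, res, s.backups)
	}
}
```

In this model the last restart still fails even though the node is back, because the recovery step gives up before the connection is ever attempted.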
We should ideally stop using the SDK, or change it so it can connect lazily.
Unfortunately it is not simple, as far as I can tell. We initialize the SDK client in nexus/storage/oasis/nodeapi/history/runtime.go, which establishes a gRPC connection eagerly. We use the sdkClient for several queries (e.g. to send EVM queries), and the SDK does a lot of heavy lifting (data packing/formatting) for us there, unlike in consensus. So we cannot trivially rip out the SDK and use raw gRPC as we did in consensus. Teaching the SDK to use lazy gRPC connections is not trivial either: that logic is buried under several layers, so we would have to be careful about it.
Imported from https://app.clickup.com/t/8693p9juv
nexus/storage/oasis/nodeapi/history/runtime.go
Lines 27 to 33 in f6af89d