client-sdk: Connect to gRPC lazily #784

pro-wh · 2024-10-23T18:55:02Z

Imported from https://app.clickup.com/t/8693p9juv

Prompted by a potential death loop at nexus startup that we've encountered 2 or 3 times now, stemming from a combination of postgres backups and oasis node unavailability. See also pogreb (cache) troubleshooting. Partial Slack copy-paste follows.

The problematic situation is when

a) the pogreb store needs reindexing and

b) the sapphire or emerald nodes are unavailable.

Because of (b), nexus initialization keeps failing in a fatal way. (Nexus is lazy about connecting to the node, but not for the runtimes because it uses the SDK in addition to the raw RPC connection.) But by the time that fatal error occurs, pogreb has already started the recovery process, including .bac creation. Then k8s keeps restarting nexus. At this point, if the runtime node became available, all would be good.

But it typically stays unavailable for a while, so the .bac files pile up with each restart, until eventually the pogreb recovery fails (because it cannot create backups) and nexus terminates before it even tries to connect to the runtime node. At that point, every restart will fail, so when the runtime node eventually comes back up, nexus never learns about it.

We should ideally stop using the SDK, or change it so it can connect lazily.

Unfortunately it is not simple, as far as I can tell. We initialize the SDK client at

nexus/storage/oasis/nodeapi/history/runtime.go

Lines 27 to 33 in f6af89d

    
    sdkConn, err := connections.SDKConnect(ctx, record.ChainContext, archiveConfig.ResolvedRuntimeNode(runtime), fastStartup)
    if err != nil {
        return nil, err
    }
    sdkClient := sdkConn.Runtime(sdkPT)
    rawConn := connections.NewLazyGrpcConn(*archiveConfig.ResolvedRuntimeNode(runtime))
    apis[record.ArchiveName] = nodeapi.NewUniversalRuntimeApiLite(sdkPT.Namespace(), rawConn, &sdkClient)

, which establishes a gRPC connection eagerly. We use the sdkClient for several queries (e.g. to send EVM queries), and the SDK does a lot of heavy lifting (data packing/formatting) for us there, unlike in consensus. So we cannot trivially rip out the SDK and use the raw gRPC like we did in consensus; and if we wanted to teach the SDK to use lazy gRPC connections, that's buried under several layers too so we'd have to be careful about it.
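For discussion, here is a minimal sketch of the "connect lazily" idea in isolation. It is not the SDK's or nexus's actual code: `Conn`, `Dialer`, `LazyConn`, `ErrUnavailable`, `fakeConn`, and `demo` are all hypothetical names, and the real SDK client has a much larger surface than a single `Query` method. The point is only that construction never dials, the first query does, and a failed dial is not cached, so a later query retries once the node is back.

```go
package main

import (
	"context"
	"errors"
	"sync"
)

// ErrUnavailable is a hypothetical stand-in for "runtime node unreachable".
var ErrUnavailable = errors.New("node unavailable")

// Conn is a hypothetical stand-in for an established client connection.
type Conn interface {
	Query(ctx context.Context, method string) (string, error)
}

// Dialer is a hypothetical function that eagerly establishes a connection.
type Dialer func(ctx context.Context) (Conn, error)

// LazyConn wraps a Dialer so that dialing happens on first use, not at
// construction time.
type LazyConn struct {
	mu   sync.Mutex
	dial Dialer
	conn Conn
}

// NewLazyConn never dials, so it cannot fail when the node is down.
func NewLazyConn(dial Dialer) *LazyConn {
	return &LazyConn{dial: dial}
}

// get returns the cached connection, dialing if there is none yet.
// A failed dial is not cached, so the next call tries again.
func (l *LazyConn) get(ctx context.Context) (Conn, error) {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.conn != nil {
		return l.conn, nil
	}
	c, err := l.dial(ctx)
	if err != nil {
		return nil, err
	}
	l.conn = c
	return c, nil
}

// Query connects on demand and forwards the call.
func (l *LazyConn) Query(ctx context.Context, method string) (string, error) {
	c, err := l.get(ctx)
	if err != nil {
		return "", err
	}
	return c.Query(ctx, method)
}

// fakeConn is a test double that echoes the method name.
type fakeConn struct{}

func (fakeConn) Query(_ context.Context, method string) (string, error) {
	return "ok:" + method, nil
}

// demo runs four queries against a dialer that fails twice and then succeeds.
// It returns the dial count right after construction, the per-query results,
// and the final dial count.
func demo() (dialsAtStart int, results []string, dials int) {
	dial := func(ctx context.Context) (Conn, error) {
		dials++
		if dials <= 2 {
			return nil, ErrUnavailable
		}
		return fakeConn{}, nil
	}
	lc := NewLazyConn(dial)
	dialsAtStart = dials
	for i := 0; i < 4; i++ {
		out, err := lc.Query(context.Background(), "status")
		if err != nil {
			out = "error: " + err.Error()
		}
		results = append(results, out)
	}
	return dialsAtStart, results, dials
}
```

Calling `demo()` shows the intended behavior: no dial at construction, two failed queries while the node is "down", then successful queries that reuse a single connection. The hard part in practice is still what the paragraph above describes: the eager dial sits several layers down inside the SDK, so a wrapper like this would have to be threaded through wherever the SDK's runtime client is built.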
