Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sporadic error call on null client for first RPC call after upgrading from 3.0.0-alpha.30 to 3.0.1-alpha.1, ~1% chance #572

Open
mologie opened this issue Jun 13, 2024 · 3 comments
Labels

Comments

@mologie
Copy link

mologie commented Jun 13, 2024

After updating to 3.0.1-alpha.1 from 3.0.0-alpha.30, a load test of mine now fails occasionally with error schema/avas/corehost/corehost.capnp:CoreFactory.buildInfo: call on null client.

I am trying to build a minimal reproducer to narrow this down and post it here. (But that might take some time, I'm deferring to update for now.) In the meantime I'd leave the following details here in case someone else stumbles over the issue:

  • The environment is a Go HTTP server, which connects to a C++ capnp RPC server within a HTTP request.
  • 0.1-2% of requests fail with capnp 3.0.1-alpha.1 regardless for request rate or parallel requests (high variance in failure rate). More requests make it fail more often, linear correlation.
  • The affected code is heavily used in production with tens of thousands of daily connections, without issues with capnp 3.0.0-alpha.30.
  • No errors are logged to Logger.

The affected code is as simple as the following after a successful connection:

// (establishment + error checks for tcpConn omitted)
conn := rpc.NewConn(rpc.NewStreamTransport(tcpConn), &rpc.Options{
	Logger: &rpcErrorReporter{sess: s},
})
factory := corehost.CoreFactory(conn.Bootstrap(s.ctx))
buildInfoFuture, buildInfoRelease := factory.BuildInfo(s.ctx, nil)
defer buildInfoRelease()
buildInfo, err := buildInfoFuture.Struct() // err is not nil for 0.2-2% of requests
@mologie
Copy link
Author

mologie commented Jun 13, 2024

tl;dr: There are at least two commits that break this.

I have bisected this and found that commit 2815a6a introduces the error. 91bffc5 (which merged 2815a6a, other commits of that MR are docs changes) nets me "call on null client" errors, 91bffc5^ (which incidentally is 3.0.0-alpha.30) does not.

Reverting 91bffc5 on current main (2f9aa4f) is free of conflicts but fails to compile:

go-capnp/rpc/rpc.go:378:14: bc.c.State undefined (type capnp.Client has no field or method State)

I fixed it like so, but this is a pure guess and I have not validated anything about this and have no idea what I'm doing:

diff --git a/rpc/rpc.go b/rpc/rpc.go
index 48b3e86e2d2d37c2c1285036a123a4a130155cda..5fdb3cbdadd51dcf8b23172a40e33339a548359b 100644
--- a/rpc/rpc.go
+++ b/rpc/rpc.go
@@ -375,7 +375,7 @@ func (bc bootstrapClient) Recv(ctx context.Context, r capnp.Recv) capnp.Pipeline
 }
 
 func (bc bootstrapClient) Brand() capnp.Brand {
-	return bc.c.State().Brand
+	return bc.c.Snapshot().Brand()
 }
 
 func (bc bootstrapClient) Shutdown() {

However this now nets me an even higher error rate than the original change, so another commit between the bit that fixed it and current main must introduce a similar error. When starting to rebase, the 2nd merge cab1c2e already reintroduced the error, and reverting it is not possible due to major changes in capability.go.

I stopped at that point, would rather have a reproducer on top of main and work backwards from there.

@mologie
Copy link
Author

mologie commented Jun 13, 2024

Running tests on main nets me the following:

--- FAIL: TestPromiseOrdering (0.08s)
    senderpromise_test.go:404:
        	Error Trace:	/Users/okuckertz/Devel/go-capnp/rpc/senderpromise_test.go:404
        	Error:      	Received unexpected error:
        	            	test.capnp:PingPong.echoNum: call on null client
        	Test:       	TestPromiseOrdering
        	Messages:   	call #0 should succeed
FAIL
FAIL	capnproto.org/go/capnp/v3/rpc	0.550s

So this is probably related to or a duplicate of #543 (probably not a duplicate, test results differ, see my comment in the other issue).

@lthibault
Copy link
Collaborator

@mologie Firstly, thank you so much for your help on this issue.

Can you try resolving the capability (factory.Resolve(ctx) ought to do it) before making any RPC calls? I suspect this will suffice as a short-term workaround, although we'll obviously want to tackle the underlying issue.

As far as the underlying issue is concerned, Zenhack suggested that his prior work on WeakRef may have uncovered the underlying bug, so that might be a good thread to follow.

Please feel free to join our Matrix community if you need any support. I'm happy to assist however I can.

@lthibault lthibault added the bug label Jun 17, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants