
Handle incoming resolve messages. #480

Closed
wants to merge 36 commits

Conversation


zenhack (Contributor) commented Mar 16, 2023

Still TODO:

  • We need to handle disembargos with target = (importedCap = ...).
  • Testing.

zenhack marked this pull request as draft March 16, 2023 22:59
The := below is sufficient, since this isn't actually used before that.
This caught a bug: we need to check if ClientState.Metadata is nil,
which is possible if the client itself is null, and seems to happen to
the promise after it is resolved.
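
For illustration, a minimal sketch of the guard this commit describes, assuming the capnp.Client.State()/Metadata API from this library; touchMetadata is a hypothetical helper:

import capnp "capnproto.org/go/capnp/v3"

// touchMetadata is a hypothetical helper showing the guard: a null client
// can yield a ClientState whose Metadata is nil, so check before locking
// rather than assuming it is always present.
func touchMetadata(client capnp.Client) {
	state := client.State()
	if state.Metadata == nil {
		return // null client, e.g. a promise that has resolved to null
	}
	state.Metadata.Lock()
	defer state.Metadata.Unlock()
	// ... safe to read or update per-client metadata here ...
}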

TODO: we should de-dup the logic between releaseExport and
releaseExports. This is not entirely trivial though, because the
latter is executed after we've wiped the exports table.
zenhack marked this pull request as ready for review March 19, 2023 01:49
zenhack requested a review from lthibault March 19, 2023 01:50
zenhack changed the title from "WIP: handle incoming resolve messages." to "Handle incoming resolve messages." Mar 19, 2023
zenhack commented Mar 19, 2023

Alright, this is done.

This basically completes the promise resolution detour from 3PH, though there's a semantic wrinkle I don't like: Fulfill(client) feels like it should take ownership of client, but doesn't. I may submit another PR changing this before moving on.
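
To make the distinction concrete, here is a sketch of the two conventions, using the NewLocalPromise/Resolver API that appears in the test diff further down. fulfillExample and its arguments are hypothetical, and imports are elided:

func fulfillExample(ctx context.Context, conn *rpc.Conn) {
	p, r := capnp.NewLocalPromise[testcapnp.PingPong]()
	defer p.Release()

	bs := testcapnp.PingPong(conn.Bootstrap(ctx))

	// Today: Fulfill borrows bs, so the caller keeps ownership and must
	// still release its own reference afterwards.
	r.Fulfill(bs)
	bs.Release()

	// Under take-ownership semantics, Fulfill(bs) would consume the
	// reference, and the explicit bs.Release() above would go away.
}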

lthibault (Collaborator) commented:
I'm very much in favor of taking the time to make reference ownership semantics consistent. 👍

lthibault previously approved these changes Mar 20, 2023

lthibault left a review comment:

LGTM.

lthibault commented:
Test is timing out.

Re-running.

zenhack commented Mar 20, 2023

Note that there does appear to be a real deadlock that needs to be addressed; the test failures are not spurious.

...now that the resolver takes ownership.
lthibault previously approved these changes Mar 21, 2023
zenhack commented Mar 22, 2023

Status update: there are at least two bugs: one is a deadlock, and the other is a message-reordering problem.

The deadlock is "solved" by this change to the test:

diff --git a/rpc/senderpromise_test.go b/rpc/senderpromise_test.go
index 3bcaec4..7cb5ea2 100644
--- a/rpc/senderpromise_test.go
+++ b/rpc/senderpromise_test.go
@@ -392,6 +392,7 @@ func TestPromiseOrdering(t *testing.T) {
 		if i == 100 {
 			go func() {
 				bs := testcapnp.PingPong(c1.Bootstrap(ctx))
+				assert.NoError(t, bs.Resolve(ctx))
 				r.Fulfill(bs)
 			}()
 		}

So the problem, I think, arises when a promise resolves to another promise (in this case a question; I'm not sure whether that matters).
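
In other words, the failing shape seems to be a promise whose resolution is itself an unresolved promise. Roughly (hypothetical sketch with imports elided; in the test the second promise is actually a bootstrap question on the connection, but the chain shape is the same):

func chainSketch(settled testcapnp.PingPong) {
	p1, r1 := capnp.NewLocalPromise[testcapnp.PingPong]()
	p2, r2 := capnp.NewLocalPromise[testcapnp.PingPong]()
	defer p1.Release()
	defer p2.Release()

	r1.Fulfill(p2) // p1 now resolves to p2, which is itself unresolved

	// Only later does the chain bottom out in a settled client; the
	// bs.Resolve(ctx) added in the diff above sidesteps this by waiting
	// for settlement before calling Fulfill.
	r2.Fulfill(settled)
}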

I'm attaching a log for the ordering problem. Need to dig in still.
log.txt

zenhack commented Mar 23, 2023

Solved the deadlock (kinda... see the commit message), still need to work out the ordering problem.

Worryingly, this was manifesting in the test as a deadlock: we hit the
error complaining about it not being an import, but then connection
shutdown hung, waiting on tasks. I haven't pinned down exactly what was
going on there, but this sidesteps the issue by fixing the thing that
was causing a connection abort in the first place.
lthibault previously approved these changes Mar 23, 2023
lthibault commented:
@zenhack Do we have an issue open for the shutdown bug?

zenhack commented Mar 23, 2023

Not that I know of; I have no memory of having seen something like this before.

zenhack commented Mar 30, 2023

I merged main back in, and also reworked the way NewLocalPromise works to avoid some dodgy code that I think has race conditions. I held out some hope that this would fix the ordering bug, but it seems to have had no effect.

zenhack commented Mar 30, 2023

Discussed some on Matrix, but for posterity: it's not clear to me that the tribble 4-way race condition actually requires three distinct vats, so it could be a problem even in a level 1 implementation, if each link in the promise chain crosses the network. The failure we're seeing matches this description, so I am wondering if it is a symptom of the 4-way race condition. The flow looks something like this (sketched in code after the list):

  • Vat A requests vat B's bootstrap interface, and starts making calls on it (before receiving the return). The question for the bootstrap is the first promise.
  • Vat B returns a senderPromise as its bootstrap.
  • At some point later, vat B requests vat A's bootstrap interface and fulfills the promise it sent with the result. Note that the bug crops up regardless of whether vat B waits for the return for the bootstrap message: if it does not, there are three promises, but even if it does, there are still two.
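
A rough two-vat reproduction of that flow, over an in-memory pipe. This is a hedged sketch, not code from the PR: the rpc/testcapnp names follow senderpromise_test.go, the testcapnp import path is assumed, reproSketch is hypothetical, and error handling is elided.

import (
	"context"
	"net"

	capnp "capnproto.org/go/capnp/v3"
	"capnproto.org/go/capnp/v3/rpc"
	testcapnp "capnproto.org/go/capnp/v3/rpc/internal/testcapnp" // path assumed
)

func reproSketch(ctx context.Context) {
	left, right := net.Pipe()

	// Vat B's bootstrap capability is a sender promise.
	promise, resolver := capnp.NewLocalPromise[testcapnp.PingPong]()
	connB := rpc.NewConn(rpc.NewStreamTransport(right), &rpc.Options{
		BootstrapClient: capnp.Client(promise),
	})
	defer connB.Close()

	connA := rpc.NewConn(rpc.NewStreamTransport(left), nil)
	defer connA.Close()

	// Vat A: the bootstrap question is the first promise, and calls are
	// pipelined on it before the return arrives.
	bs := testcapnp.PingPong(connA.Bootstrap(ctx))
	defer bs.Release()
	future, release := bs.EchoNum(ctx, nil)
	defer release()

	// Vat B fulfills its promise with vat A's bootstrap: a promise chain
	// crossing the network with only two vats involved.
	resolver.Fulfill(testcapnp.PingPong(connB.Bootstrap(ctx)))
	_, _ = future.Struct()
}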

I'm going to put this PR down until I've more deeply understood the implications. I still haven't totally grokked what exactly goes wrong in the 4-way race condition, only that there is an assumption being violated (i.e. I don't know what an example failure due to this issue should look like). I plan on figuring this out, and probably also stepping back to fix the way we do promise resolution in the library; regardless of whether this hunch is correct, we'll need to do that for 3PH.

zenhack commented Mar 31, 2023

Per #494 (comment), my suspicion was correct: 3PH is not required for the race, and it seems like a good bet that that's what's causing the failure. So I'll work on that refactor before coming back to this.

...instead of a Client. This is a step towards addressing the tribble
4-way race condition. Currently there are some test failures.
The call to waitRef.Release() in export.go was resolving the receiver
too soon, resulting in a leak.
zenhack commented May 27, 2023

I've merged in main and #520; I'm still reconciling the latter.

zenhack added 14 commits May 27, 2023 00:06
Has build errors that need fixing.
I plan on adding *Snapshot versions and eventually storing snapshots
internally.
This seems like an unnecessary complication, and I don't want to deal
with it when porting stuff over to snapshots.
...and use it to catch a bug; the test is failing, need to track it
down.
Otherwise the target message steals the caps, resulting in errors
later when we try to read them.
Per capnproto#523, this is currently failing frequently. Mark it as flaky
until we're ready to actually deal with it.
In addition to letting us hand out borrowed references, storing the
snapshots will facilitate code that needs them not to resolve further
after AddSnapshot() or SetSnapshot().
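
For context, a rough sketch of the borrowed-snapshot pattern these commits move toward, assuming capnp.Client.Snapshot() and ClientSnapshot from this library; snapshotExample is hypothetical, and the AddSnapshot()/SetSnapshot() methods the message mentions are this PR's own additions, not shown here:

// snapshotExample is a hypothetical illustration: a snapshot is a
// point-in-time view of a client, so code holding one is insulated from
// further resolution of the underlying promise.
func snapshotExample(client capnp.Client) capnp.Client {
	snap := client.Snapshot() // borrowed view; client remains owned by caller
	defer snap.Release()

	// Materialize a fresh owned reference from the snapshot when a live
	// Client is actually needed.
	return snap.Client()
}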
zenhack commented Jun 17, 2023

Superseded by #530, closing.

zenhack closed this Jun 17, 2023