Distributed deadlock possible when peer lists are out of sync #72

Open
justinrixx opened this issue Sep 1, 2024 · 0 comments

justinrixx commented Sep 1, 2024

Hi there.

This is an edge case I ran into which can cause requests for a specific key to hang forever without resolving. The peer list is a consistent hash based on pool membership, so when pool membership changes, ownership of a key may move from one peer to another. This can happen in many scenarios: a rolling restart as part of a deployment, scaling out/in, etc.
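
To make the ownership shift concrete, here is a minimal sketch using the consistenthash package from the original golang/groupcache project (this fork keeps an equivalent package). Whether a particular key actually moves depends on the hash; the point is only that the mapping can change whenever the peer set does:

```go
package main

import (
	"fmt"

	"github.com/golang/groupcache/consistenthash"
)

func main() {
	// Key ownership with the full pool: peers A, B, and C.
	before := consistenthash.New(50, nil)
	before.Add("A", "B", "C")

	// Key ownership after peer C leaves the pool.
	after := consistenthash.New(50, nil)
	after.Add("A", "B")

	key := "xyz"
	fmt.Printf("before: peer %s owns %q\n", before.Get(key), key)
	fmt.Printf("after:  peer %s owns %q\n", after.Get(key), key)
}
```

If a request comes in at just the right time and you're unlucky, it may go something like this: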

  • there are 3 peers: A, B, and C
  • object xyz is owned by peer A based on the hash, though it is not currently cached (e.g. hasn't been requested in a while)
  • peer C leaves the pool, and object xyz should now be owned by peer B, based on the new distribution
  • peer A updates its peer list (e.g. by being notified by consul or zookeeper)
  • a request comes in to peer A for object xyz
  • peer A does not have the object cached locally
  • peer A determines that the object is owned by peer B and makes a crosstalk request for it
  • peer B has not updated its peer list yet; it still thinks peer A owns object xyz
  • peer B makes a crosstalk request to peer A for object xyz
  • peer A receives the crosstalk request from peer B; the singleflight group recognizes the object is already being retrieved, so it waits
  • peer B updates its peer list
  • peer A and peer B are now each waiting on the other: any further requests that come in to either peer for object xyz will ultimately end up waiting on the crosstalk request to complete due to the singleflight group (hopefully there's a timeout on the context!); see the sketch just below
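
Here is a minimal, runnable sketch of the coalescing behavior that produces the hang. It uses golang.org/x/sync/singleflight for illustration (groupcache carries its own internal singleflight package, but the semantics are the same): the first caller stands in for peer A's outbound crosstalk request, which never completes because peer B is in turn waiting on peer A; the second caller stands in for the ingress crosstalk request from peer B, which gets coalesced onto the stuck flight.

```go
package main

import (
	"fmt"
	"time"

	"golang.org/x/sync/singleflight"
)

func main() {
	var g singleflight.Group

	// Caller 1: peer A's outbound crosstalk request for "xyz". It never
	// returns, standing in for the circular wait on peer B.
	go g.Do("xyz", func() (interface{}, error) {
		select {} // block forever
	})

	time.Sleep(100 * time.Millisecond) // let the first flight register

	// Caller 2: the ingress crosstalk request from peer B for the same key.
	// singleflight sees an in-flight retrieval of "xyz" and waits on it
	// rather than loading the value itself.
	done := make(chan struct{})
	go func() {
		_, _, _ = g.Do("xyz", func() (interface{}, error) { return nil, nil })
		close(done)
	}()

	select {
	case <-done:
		fmt.Println("completed") // never reached
	case <-time.After(time.Second):
		fmt.Println("both callers are stuck waiting on key \"xyz\"")
	}
}
```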

The solution to this is to be a little smarter when handling crosstalk requests. As written, the code prioritizes making the request to the correct peer over all else. Instead, I propose adding an additional rule: if a request for an object came in on the crosstalk endpoint, it's because a peer believes I own the object. Regardless of who I believe owns it, I should not make another crosstalk request.

Or more succinctly: requests which came in on the crosstalk ingress must not make egress crosstalk requests.

Please have a look at this README and code, where I have written up a workaround for this issue.

I propose that a fix be put in place within this project, though, so that the workaround is not needed. One idea is an alternative version of group.Get() which takes a boolean flag controlling whether egress crosstalk requests are allowed. That alternative version would be called in the HTTP handler here; a rough sketch of the idea is below.
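
Sketched concretely, the proposal could look something like the following. This is not the project's actual code: the internal get helper, the allowPeerFetch parameter, and the GetNoPeer name are all hypothetical stand-ins for whatever shape fits this codebase.

```go
// Get keeps its existing behavior: on a local cache miss it may make an
// egress crosstalk request to the peer that owns the key.
func (g *Group) Get(ctx context.Context, key string, dest Sink) error {
	return g.get(ctx, key, dest, true /* allowPeerFetch (hypothetical) */)
}

// GetNoPeer behaves like Get but never makes an egress crosstalk request;
// on a local miss it loads the value itself. The HTTP handler serving
// ingress crosstalk requests would call this instead of Get, which breaks
// the A-waits-on-B-waits-on-A cycle described above.
func (g *Group) GetNoPeer(ctx context.Context, key string, dest Sink) error {
	return g.get(ctx, key, dest, false)
}
```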

If the proposed fix sounds reasonable and worth pursuing to you, I would be willing to put together a PR.

Please feel free to ask anything you need clarification on. This edge case had me pulling my hair out and I'd like to save others the pain. The original golang/groupcache project is not active, and this is the most active fork, which is why I've come here.
