-
Notifications
You must be signed in to change notification settings - Fork 251
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
zookeeper watcher won't update haproxy config when adding/removing new node #137
Comments
Thanks for the bug report. We definitely rely on nerve/zookeeper/synapse at Yelp to manage 100s of constantly changing services, and I'm fairly sure Airbnb depends on it in production as well. Let's dig into this. First off how are you starting Synapse and do you have any output or logs from the Synapse startup? I'd expect to see some stdout that might be helpful in diagnosing what's going on. |
thanks very much for the response, it looks like this issue is a duplication of #134, the ruby version i am using is ruby 2.1.2p95 (2014-05-08) [x86_64-linux-gnu], synapse hung when unsubscribing from zookeeper, is this a known issue that synapse doesn't work with this ruby version? |
I thought that it would work with 2.1, but that may be enough information for me to reproduce tomorrow. If you could paste the stdout you're seeing that would probably help save me some time debugging it. We actually run 1.9 in production, so it might just be 2.X issues. |
Can confirm that synapse works with ruby 1.9.3 and doesn't with 2.x Possibly relevant datapoint: if I set DEBUG to true on 2.x.x (to see all the log messages), I don't get the "setting watch at...." log message until after the first new nerve is started..... |
thanks for the update, here are logs that I collected, hope it helps I, [2015-09-01T15:41:12.843539 #29063] INFO -- Synapse::ZookeeperWatcher: start unsubscribe, expect another log entry for its completion |
That's interesting. I wonder if the per_callback threads are the problem. |
Manual testing shows that removing the :thread => :per_callback parameter from the zookeeper connection seems to solve the problem for me on 2.1.2. That's unfortunate because per_callback delivery has significantly increased the responsiveness of synapse in production for us. I'm going to dig into why per_callback is broken on 2.X |
Definitely something funky going on. As the error indicates, the event delivery thread is trying to join itself. I believe that the issue is basically that we are calling unsubscribe while the event delivery thread is running, which calls the event delivery thread's shutdown method which calls join on itself. I suppose the "strange" thing we're doing is calling watch from within the callback itself. I'll keep poking at this. |
Yea, this is due to an API change in Thread.join between 1.9.X and 2.X, namely that if you try to join with a timeout on the current thread in Ruby 2.X you get and exception but in 1.9.X you just get nil
The issue is that synapse calls unsubscribe from within the callback. This causes the following fun call chain: I can see a few potential solutions:
I am pursuing the second solution for now. Given my understanding of zk-ruby I think that the thread.join is to be expected. |
@igor47, @rickypai I'm actually thinking the better solution is to just not call unsubscribe at all. The wiki page over at https://github.com/zk-ruby/zk/wiki/Events#subscriptions warns against it because your callback will keep getting called, but this is like exactly what Synapse wants to happen. I'll have a PR shortly that just sets the watch once. I believe this is safe because if the path goes away, synapse will crash. The only thing that might go wrong afaict is if we don't trust zk-ruby to actually have a long running watch. @igor47 is that why you guys reset the watch every time? |
Thanks for digging into this. Igor would have better idea about what's the proper fix is for Synapse, but I can look into the fix on the zk side (or at least communicate it) |
@rickypai I don't actually think there is anything wrong with zk-ruby, the semantics seem correct to me. The only potential fix would be for zk-ruby to catch the join on self exception over at https://github.com/zk-ruby/zk/blob/e3274c17e61fd2d0a65fc8d024f63026c07c07b3/lib/zk/threaded_callback.rb#L44. I think it's probably incorrect in general to call unsubscribe from within a callback. If we don't like my proposed fix I can do some shuffling of events on queues somewhere in the zookeeper watcher. |
i think we probably just added the unsubscribe because we misunderstood the zk docs; if the subscription will continue to notify, i think we're okay to avoid clearing/re-setting it. |
@jolynch I think the code should probably accommodate the case for Ruby 2+ thread exceptions so it properly logs the error and returns the state. |
Had anyone ever get synapse+zookeeper+nerve working prperly?
I followed the document on this repo, and encountered a couple of issues:
here is my setup,
zookeeper is listening on port 2181, when I start synapse, there is no service running, there is 0 backend entries in haproxy config, then I bring up a few services, I can see new entries get added to zookeeper by nerve but it seems synapse isn't get notified of this, or it didn't do anything to add the new service to haproxy.
Can anyone please suggest where I am doing wrong? thanks.
The text was updated successfully, but these errors were encountered: