Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

zookeeper watcher won't update haproxy config when adding/removing new node #137

Closed
hankya opened this issue Aug 31, 2015 · 14 comments · Fixed by #138
Closed

zookeeper watcher won't update haproxy config when adding/removing new node #137

hankya opened this issue Aug 31, 2015 · 14 comments · Fixed by #138

Comments

@hankya
Copy link

hankya commented Aug 31, 2015

Had anyone ever get synapse+zookeeper+nerve working prperly?

I followed the document on this repo, and encountered a couple of issues:

  • synapse works with yaml config file however the json format config file is provided
  • there is a few config items that should have been array however it is a single line in example config
  • finally i was able to fix above problem, however, when taking a service online or offline, the change does not reflect in haproxy

here is my setup,

  • synapse config
 haproxy:
  reload_command: "sudo service haproxy reload"
  config_file_path: "/etc/haproxy/haproxy.cfg"
  socket_file_path: "/var/haproxy/haproxy.sock"
  do_writes: True
  do_reloads: True
  do_socket: True
  global:
   - "daemon"
   - "user    haproxy"
   - "group   haproxy"
   - "maxconn 4096"
   - "log     127.0.0.1 local2 notice"
   - "stats   socket /var/haproxy/haproxy.sock mode 666 level admin"
  defaults:
   - "log      global"
   - "balance  roundrobin"
   - "timeout connect 5s"
   - "timeout client 1m"
   - "timeout server 1m"
  extra_sections:
   "listen stats :3212":
    - "mode http"
    - "stats enable"
    - "stats uri /"
    - "stats refresh 5s"
 services:
  hellox:
   discovery:
    method: "zookeeper"
    path: "/nerve/services/hello/services"
    hosts:
     - "localhost:2181"
   haproxy:
    port: 3214
    server_options: "check inter 2s rise 3 fall 2"
    listen:
     - "mode http"
  • nerve config
{
  "instance_id": "localhost",
  "services": {
    "hello": {
      "host": "localhost",
      "port": 3000,
      "reporter_type": "zookeeper",
      "zk_hosts": ["localhost:2181"],
      "zk_path": "/nerve/services/hello/services",
      "check_interval": 2,
      "checks": [
        {
          "type": "http",
          "uri": "/health",
          "timeout": 0.2,
          "rise": 3,
          "fall": 2
        }
      ]
    },
    "hello2": {
      "host": "localhost",
      "port": 2999,
      "reporter_type": "zookeeper",
      "zk_hosts": ["localhost:2181"],
      "zk_path": "/nerve/services/hello/services",
      "check_interval": 2,
      "checks": [
        {
          "type": "http",
          "uri": "/health",
          "timeout": 0.2,
          "rise": 3,
          "fall": 2
        }
      ]
    },
    "hello3": {
      "host": "localhost",
      "port": 2998,
      "reporter_type": "zookeeper",
      "zk_hosts": ["localhost:2181"],
      "zk_path": "/nerve/services/hello/services",
      "check_interval": 2,
      "checks": [
        {
          "type": "http",
          "uri": "/health",
          "timeout": 0.2,
          "rise": 3,
          "fall": 2
        }
      ]
    },
    "hello4": {
      "host": "localhost",
      "port": 2997,
      "reporter_type": "zookeeper",
      "zk_hosts": ["localhost:2181"],
      "zk_path": "/nerve/services/hello/services",
      "check_interval": 2,
      "checks": [
        {
          "type": "http",
          "uri": "/health",
          "timeout": 0.2,
          "rise": 3,
          "fall": 2
        }
      ]
    }
  }
}

zookeeper is listening on port 2181, when I start synapse, there is no service running, there is 0 backend entries in haproxy config, then I bring up a few services, I can see new entries get added to zookeeper by nerve but it seems synapse isn't get notified of this, or it didn't do anything to add the new service to haproxy.

Can anyone please suggest where I am doing wrong? thanks.

@jolynch
Copy link
Collaborator

jolynch commented Aug 31, 2015

Thanks for the bug report. We definitely rely on nerve/zookeeper/synapse at Yelp to manage 100s of constantly changing services, and I'm fairly sure Airbnb depends on it in production as well.

Let's dig into this. First off how are you starting Synapse and do you have any output or logs from the Synapse startup? I'd expect to see some stdout that might be helpful in diagnosing what's going on.

@hankya
Copy link
Author

hankya commented Aug 31, 2015

thanks very much for the response, it looks like this issue is a duplication of #134, the ruby version i am using is ruby 2.1.2p95 (2014-05-08) [x86_64-linux-gnu], synapse hung when unsubscribing from zookeeper, is this a known issue that synapse doesn't work with this ruby version?

@jolynch
Copy link
Collaborator

jolynch commented Aug 31, 2015

I thought that it would work with 2.1, but that may be enough information for me to reproduce tomorrow. If you could paste the stdout you're seeing that would probably help save me some time debugging it. We actually run 1.9 in production, so it might just be 2.X issues.

@drewcsillag
Copy link

Can confirm that synapse works with ruby 1.9.3 and doesn't with 2.x

Possibly relevant datapoint: if I set DEBUG to true on 2.x.x (to see all the log messages), I don't get the "setting watch at...." log message until after the first new nerve is started.....

@hankya
Copy link
Author

hankya commented Sep 1, 2015

thanks for the update, here are logs that I collected, hope it helps

I, [2015-09-01T15:41:12.843539 #29063] INFO -- Synapse::ZookeeperWatcher: start unsubscribe, expect another log entry for its completion
E, [2015-09-01T15:41:12.843689 #29063] ERROR -- Synapse::ZookeeperWatcher: Target thread must not be current thread (ThreadError)
/var/lib/gems/2.1.0/gems/zk-1.9.5/lib/zk/threaded_callback.rb:44:in join' /var/lib/gems/2.1.0/gems/zk-1.9.5/lib/zk/threaded_callback.rb:44:inshutdown'
/var/lib/gems/2.1.0/gems/zk-1.9.5/lib/zk/event_handler_subscription/actor.rb:42:in unregister' /var/lib/gems/2.1.0/gems/zk-1.9.5/lib/zk/subscription.rb:45:inunsubscribe'
/var/lib/gems/2.1.0/gems/synapse-0.12.1/lib/synapse/service_watcher/zookeeper.rb:89:in watch' /var/lib/gems/2.1.0/gems/synapse-0.12.1/lib/synapse/service_watcher/zookeeper.rb:108:inblock in watcher_callback'
/var/lib/gems/2.1.0/gems/zk-1.9.5/lib/zk/threaded_callback.rb:146:in call' /var/lib/gems/2.1.0/gems/zk-1.9.5/lib/zk/threaded_callback.rb:146:indispatch_thread_body'
/var/lib/gems/2.1.0/gems/logging-1.8.2/lib/logging/diagnostic_context.rb:323:in call' /var/lib/gems/2.1.0/gems/logging-1.8.2/lib/logging/diagnostic_context.rb:323:inblock in create_with_logging_context'

@jolynch
Copy link
Collaborator

jolynch commented Sep 3, 2015

That's interesting. I wonder if the per_callback threads are the problem.

@jolynch
Copy link
Collaborator

jolynch commented Sep 3, 2015

Manual testing shows that removing the :thread => :per_callback parameter from the zookeeper connection seems to solve the problem for me on 2.1.2.

That's unfortunate because per_callback delivery has significantly increased the responsiveness of synapse in production for us. I'm going to dig into why per_callback is broken on 2.X

@jolynch
Copy link
Collaborator

jolynch commented Sep 3, 2015

Definitely something funky going on. As the error indicates, the event delivery thread is trying to join itself. I believe that the issue is basically that we are calling unsubscribe while the event delivery thread is running, which calls the event delivery thread's shutdown method which calls join on itself.

I suppose the "strange" thing we're doing is calling watch from within the callback itself. I'll keep poking at this.

@jolynch
Copy link
Collaborator

jolynch commented Sep 3, 2015

Yea, this is due to an API change in Thread.join between 1.9.X and 2.X, namely that if you try to join with a timeout on the current thread in Ruby 2.X you get and exception but in 1.9.X you just get nil

~/pg/synapse-airbnb(branch:fix_ruby_2_x_support) » rbenv shell 2.1.2                                                                          
------------------------------------------------------------
~/pg/synapse-airbnb(branch:fix_ruby_2_x_support) » rbenv rehash                                                                               
------------------------------------------------------------
~/pg/synapse-airbnb(branch:fix_ruby_2_x_support) » ruby --version                                                                             
ruby 2.1.2p95 (2014-05-08 revision 45877) [x86_64-darwin14.0]
------------------------------------------------------------
~/pg/synapse-airbnb(branch:fix_ruby_2_x_support) » irb --version                                                                              
irb 0.9.6(09/06/30)
------------------------------------------------------------
~/pg/synapse-airbnb(branch:fix_ruby_2_x_support) » irb                                                                                        
irb(main):001:0> Thread.current.join(1)
ThreadError: Target thread must not be current thread
    from (irb):1:in `join'
    from (irb):1
    from /Users/jlynch/.rbenv/versions/2.1.2/bin/irb:11:in `<main>'
irb(main):002:0> exit
------------------------------------------------------------
~/pg/synapse-airbnb(branch:fix_ruby_2_x_support) » rbenv shell 1.9.3-p448                                                                     
------------------------------------------------------------
~/pg/synapse-airbnb(branch:fix_ruby_2_x_support) » ruby --version                                                                             
ruby 1.9.3p448 (2013-06-27 revision 41675) [x86_64-darwin14.4.0]
------------------------------------------------------------
~/pg/synapse-airbnb(branch:fix_ruby_2_x_support) » irb                                                                                        
irb(main):001:0> Thread.current.join(1)
=> nil
irb(main):002:0>
--------------------

The issue is that synapse calls unsubscribe from within the callback. This causes the following fun call chain:
Actor.unregister =>
=> https://github.com/zk-ruby/zk/blob/e3274c17e61fd2d0a65fc8d024f63026c07c07b3/lib/zk/event_handler_subscription/actor.rb#L42
=> https://github.com/zk-ruby/zk/blob/e3274c17e61fd2d0a65fc8d024f63026c07c07b3/lib/zk/threaded_callback.rb#L44
=> Thread.join on current thread with a timeout

I can see a few potential solutions:

  1. Turn off per_callback threading on the underlying connection (this is very sad when you have lots of backends that share servers, because they all queue up on each other and you end up with seriously delayed processing)
  2. Figure out a way to reliably call unsubscribe from a Thread outside the callback
  3. Figure out a way for zk-ruby not to join the callback thread when you call unsubscribe

I am pursuing the second solution for now. Given my understanding of zk-ruby I think that the thread.join is to be expected.

@jolynch
Copy link
Collaborator

jolynch commented Sep 3, 2015

@igor47, @rickypai I'm actually thinking the better solution is to just not call unsubscribe at all. The wiki page over at https://github.com/zk-ruby/zk/wiki/Events#subscriptions warns against it because your callback will keep getting called, but this is like exactly what Synapse wants to happen.

I'll have a PR shortly that just sets the watch once. I believe this is safe because if the path goes away, synapse will crash. The only thing that might go wrong afaict is if we don't trust zk-ruby to actually have a long running watch. @igor47 is that why you guys reset the watch every time?

@evie404
Copy link
Collaborator

evie404 commented Sep 3, 2015

Thanks for digging into this. Igor would have better idea about what's the proper fix is for Synapse, but I can look into the fix on the zk side (or at least communicate it)

jolynch pushed a commit that referenced this issue Sep 3, 2015
Addresses incompatibility of per_callback threading in ZK, the way that
synapse calls unsubscribe within the callback, and ruby 2.X really not
liking joining on the current running thread.

Should fix #137 #99 and possibly #94
@jolynch
Copy link
Collaborator

jolynch commented Sep 3, 2015

@rickypai I don't actually think there is anything wrong with zk-ruby, the semantics seem correct to me. The only potential fix would be for zk-ruby to catch the join on self exception over at https://github.com/zk-ruby/zk/blob/e3274c17e61fd2d0a65fc8d024f63026c07c07b3/lib/zk/threaded_callback.rb#L44.

I think it's probably incorrect in general to call unsubscribe from within a callback. If we don't like my proposed fix I can do some shuffling of events on queues somewhere in the zookeeper watcher.

@igor47
Copy link
Collaborator

igor47 commented Sep 3, 2015

i think we probably just added the unsubscribe because we misunderstood the zk docs; if the subscription will continue to notify, i think we're okay to avoid clearing/re-setting it.

@evie404
Copy link
Collaborator

evie404 commented Sep 4, 2015

@jolynch I think the code should probably accommodate the case for Ruby 2+ thread exceptions so it properly logs the error and returns the state.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants