DnsWatcher < BaseWatcher #172

Open
bkochendorfer opened this issue Feb 18, 2016 · 6 comments

@bkochendorfer

We were trying out this service at our organization in an internal datacenter and temporarily lost internet connectivity. Because resolving the external address of airbnb.com is required as part of 'ping?', losing that connectivity caused synapse to fail entirely. Perhaps raising a recoverable error rather than a runtime exception would help prevent the entire cluster from collapsing. The log and traceback below show the failure; a rough reconstruction of the loop that raises follows them.

I, [2016-02-17T17:34:09.333142 #8263]  INFO -- Synapse::Haproxy: synapse: restarted haproxy
E, [2016-02-18T16:58:00.932196 #8263] ERROR -- Synapse::Synapse: synapse: encountered unexpected exception #<RuntimeError: synapse: service watcher mongo failed ping!> in main thread
W, [2016-02-18T16:58:00.932830 #8263]  WARN -- Synapse::Synapse: synapse: exiting; sending stop signal to all watchers
I, [2016-02-18T16:58:00.932962 #8263]  INFO -- Synapse::ZookeeperDnsWatcher: synapse: stopping watcher mongo using default stop handler
W, [2016-02-18T16:58:00.933561 #8263]  WARN -- Synapse::ZookeeperDnsWatcher::Zookeeper: synapse: zookeeper watcher exiting
I, [2016-02-18T16:58:00.934565 #8263]  INFO -- Synapse::ZookeeperDnsWatcher::Zookeeper: synapse: zookeeper watcher cleaning up
I, [2016-02-18T16:58:00.934961 #8263]  INFO -- Synapse::ZookeeperDnsWatcher::Zookeeper: synapse: closing zk connection to 172.16.151.82:2181,172.16.151.86:2181,172.16.151.87:2181
I, [2016-02-18T16:58:00.937347 #8263]  INFO -- Synapse::ZookeeperDnsWatcher::Zookeeper: synapse: zookeeper watcher cleaned up successfully
/var/lib/mesos/slaves/20160213-002027-1469517996-5050-555-S4/frameworks/20160204-212329-1385631916-5050-163-0000/executors/thermos-1455730435744-bright-devel-authserve-0-6e52c164-8fcb-4e00-bc33-62523f32ad4c/runs/4810df56-7a1d-4bd1-9c2c-84f2c041a181/sandbox/synapse/lib/ruby/gems/2.2.0/gems/synapse-0.12.1/lib/synapse.rb:54:in `block (2 levels) in run': synapse: service watcher mongo failed ping! (RuntimeError)
    from /var/lib/mesos/slaves/20160213-002027-1469517996-5050-555-S4/frameworks/20160204-212329-1385631916-5050-163-0000/executors/thermos-1455730435744-bright-devel-authserve-0-6e52c164-8fcb-4e00-bc33-62523f32ad4c/runs/4810df56-7a1d-4bd1-9c2c-84f2c041a181/sandbox/synapse/lib/ruby/gems/2.2.0/gems/synapse-0.12.1/lib/synapse.rb:53:in `each'
    from /var/lib/mesos/slaves/20160213-002027-1469517996-5050-555-S4/frameworks/20160204-212329-1385631916-5050-163-0000/executors/thermos-1455730435744-bright-devel-authserve-0-6e52c164-8fcb-4e00-bc33-62523f32ad4c/runs/4810df56-7a1d-4bd1-9c2c-84f2c041a181/sandbox/synapse/lib/ruby/gems/2.2.0/gems/synapse-0.12.1/lib/synapse.rb:53:in `block in run'
    from /var/lib/mesos/slaves/20160213-002027-1469517996-5050-555-S4/frameworks/20160204-212329-1385631916-5050-163-0000/executors/thermos-1455730435744-bright-devel-authserve-0-6e52c164-8fcb-4e00-bc33-62523f32ad4c/runs/4810df56-7a1d-4bd1-9c2c-84f2c041a181/sandbox/synapse/lib/ruby/gems/2.2.0/gems/synapse-0.12.1/lib/synapse.rb:52:in `loop'
    from /var/lib/mesos/slaves/20160213-002027-1469517996-5050-555-S4/frameworks/20160204-212329-1385631916-5050-163-0000/executors/thermos-1455730435744-bright-devel-authserve-0-6e52c164-8fcb-4e00-bc33-62523f32ad4c/runs/4810df56-7a1d-4bd1-9c2c-84f2c041a181/sandbox/synapse/lib/ruby/gems/2.2.0/gems/synapse-0.12.1/lib/synapse.rb:52:in `run'
    from /var/lib/mesos/slaves/20160213-002027-1469517996-5050-555-S4/frameworks/20160204-212329-1385631916-5050-163-0000/executors/thermos-1455730435744-bright-devel-authserve-0-6e52c164-8fcb-4e00-bc33-62523f32ad4c/runs/4810df56-7a1d-4bd1-9c2c-84f2c041a181/sandbox/synapse/lib/ruby/gems/2.2.0/gems/synapse-0.12.1/bin/synapse:60:in `<top (required)>'
    from /var/lib/mesos/slaves/20160213-002027-1469517996-5050-555-S4/frameworks/20160204-212329-1385631916-5050-163-0000/executors/thermos-1455730435744-bright-devel-authserve-0-6e52c164-8fcb-4e00-bc33-62523f32ad4c/runs/4810df56-7a1d-4bd1-9c2c-84f2c041a181/sandbox/synapse/.bin/synapse:23:in `load'
    from /var/lib/mesos/slaves/20160213-002027-1469517996-5050-555-S4/frameworks/20160204-212329-1385631916-5050-163-0000/executors/thermos-1455730435744-bright-devel-authserve-0-6e52c164-8fcb-4e00-bc33-62523f32ad4c/runs/4810df56-7a1d-4bd1-9c2c-84f2c041a181/sandbox/synapse/.bin/synapse:23:in `<main>'
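
For context, the raise appears to come from synapse's main run loop. This is a rough reconstruction from the traceback above (lib/synapse.rb:52-54 in 0.12.1); the names and details are approximate and not taken from the actual source:

    # Rough reconstruction, not the verbatim 0.12.1 source: the loop raises as soon
    # as any watcher's ping? returns false, which is what takes the whole process
    # (and, under a supervising executor, the whole task) down.
    loop do
      @service_watchers.each do |watcher|
        raise "synapse: service watcher #{watcher.name} failed ping!" unless watcher.ping?
      end
      sleep 1
    end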

igor47 commented Feb 18, 2016

the synapse philosophy is fail-fast; i would be opposed to allowing a dead watcher to stick around.

why did your cluster collapse? haproxy should have continued running with the last known good configuration. it should be fine for synapse to bail -- it should only break discovery of added/removed backends.

we should probably make the airbnb.com thing configurable to the watcher. i would welcome a PR to do that.
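
A minimal sketch of what a configurable check could look like, assuming the DNS watcher's ping? resolves a hard-coded hostname today; 'check_host' is a hypothetical new config key and @discovery stands in for the watcher's parsed config:

    require 'resolv'

    # Hypothetical sketch: read the liveness-check hostname from the watcher config
    # instead of hard-coding airbnb.com. 'check_host' is an assumed key, not an
    # existing synapse option.
    def ping?
      check_host = @discovery['check_host'] || 'airbnb.com'
      Resolv::DNS.open do |dns|
        dns.getaddress(check_host)  # raises Resolv::ResolvError if it cannot resolve
      end
      true
    rescue Resolv::ResolvError, Resolv::ResolvTimeout
      false
    end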

@JustinVenus

Mesos (the thermos_executor from Apache Aurora) is supervising synapse ... when synapse exited on a single failed watcher (due to a resolver issue caused by a network interruption), all jobs in the cgroup went with it. The Mesos cluster collapsed because of a DNS interruption: our DNS servers were unable to forward a lookup for airbnb.com (probably also a risk if the Airbnb resolvers ever failed).

We use synapse/haproxy only where it is absolutely required (effectively a sidecar process). We do this because we don't want to health check all of our services from every machine in the fleet.

Our short-term fix is to not use the DNS resolver at all. I would prefer to treat a DNS failure as a transient event and retry until the records have at least hit the TTL before bailing out of the whole process.
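
A hedged sketch of that retry-until-TTL idea (not existing synapse code; the 60-second window and the hard-coded hostname are assumptions for illustration):

    require 'resolv'

    RETRY_WINDOW = 60  # seconds; roughly the TTL of the records in question (assumed)

    # Treat resolver failures as transient: keep retrying until the window has
    # elapsed, and only then report the watcher as unhealthy.
    def ping?
      deadline = Time.now + RETRY_WINDOW
      begin
        Resolv::DNS.open { |dns| dns.getaddress('airbnb.com') }
        true
      rescue Resolv::ResolvError, Resolv::ResolvTimeout
        if Time.now < deadline
          sleep 1
          retry
        end
        false
      end
    end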

Totally open to hashing out a design that works for both of us.


igor47 commented Feb 18, 2016

a transient dns failure is not a risk for us, because synapse exiting (even exiting on every machine at once) is not a risk for us. likely, things would recover on their own, since we use a process supervisor (runit) which would just restart synapse. worst case, we would require manual intervention. either way, this scenario would not immediately threaten the stability of airbnb.

i am not familiar with the technologies you're using, but i recommend attempting to create a similar configuration. a single component failing should never take down your entire infrastructure.


jolynch commented Feb 19, 2016

@JustinVenus

If you're running Synapse inside a container as a sidecar, you can run it under a supervisor and everything should work out. Ideally, since you have multiple processes in a single container, you should be using a proper init replacement anyway (e.g. runit, supervisord, etc.).

That being said, best practice is to run Synapse and HAProxy outside your containers. We do this at Yelp (where we use mesos/marathon) and it has worked wonderfully. HAProxy binds to a link-local IP and all containers can talk to it. This also makes the system much more scalable because you don't have hundreds (or sometimes thousands) of connections to your service registry from every host.


jolynch commented Feb 19, 2016

Also, if healthchecking is a concern, you can turn off healthchecks except on error, or turn them off altogether.

We leave healthchecking on at a low rate (leveraging on-error and fast-inter to make the rate high on error) and use hacheck (many use varnish) to soak up the healthcheck traffic. This is extra nice because we can gracefully fail boxes out of the SmartStack to do things like docker upgrades or system reboots. This architecture has scaled very well for an infra of many thousands of machines in many datacenters running hundreds of services that are constantly moving around.

@JustinVenus

@jolynch We are running synapse as a process in an Aurora task. The issue is that if one process exits != 0, the task group is considered a failure and the whole job fails. Normally this isn't a big deal. However, synapse/haproxy is bailing out on a single DNS resolution failure, and across the fleet that causes mass rescheduling of Mesos jobs (a thundering herd). I'm going to post a patch in a bit to handle the exception gracefully. I think it can be handled with a 3-line change.
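
A hedged sketch of the kind of change meant here, assuming the main loop looks roughly like the reconstruction earlier in the thread and that the project's logger is available as log; a real patch would also want some retry/backoff policy:

    # Sketch only: log and keep looping when a watcher fails ping?, instead of
    # raising, so a transient resolver outage does not take the process down.
    loop do
      @service_watchers.each do |watcher|
        unless watcher.ping?
          log.warn "synapse: service watcher #{watcher.name} failed ping!; will retry"
        end
      end
      sleep 1
    end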
