DnsWatcher < BaseWatcher #172

Open
bkochendorfer opened this issue Feb 18, 2016 · 6 comments

@bkochendorfer

We were trying out this service at our organization in an internal datacenter and temporarily lost internet connectivity. Because resolving the external address of airbnb.com is required as part of 'ping?', losing that connectivity caused synapse to fail entirely. Perhaps raising a recoverable error rather than a runtime exception would help prevent the entire cluster from collapsing. The log and traceback below show the failure; a rough reconstruction of the loop that raises follows them.

I, [2016-02-17T17:34:09.333142 #8263]  INFO -- Synapse::Haproxy: synapse: restarted haproxy
E, [2016-02-18T16:58:00.932196 #8263] ERROR -- Synapse::Synapse: synapse: encountered unexpected exception #<RuntimeError: synapse: service watcher mongo failed ping!> in main thread
W, [2016-02-18T16:58:00.932830 #8263]  WARN -- Synapse::Synapse: synapse: exiting; sending stop signal to all watchers
I, [2016-02-18T16:58:00.932962 #8263]  INFO -- Synapse::ZookeeperDnsWatcher: synapse: stopping watcher mongo using default stop handler
W, [2016-02-18T16:58:00.933561 #8263]  WARN -- Synapse::ZookeeperDnsWatcher::Zookeeper: synapse: zookeeper watcher exiting
I, [2016-02-18T16:58:00.934565 #8263]  INFO -- Synapse::ZookeeperDnsWatcher::Zookeeper: synapse: zookeeper watcher cleaning up
I, [2016-02-18T16:58:00.934961 #8263]  INFO -- Synapse::ZookeeperDnsWatcher::Zookeeper: synapse: closing zk connection to 172.16.151.82:2181,172.16.151.86:2181,172.16.151.87:2181
I, [2016-02-18T16:58:00.937347 #8263]  INFO -- Synapse::ZookeeperDnsWatcher::Zookeeper: synapse: zookeeper watcher cleaned up successfully
/var/lib/mesos/slaves/20160213-002027-1469517996-5050-555-S4/frameworks/20160204-212329-1385631916-5050-163-0000/executors/thermos-1455730435744-bright-devel-authserve-0-6e52c164-8fcb-4e00-bc33-62523f32ad4c/runs/4810df56-7a1d-4bd1-9c2c-84f2c041a181/sandbox/synapse/lib/ruby/gems/2.2.0/gems/synapse-0.12.1/lib/synapse.rb:54:in `block (2 levels) in run': synapse: service watcher mongo failed ping! (RuntimeError)
    from /var/lib/mesos/slaves/20160213-002027-1469517996-5050-555-S4/frameworks/20160204-212329-1385631916-5050-163-0000/executors/thermos-1455730435744-bright-devel-authserve-0-6e52c164-8fcb-4e00-bc33-62523f32ad4c/runs/4810df56-7a1d-4bd1-9c2c-84f2c041a181/sandbox/synapse/lib/ruby/gems/2.2.0/gems/synapse-0.12.1/lib/synapse.rb:53:in `each'
    from /var/lib/mesos/slaves/20160213-002027-1469517996-5050-555-S4/frameworks/20160204-212329-1385631916-5050-163-0000/executors/thermos-1455730435744-bright-devel-authserve-0-6e52c164-8fcb-4e00-bc33-62523f32ad4c/runs/4810df56-7a1d-4bd1-9c2c-84f2c041a181/sandbox/synapse/lib/ruby/gems/2.2.0/gems/synapse-0.12.1/lib/synapse.rb:53:in `block in run'
    from /var/lib/mesos/slaves/20160213-002027-1469517996-5050-555-S4/frameworks/20160204-212329-1385631916-5050-163-0000/executors/thermos-1455730435744-bright-devel-authserve-0-6e52c164-8fcb-4e00-bc33-62523f32ad4c/runs/4810df56-7a1d-4bd1-9c2c-84f2c041a181/sandbox/synapse/lib/ruby/gems/2.2.0/gems/synapse-0.12.1/lib/synapse.rb:52:in `loop'
    from /var/lib/mesos/slaves/20160213-002027-1469517996-5050-555-S4/frameworks/20160204-212329-1385631916-5050-163-0000/executors/thermos-1455730435744-bright-devel-authserve-0-6e52c164-8fcb-4e00-bc33-62523f32ad4c/runs/4810df56-7a1d-4bd1-9c2c-84f2c041a181/sandbox/synapse/lib/ruby/gems/2.2.0/gems/synapse-0.12.1/lib/synapse.rb:52:in `run'
    from /var/lib/mesos/slaves/20160213-002027-1469517996-5050-555-S4/frameworks/20160204-212329-1385631916-5050-163-0000/executors/thermos-1455730435744-bright-devel-authserve-0-6e52c164-8fcb-4e00-bc33-62523f32ad4c/runs/4810df56-7a1d-4bd1-9c2c-84f2c041a181/sandbox/synapse/lib/ruby/gems/2.2.0/gems/synapse-0.12.1/bin/synapse:60:in `<top (required)>'
    from /var/lib/mesos/slaves/20160213-002027-1469517996-5050-555-S4/frameworks/20160204-212329-1385631916-5050-163-0000/executors/thermos-1455730435744-bright-devel-authserve-0-6e52c164-8fcb-4e00-bc33-62523f32ad4c/runs/4810df56-7a1d-4bd1-9c2c-84f2c041a181/sandbox/synapse/.bin/synapse:23:in `load'
    from /var/lib/mesos/slaves/20160213-002027-1469517996-5050-555-S4/frameworks/20160204-212329-1385631916-5050-163-0000/executors/thermos-1455730435744-bright-devel-authserve-0-6e52c164-8fcb-4e00-bc33-62523f32ad4c/runs/4810df56-7a1d-4bd1-9c2c-84f2c041a181/sandbox/synapse/.bin/synapse:23:in `<main>'
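
For context, the raise appears to come from synapse's main run loop. This is a rough reconstruction from the traceback above (lib/synapse.rb:52-54 in 0.12.1); the names and details are approximate and not taken from the actual source:

    # Rough reconstruction, not the verbatim 0.12.1 source: the loop raises as soon
    # as any watcher's ping? returns false, which is what takes the whole process
    # (and, under a supervising executor, the whole task) down.
    loop do
      @service_watchers.each do |watcher|
        raise "synapse: service watcher #{watcher.name} failed ping!" unless watcher.ping?
      end
      sleep 1
    end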

igor47 commented Feb 18, 2016

the synapse philosophy is fail-fast; i would be opposed to allowing a dead watcher to stick around.

why did your cluster collapse? haproxy should have continued running with the last known good configuration. it should be fine for synapse to bail -- it should only break discovery of added/removed backends.

we should probably make the airbnb.com thing configurable to the watcher. i would welcome a PR to do that.
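
A minimal sketch of what a configurable check could look like, assuming the DNS watcher's ping? resolves a hard-coded hostname today; 'check_host' is a hypothetical new config key and @discovery stands in for the watcher's parsed config:

    require 'resolv'

    # Hypothetical sketch: read the liveness-check hostname from the watcher config
    # instead of hard-coding airbnb.com. 'check_host' is an assumed key, not an
    # existing synapse option.
    def ping?
      check_host = @discovery['check_host'] || 'airbnb.com'
      Resolv::DNS.open do |dns|
        dns.getaddress(check_host)  # raises Resolv::ResolvError if it cannot resolve
      end
      true
    rescue Resolv::ResolvError, Resolv::ResolvTimeout
      false
    end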

@JustinVenus

Mesos (the thermos_executor from Apache Aurora) is supervising synapse ... when synapse exited on a single failed watcher (due to a resolver issue caused by a network interruption), all jobs in the cgroup went with it. The Mesos cluster collapsed because of a DNS interruption: our DNS servers were unable to forward a lookup for airbnb.com (probably also a risk if the Airbnb resolvers ever failed).

We use synapse/haproxy only where it is absolutely required (effectively a sidecar process). We do this because we don't want to health check all of our services from every machine in the fleet.

Our short-term fix is to not use the DNS resolver at all. I would prefer to treat a DNS failure as a transient event and retry until the records have at least hit the TTL before bailing out of the whole process.
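
A hedged sketch of that retry-until-TTL idea (not existing synapse code; the 60-second window and the hard-coded hostname are assumptions for illustration):

    require 'resolv'

    RETRY_WINDOW = 60  # seconds; roughly the TTL of the records in question (assumed)

    # Treat resolver failures as transient: keep retrying until the window has
    # elapsed, and only then report the watcher as unhealthy.
    def ping?
      deadline = Time.now + RETRY_WINDOW
      begin
        Resolv::DNS.open { |dns| dns.getaddress('airbnb.com') }
        true
      rescue Resolv::ResolvError, Resolv::ResolvTimeout
        if Time.now < deadline
          sleep 1
          retry
        end
        false
      end
    end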

Totally open to hashing out a design that works for both of us.


igor47 commented Feb 18, 2016

a transient dns failure is not a risk for us, because synapse exiting (even exiting on every machine at once) is not a risk for us. likely, things would recover on their own, since we use a process supervisor (runit) which would just restart synapse. worst case, we would require manual intervention. either way, this scenario would not immediately threaten the stability of airbnb.

i am not familiar with the technologies you're using, but i recommend attempting to create a similar configuration. a single component failing should never take down your entire infrastructure.


jolynch commented Feb 19, 2016

@JustinVenus

If you're running Synapse inside a container as a sidecar, you can run it under a supervisor and everything should work out. Ideally, since you have multiple processes in a single container, you should be using a proper init replacement anyway (e.g. runit, supervisord, etc.).

That being said, best practice is to run Synapse and HAProxy outside your containers. We do this at Yelp (where we use mesos/marathon) and it has worked wonderfully. HAProxy binds to a link-local IP and all containers can talk to it. This also makes the system much more scalable because you don't have hundreds (or sometimes thousands) of connections to your service registry from every host.


jolynch commented Feb 19, 2016

Also, if healthchecking is a concern, you can turn off healthchecks except on error, or turn them off altogether.

We leave healthchecking on at a low rate (leveraging on-error and fast-inter to make the rate high on error) and use hacheck (many use varnish) to soak up the healthcheck traffic. This is extra nice because we can gracefully fail boxes out of the SmartStack to do things like docker upgrades or system reboots. This architecture has scaled very well for an infra of many thousands of machines in many datacenters running hundreds of services that are constantly moving around.

@JustinVenus

@jolynch We are running synapse as a process in an Aurora task. The issue is that if one process exits != 0, the task group is considered a failure and the whole job fails. Normally this isn't a big deal. However, synapse/haproxy is bailing out on a single DNS resolution failure, and across the fleet that causes mass rescheduling of Mesos jobs (a thundering herd). I'm going to post a patch in a bit to handle the exception gracefully. I think it can be handled with a 3-line change.
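
A hedged sketch of the kind of change meant here, assuming the main loop looks roughly like the reconstruction earlier in the thread and that the project's logger is available as log; a real patch would also want some retry/backoff policy:

    # Sketch only: log and keep looping when a watcher fails ping?, instead of
    # raising, so a transient resolver outage does not take the process down.
    loop do
      @service_watchers.each do |watcher|
        unless watcher.ping?
          log.warn "synapse: service watcher #{watcher.name} failed ping!; will retry"
        end
      end
      sleep 1
    end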
