DnsWatcher < BaseWatcher #172
Comments
The Synapse philosophy is fail-fast; I would be opposed to allowing a dead watcher to stick around. Why did your cluster collapse? HAProxy should have continued running with the last known good configuration. It should be fine for Synapse to bail -- that should only break discovery of added/removed backends. We should probably make the …
Mesos (thermos_executor from Apache Aurora) is supervising Synapse; when Synapse exited on a single failed watcher (due to a resolver issue caused by a network interruption), all jobs in the cgroup went with it. The Mesos cluster collapsed because of a DNS interruption: our DNS servers were unable to forward a lookup for airbnb.com (probably also a risk if the Airbnb resolvers ever failed). We use Synapse/HAProxy only where it is absolutely required (effectively a sidecar process), because we don't want to health check all of our services from every machine in the fleet. Our short-term fix is to not use the DNS resolver at all. I would prefer to treat a DNS failure as a transient event and retry until the records have at least hit their TTL before bailing out of the whole process. Totally open to hashing out a design that works for both of us.
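For illustration, a minimal Ruby sketch of the retry-until-TTL idea described above; the method name and the `ttl_seconds`/`interval` parameters are hypothetical, not part of Synapse:

```ruby
require 'resolv'

# Sketch only: retry resolution for roughly one TTL window before giving up,
# so a brief DNS outage is treated as transient rather than fatal.
# Resolv.getaddresses returns [] on failure, so an empty answer is treated
# the same as a resolver error here.
def resolve_with_retry(hostname, ttl_seconds: 60, interval: 5)
  deadline = Time.now + ttl_seconds
  loop do
    begin
      addresses = Resolv.getaddresses(hostname)
      return addresses unless addresses.empty?
    rescue Resolv::ResolvError
      # fall through to the retry / bail-out logic below
    end
    raise Resolv::ResolvError, "still cannot resolve #{hostname} after #{ttl_seconds}s" if Time.now >= deadline
    sleep interval # transient failure: wait and try again
  end
end
```

Once the deadline passes, the error propagates and the existing fail-fast behaviour takes over.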
A transient DNS failure is not a risk for us, because Synapse exiting (even exiting on every machine at once) is not a risk for us. Most likely, things would recover on their own, since we use a process supervisor (runit) which would just restart Synapse. Worst case, we would require manual intervention. Either way, this scenario would not immediately threaten the stability of Airbnb. I am not familiar with the technologies you're using, but I recommend attempting to create a similar configuration. A single component failing should never take down your entire infrastructure.
If you're running Synapse inside a container as a sidecar, you can run it under a supervisor and everything should work out. Ideally, since you have multiple processes in a single container, you should be using a proper init replacement anyway (e.g. runit, supervisord, etc.). That being said, best practice is to run Synapse and HAProxy outside your containers. We do this at Yelp (where we use Mesos/Marathon) and it has worked wonderfully. HAProxy binds to a link-local IP and all containers can talk to it. This also makes the system much more scalable because you don't have hundreds (or sometimes thousands) of connections to your service registry from every host.
Also, if healthchecking is a concern, you can turn off healthchecking except on error, or just turn it off altogether. We leave healthchecking on at a low rate (leveraging on-error and fastinter to make the rate high on error) and use hacheck (many use Varnish) to soak up the healthcheck traffic. This is extra nice because we can gracefully fail boxes out of the SmartStack to do things like Docker upgrades or system reboots. This architecture has scaled very well for an infrastructure of many thousands of machines in many datacenters running hundreds of services that are constantly moving around.
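For reference, this kind of tuning ends up in the `server_options` string that Synapse writes onto each HAProxy `server` line. The stanza below is only illustrative; the service port, the timings, and the hacheck check port 6666 are assumptions, not Yelp's actual values:

```yaml
# Per-service haproxy section of a synapse config -- values are illustrative.
haproxy:
  port: 3213
  server_options: >-
    check port 6666 inter 60s fastinter 2s downinter 10s
    rise 3 fall 2 observe layer7 on-error fastinter
```

With `on-error fastinter`, HAProxy drops from the slow 60s interval to the 2s `fastinter` rate only when it observes errors on live traffic, which keeps the steady-state check volume low.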
@jolynch We are running Synapse as a process as part of an Aurora task. The issue is that if one process exits non-zero, the task group is considered a failure and the whole job fails. Normally this isn't a big deal. However, Synapse/HAProxy is bailing out on a single DNS resolve failure, and this happens across the fleet, causing mass rescheduling of Mesos jobs (a thundering herd). I'm going to post a patch in a bit to handle the exception gracefully; I think it can be handled with a three-line change.
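As a rough, standalone illustration of what handling the failure gracefully could look like (this is not the actual DnsWatcher code or the proposed patch; the hostname, interval, and logging below are made up), a polling loop can log resolver failures and keep the last known backends instead of exiting:

```ruby
require 'resolv'
require 'logger'

# Sketch only: a watcher-style polling loop that survives resolver failures.
log = Logger.new($stderr)
last_known = []

loop do
  begin
    addresses = Resolv.getaddresses('myservice.example.internal') # hypothetical name
    if addresses.empty?
      log.warn('dns returned no records; keeping last known backends')
    else
      last_known = addresses # would trigger a backend reconfiguration in a real watcher
    end
  rescue Resolv::ResolvError => e
    log.warn("dns resolution failed (#{e}); keeping #{last_known.size} last known backends")
  end
  sleep 30
end
```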
We were trying out this service at our organization in an internal datacenter and we temporarily lost our internet connection. Because resolving the external address of airbnb.com is required as part of 'ping?', the outage caused Synapse to fail entirely. Perhaps throwing a recoverable error rather than a runtime exception would help prevent the entire cluster from collapsing.