dnsrr discovery method does not work when "healthcheck" used #47

Open
mnoky opened this issue Jul 30, 2019 · 6 comments

@mnoky

mnoky commented Jul 30, 2019

Great project, I'm excited to get this working for a service I have deployed in a docker cluster! Currently testing 1.0-RC14 and I've hit the following snag:

The dnsrr discovery method does not work when a docker "healthcheck" is used. The reason: during startup, the service name cannot be resolved, because the name is not available until after the healthcheck succeeds and the service is up and running. It is a chicken-and-egg problem. The following exception is thrown at startup and the service cannot start (only relevant lines shown):

Caused by: org.springframework.beans.BeanInstantiationException: Failed to instantiate [com.hazelcast.core.HazelcastInstance]: Factory method 'hazelcastInstance' threw exception; nested exception is com.hazelcast.config.ConfigurationException: Cannot create a new instance of MemberAddressProvider 'class org.bitsofinfo.hazelcast.spi.docker.swarm.dnsrr.DockerDNSRRMemberAddressProvider'
...
Caused by: com.hazelcast.config.ConfigurationException: Cannot create a new instance of MemberAddressProvider 'class org.bitsofinfo.hazelcast.spi.docker.swarm.dnsrr.DockerDNSRRMemberAddressProvider'
...
    at com.hazelcast.instance.DefaultNodeContext.newMemberAddressProviderInstance(DefaultNodeContext.java:94)
    ... 63 more
    Caused by: java.net.UnknownHostException: my_service: Name or service not known
...
    at org.bitsofinfo.hazelcast.spi.docker.swarm.dnsrr.DockerDNSRRMemberAddressProvider.resolveServiceName(DockerDNSRRMemberAddressProvider.java:130)

When I disable the healthcheck for my service, the dns resolution works right away and there are no problems.

Is it possible to delay the dns lookup in DockerDNSRRMemberAddressProvider? Or does it need to be available right away?

@mnoky
Author

mnoky commented Jul 30, 2019

Looks like others have encountered this problem as well:

moby/moby#35451

@bitsofinfo
Owner

I don't think there is a way to do this out of the box. I think @Cardds would have to add an option for some kind of artificial sleep for such a thing, but I'm not sure even that would be reliable. @Cardds ?
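For illustration only, here is a minimal sketch of what such a delay could look like: a bounded retry around the lookup instead of a single fixed sleep. The class `DnsRetrySketch`, the `Lookup` interface, and `resolveWithRetry` are hypothetical names, not part of this project or of Hazelcast. Whether retrying helps at all depends on swarm publishing the name before the healthcheck passes, which is exactly the uncertain part.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

public class DnsRetrySketch {

    /** Hypothetical lookup abstraction so the retry loop can be exercised without real DNS. */
    @FunctionalInterface
    public interface Lookup {
        InetAddress[] resolve(String host) throws UnknownHostException;
    }

    /**
     * Tries the lookup up to maxAttempts times, sleeping delayMillis between
     * failed attempts. Rethrows the last UnknownHostException if every attempt fails.
     */
    public static InetAddress[] resolveWithRetry(Lookup lookup, String serviceName,
                                                 int maxAttempts, long delayMillis)
            throws UnknownHostException, InterruptedException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        UnknownHostException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return lookup.resolve(serviceName);
            } catch (UnknownHostException e) {
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(delayMillis); // wait before the next attempt
                }
            }
        }
        throw last; // non-null here: the loop ran at least once and every attempt failed
    }

    public static void main(String[] args) throws Exception {
        // "localhost" stands in for the swarm service name (e.g. my_service).
        InetAddress[] addrs = resolveWithRetry(InetAddress::getAllByName, "localhost", 5, 1000L);
        System.out.println(addrs.length + " address(es) resolved");
    }
}
```

One caveat with this approach: the JVM caches failed lookups for a short time (the `networkaddress.cache.negative.ttl` security property, 10 seconds by default), so retry delays shorter than that may just return the cached failure.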

@mnoky
Author

mnoky commented Jul 30, 2019

I haven't yet tried the DockerSwarmDiscoveryStrategy + SwarmMemberAddressProvider solution. I'm guessing the use of healthcheck will also be problematic here... Do you know offhand if this would be the case?

@bitsofinfo
Owner

That method uses the actual swarm APIs to discover peers, so it's not reliant on the auto-generated swarm peer-level host/DNS names the way the DockerDNSRRMemberAddressProvider method is. So it should work.

@bitsofinfo
Owner

btw @mnoky, I highly doubt that moby issue will ever be resolved. They've seemingly relegated swarm to minimal maintenance mode at this point.

@vinsgithub

Hi @bitsofinfo @mnoky, I've found a workaround for this in my scenario (it doesn't necessarily cover all cases) and I hope it can help someone.

A small note about swarm:
Swarm is not dead and is still maintained. It is also partly being developed further by Mirantis, because lots of companies are still using it. Many things have changed since 2019, sure, but swarm is still out there for those who don't need Kubernetes and cloud services in general.

To overcome the initialization problem in my springboot (jhipster) microservice using your awesome hazelcast-docker-swarm solution, I've set this in my docker-compose:

   healthcheck:
      test: (echo 'exit' | curl -v telnet://localhost:8082 2>&1 | grep -c refused > /dev/null) || (curl -sS http://localhost:8082/management/health | grep -c UP > /dev/null)
      interval: 5s
      timeout: 30s
      retries: 4
      start_period: 3s #must be less than JHIPSTER_SLEEP

The rationale behind this is that the application needs to resolve the docker service name during startup, but swarm does not allow that until the healthcheck itself passes. So we first let the healthcheck pass while the local service port (8082) is refusing connections (the application is still starting); as soon as the local port responds, the healthcheck tests the real application health output.
It's not ideal but it's a good compromise.
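
To make the two-phase logic of that healthcheck explicit, here is a minimal Java sketch of the same decision. It is only an illustration of the reasoning, not a replacement for the compose command above; the class and method names (`StartupTolerantHealthSketch`, `isPortRefusing`, `isHealthy`) are hypothetical.

```java
import java.io.IOException;
import java.net.ConnectException;
import java.net.InetSocketAddress;
import java.net.Socket;

public class StartupTolerantHealthSketch {

    /** Returns true if nothing is listening on the port yet (the connection is refused). */
    public static boolean isPortRefusing(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return false; // something accepted the connection
        } catch (ConnectException e) {
            return true;  // typically "connection refused": application still starting
        } catch (IOException e) {
            return false; // timeouts and other I/O errors are not treated as "starting"
        }
    }

    /**
     * Mirrors the shell test: report healthy while the port still refuses
     * connections, otherwise require "UP" in the health endpoint's response body.
     */
    public static boolean isHealthy(boolean portRefusing, String healthResponseBody) {
        if (portRefusing) {
            return true;
        }
        return healthResponseBody != null && healthResponseBody.contains("UP");
    }
}
```

The trade-off is visible in `isHealthy`: a container whose application never opens the port at all would keep reporting healthy, which is why this is a compromise rather than a full fix.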
