
Windows nodes can't reach service network #77

Closed
oe-hbk opened this issue May 4, 2020 · 27 comments
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@oe-hbk

oe-hbk commented May 4, 2020

I have the following setup
Windows 2019-1909
Kubernetes 1.18.2
Control plane: CentOS 7.7 with k8s 1.18.2 built with kubeadm
CNI: flannel with vxlan, using the proper vxlan ID and UDP ports for Windows compatibility

I followed the PrepareNode.ps1 script here to get the 1909 server ready, but had to build my own kube-proxy and kube-flannel Windows images as the published ones don't support 1909. I had to build setup.exe on another system and just ADD it into the container, as there isn't a golang:servercore1909 image to use as the build image.

I've followed the instructions at https://kubernetes.io/docs/tasks/administer-cluster/kubeadm/adding-windows-nodes/ to get everything up and running. I can successfully get a pod running a servercore 1909 image. When I exec into this pod, I can ping all the Linux cluster node IPs just fine. Route tables look accurate.
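For reference, a minimal sketch of the kind of workload involved, modelled on the sample style from those docs (the name, labels, and keep-alive command here are just placeholders):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: win-test                     # placeholder name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: win-test
  template:
    metadata:
      labels:
        app: win-test
    spec:
      nodeSelector:
        kubernetes.io/os: windows                         # schedule onto the Windows node only
      containers:
      - name: servercore
        image: mcr.microsoft.com/windows/servercore:1909  # must match the host's 1909 build
        command: ["ping", "-t", "localhost"]              # keep-alive so the pod stays Running for exec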

However, when I try to reach the service network, including coredns, my connections time out. So I can't do DNS lookups whatsoever.

I can reach outside my cluster fine as well (nslookup using our physical DNS server IP addresses)

The only thing that doesn't seem to be working is service network connectivity. The node does have a proper route to the service network, and I can see that \etc\cni\net.d\10-flannel.conf has the correct ExceptionList for OutBoundNAT covering both the service and pod networks, and also has a ROUTE-type endpoint policy with the destination set to the service network and NeedEncap: true.
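For reference, a minimal sketch of the relevant delegate section of that 10-flannel.conf, assuming the kubeadm/flannel defaults of 10.244.0.0/16 for pods and 10.96.0.0/12 for services, and a network name based on VNI 4096 (the generated file on a given node may differ):

{
  "name": "flannel.4096",
  "cniVersion": "0.3.0",
  "type": "flannel",
  "capabilities": { "dns": true },
  "delegate": {
    "type": "win-overlay",
    "policies": [
      {
        "Name": "EndpointPolicy",
        "Value": {
          "Type": "OutBoundNAT",
          "ExceptionList": [ "10.244.0.0/16", "10.96.0.0/12" ]
        }
      },
      {
        "Name": "EndpointPolicy",
        "Value": {
          "Type": "ROUTE",
          "DestinationPrefix": "10.96.0.0/12",
          "NeedEncap": true
        }
      }
    ]
  }
}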

@oe-hbk
Author

oe-hbk commented May 5, 2020

I found a solution to this, which was to add the --ip-masq arg to flanneld.exe. Is there a reason this wouldn't be the default, when the Linux default is to use it?

@neolit123
Member

I found a solution to this, which was to add the --ip-masq arg to flanneld.exe. Is there a reason this wouldn't be the default, when the Linux default is to use it?

@ksubrmnn @benmoss do you happen to know?

@ksubrmnn
Contributor

ksubrmnn commented May 5, 2020

It was included in the old scripts. Probably just got lost in the new version. Feel free to make a PR with the fix!

@oe-hbk
Author

oe-hbk commented May 6, 2020

Will create a PR. It seems odd to add it to the command line in run.ps1, so I will try to find a better way that is similar to the default Linux kube-flannel.yml.
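For comparison, the upstream Linux kube-flannel.yml already passes the flag in the DaemonSet container spec; a trimmed snippet (the image tag shown is just the one current around this time and may differ):

      containers:
      - name: kube-flannel
        image: quay.io/coreos/flannel:v0.12.0-amd64   # example tag; use whatever the manifest pins
        command:
        - /opt/bin/flanneld
        args:
        - --ip-masq
        - --kube-subnet-mgr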

@benmoss
Contributor

benmoss commented May 6, 2020

Hmm, curious that this hasn't been a problem for us so far. AFAIK service IPs work today; I would guess that this is covered by one of the conformance tests that is passing for us: https://k8s-testgrid.appspot.com/sig-windows#kubeadm-windows-gcp-k8s-stable

@gimlichael

gimlichael commented May 29, 2020

I found a solution to this, which was to add the --ip-masq arg to flanneld.exe. Is there a reason this wouldn't be the default, when the Linux default is to use it?

I too am having connection issues when assigning a service with type LoadBalancer. I tried your suggested fix of applying --ip-masq (wins cli process run --path /k/flannel/flanneld.exe --args "--ip-masq --kube-subnet-mgr --kubeconfig-file /k/flannel/kubeconfig.yml" --envs "POD_NAME=$env:POD_NAME POD_NAMESPACE=$env:POD_NAMESPACE"); however, I was unsuccessful in getting it to work.

From my Linux nodes, services work as expected; from Windows there is only connectivity from within the cluster; external IPs are not getting routed for some reason.

Any help is appreciated; I have been looking into this issue for days now.

Running K8s 1.18.3, Flannel 0.12, and custom images to support 1909. The network is host-gw/l2bridge, as I have never had success with vxlan.

@gimlichael

As a workaround, I set up an Ingress for one of the services; then it works fine (I guess because the Ingress is hosted on Linux, and connections inside the cluster work fine). However, the other ports I have opened up cannot be put behind the Ingress as they are non-HTTP protocols.

@gimlichael

As yet another workaround, for the non-HTTP protocols, I had to remove the LoadBalancer type and opt in to NodePort. Then I needed to reconfigure my router to translate a normal port to a node port.
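Roughly what that NodePort Service looks like (a sketch; the name, selector, and port numbers are placeholders for one of my non-HTTP services):

apiVersion: v1
kind: Service
metadata:
  name: my-tcp-service        # placeholder name
spec:
  type: NodePort
  selector:
    app: my-tcp-app           # placeholder selector
  ports:
  - protocol: TCP
    port: 5671                # in-cluster port
    targetPort: 5671
    nodePort: 30671           # must fall in 30000-32767; the router translates the normal port to this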

Do note that the two workarounds are only needed for Windows nodes; Linux nodes work as expected.

I really hope the SIG team will start to investigate these issues further.

Thanks.

@stef007a

What works for me, once all the pods are up and running on the Windows node, is the following:

Stop-Service kubelet
Start-Sleep -s 10
Stop-Service docker
Start-Service docker
Start-Service kubelet

Then the whole networking works: the LoadBalancer works, and ping from the pods works to all locations.
I'm still searching for a good solution. I get the same issue with the node setup from Microsoft's docs.

@ksubrmnn
Contributor

@JocelynBerrendonner FYI

@JocelynBerrendonner

Note that we added a couple of kube-proxy fixes for Windows hosts a couple of weeks ago, for issues that can lead to the symptoms described here:
kubernetes/kubernetes#91886
kubernetes/kubernetes#91706

Can you re-try with a Kube-Proxy version that has these fixes, please?

@sbangari and @Keith-Mange, FYI

@stef007a

@JocelynBerrendonner

Tested today with v1.18.4; no difference.

@JocelynBerrendonner

@Stefanbs23: Thank you for trying it! Unfortunately, v1.18.x doesn't seem to have these changes. That said, the changes should theoretically be in the next v1.19.x release.

@masaeedu

@JocelynBerrendonner Sorry, I'm fairly new to Kubernetes, so I'm probably misunderstanding how this works. Isn't it something about the routing (which afaik is governed by the CNI plugin) or the kube-proxy instance on the master node (which is Linux) that would be causing the problem? In other words, isn't the Windows kube-proxy instance only responsible for dealing with services whose pods are scheduled on Windows (which is not the case with coredns)?

@masaeedu

masaeedu commented Aug 10, 2020

Welp, empirically at least I'm wrong. I looked at kubectl get pods -o wide -n=kube-system and observed that the kube-proxy instance on Windows was in the state "CrashLoopBackoff". After I deleted the pod and it was rescheduled and entered the running state, name resolution magically started working inside Windows containers.

I guess I need to do more research on how network traffic flows through the system with the "Service" concept in Kubernetes.

@JocelynBerrendonner

@JocelynBerrendonner Sorry, I'm fairly new to Kubernetes, so I'm probably misunderstanding how this works. Isn't it something about the routing (which afaik is governed by the CNI plugin) or the kube-proxy instance on the master node (which is Linux) that would be causing the problem? In other words, isn't the Windows kube-proxy instance only responsible for dealing with services whose pods are scheduled on Windows (which is not the case with coredns)?

@masaeedu : this bug indeed tracks a Windows issue and has nothing to do with Linux. Each instance of Kube-Proxy is responsible for plumbing services connectivity on the node it runs on. The problem you describe seems unrelated to this bug. Let's open a different issue to track it!

@llyons

llyons commented Oct 27, 2020

I found a solution to this, which was to add the --ip-masq arg to flanneld.exe. Is there a reason this wouldn't be the default, when the Linux default is to use it?

I too am having connection issues when assigning a service with type LoadBalancer. I tried your suggested fix of applying --ip-masq (wins cli process run --path /k/flannel/flanneld.exe --args "--ip-masq --kube-subnet-mgr --kubeconfig-file /k/flannel/kubeconfig.yml" --envs "POD_NAME=$env:POD_NAME POD_NAMESPACE=$env:POD_NAMESPACE"); however, I was unsuccessful in getting it to work.

From my Linux nodes, services work as expected; from Windows there is only connectivity from within the cluster; external IPs are not getting routed for some reason.

Any help is appreciated; I have been looking into this issue for days now.

Running K8s 1.18.3, Flannel 0.12, and custom images to support 1909. The network is host-gw/l2bridge, as I have never had success with vxlan.

I am having this same issue: #103

Not sure how to resolve this; I have tried a number of things.

@Celthi

Celthi commented Dec 3, 2020

I'm having the same issue with the master node (Linux, v1.19.3) and worker node (Windows, v1.19.0).

@JocelynBerrendonner

@sbangari: heads up

@Celthi

Celthi commented Dec 5, 2020

I cannot reach the service in the Windows node from the Linux node.

[root@tec-l-014627 admin]# kubectl get pods -o wide
NAME                             READY   STATUS    RESTARTS   AGE   IP            NODE           NOMINATED NODE   READINESS GATES
curl-7ff5cb5cb7-sxvvn            1/1     Running   0          6s    10.244.0.86   tec-l-014627   <none>           <none>
win-webserver-7d6d8c7f79-vx24n   1/1     Running   0          86s   10.244.2.26   tec-w-013246   <none>           <none>

[root@tec-l-014627 admin]# kubectl exec -ti curl-7ff5cb5cb7-sxvvn -- /bin/sh
[ root@curl-7ff5cb5cb7-sxvvn:/ ]$ curl win-webserver
^C
[ root@curl-7ff5cb5cb7-sxvvn:/ ]$ curl 10.244.2.26
^C
[ root@curl-7ff5cb5cb7-sxvvn:/ ]$ curl 10.244.2.26
curl: (7) Failed to connect to 10.244.2.26 port 80: Connection timed out

[root@tec-l-014627 admin]# kubectl get svc
NAME            TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)        AGE
kubernetes      ClusterIP   10.96.0.1        <none>        443/TCP        3d13h
my-nginx        ClusterIP   10.105.250.101   <none>        80/TCP         2d1h
win-webserver   NodePort    10.98.131.225    <none>        80:31381/TCP   6s
[ root@curl-7ff5cb5cb7-sxvvn:/ ]$ curl 10.98.131.225
curl: (7) Failed to connect to 10.98.131.225 port 80: Connection timed out


From the Windows side (PowerShell), the same service IP responds:

PS C:\Users\admin> curl 10.98.131.225


StatusCode        : 200
StatusDescription : OK
Content           :
RawContent        : HTTP/1.1 200 OK
                    Transfer-Encoding: chunked
                    Content-Type: text/html
                    Date: Sat, 05 Dec 2020 04:31:19 GMT
                    Server: Microsoft-HTTPAPI/2.0


Forms             : {}
Headers           : {[Transfer-Encoding, chunked], [Content-Type, text/html], [Date, Sat, 05 Dec 2020 04:31:19 GMT],
                    [Server, Microsoft-HTTPAPI/2.0]}
Images            : {}
InputFields       : {}
Links             : {}
ParsedHtml        : mshtml.HTMLDocumentClass
RawContentLength  : 0


@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

k8s-ci-robot added the lifecycle/stale label on Mar 5, 2021
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten

k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label on Apr 4, 2021
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community.
/close


@joaoestevinho

joaoestevinho commented May 13, 2021

Having the same issue on a deployment with flannel vxlan, using Kubernetes 1.21 on both the control plane (Linux) and the node (Windows Server Core 2004).

I can send data between pods on the Windows node but have no connectivity from these pods to the outside or to any service IP hosted on either the Linux or Windows nodes.

@k8s-ci-robot
Contributor

@joaoestevinho: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to this:

/reopen

Having the same issue on a deployment with flannel vxlan, using Kubernetes 1.21 on both the control plane (Linux) and the node (Windows Server Core 2004).

I can send data between pods on the Windows node but have no connectivity from these pods to the outside or to any service IP hosted on either the Linux or Windows nodes.


@natejgardner

This seems to still be an active problem.
