
Windows nodes can't reach service network #77

Closed
oe-hbk opened this issue May 4, 2020 · 27 comments
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@oe-hbk

oe-hbk commented May 4, 2020

I have the following setup
Windows 2019-1909
Kubernetes 1.18.2
Control plane: CentOS 7.7 with k8s 1.18.2 built with kubeadm
CNI: flannel with vxlan, using the proper vxlan ID and UDP ports for Windows compatibility

I followed the PrepareNode.ps1 script here to get the 1909 server ready, but had to build my own kube-proxy and kube-flannel Windows images as the published ones don't support 1909. I had to build setup.exe on another system and just ADD it into the container, as there isn't a golang:servercore1909 image to use as the build image.

I've followed the instructions at https://kubernetes.io/docs/tasks/administer-cluster/kubeadm/adding-windows-nodes/ to get everything up and running. I can successfully get a pod running a servercore 1909 image. When I exec into this pod, I can ping all the Linux cluster node IPs just fine. Route tables look accurate.
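For reference, a minimal sketch of the kind of workload involved, modelled on the sample style from those docs (the name, labels, and keep-alive command here are just placeholders):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: win-test                     # placeholder name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: win-test
  template:
    metadata:
      labels:
        app: win-test
    spec:
      nodeSelector:
        kubernetes.io/os: windows                         # schedule onto the Windows node only
      containers:
      - name: servercore
        image: mcr.microsoft.com/windows/servercore:1909  # must match the host's 1909 build
        command: ["ping", "-t", "localhost"]              # keep-alive so the pod stays Running for exec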

However, when I try to reach the service network, including coredns, my connections time out. So I can't do DNS lookups whatsoever.

I can reach outside my cluster fine as well (nslookup using our physical DNS server IP addresses)

The only thing that doesn't seem to be working is service network connectivity. The node does have a proper route to the service network, and I can see that \etc\cni\net.d\10-flannel.conf has the correct ExceptionList for OutBoundNAT covering both the service and pod networks, and also has a ROUTE-type endpoint policy with the destination set to the service network and NeedEncap: true.
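For reference, a minimal sketch of the relevant delegate section of that 10-flannel.conf, assuming the kubeadm/flannel defaults of 10.244.0.0/16 for pods and 10.96.0.0/12 for services, and a network name based on VNI 4096 (the generated file on a given node may differ):

{
  "name": "flannel.4096",
  "cniVersion": "0.3.0",
  "type": "flannel",
  "capabilities": { "dns": true },
  "delegate": {
    "type": "win-overlay",
    "policies": [
      {
        "Name": "EndpointPolicy",
        "Value": {
          "Type": "OutBoundNAT",
          "ExceptionList": [ "10.244.0.0/16", "10.96.0.0/12" ]
        }
      },
      {
        "Name": "EndpointPolicy",
        "Value": {
          "Type": "ROUTE",
          "DestinationPrefix": "10.96.0.0/12",
          "NeedEncap": true
        }
      }
    ]
  }
}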

@oe-hbk
Author

oe-hbk commented May 5, 2020

I found a solution to this, which was to add the --ip-masq arg to flanneld.exe. Is there a reason this wouldn't be the default, when the Linux default is to use it?

@neolit123
Member

I found a solution to this, which was to add the --ip-masq arg to flanneld.exe. Is there a reason this wouldn't be the default, when the Linux default is to use it?

@ksubrmnn @benmoss do you happen to know?

@ksubrmnn
Contributor

ksubrmnn commented May 5, 2020

It was included in the old scripts. Probably just got lost in the new version. Feel free to make a PR with the fix!

@oe-hbk
Author

oe-hbk commented May 6, 2020

Will create a PR. It seems odd to add it to the command line in run.ps1, so I will try to find a better way that is similar to the default Linux kube-flannel.yml.
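For comparison, the upstream Linux kube-flannel.yml already passes the flag in the DaemonSet container spec; a trimmed snippet (the image tag shown is just the one current around this time and may differ):

      containers:
      - name: kube-flannel
        image: quay.io/coreos/flannel:v0.12.0-amd64   # example tag; use whatever the manifest pins
        command:
        - /opt/bin/flanneld
        args:
        - --ip-masq
        - --kube-subnet-mgr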

@benmoss
Contributor

benmoss commented May 6, 2020

Hmm, curious that this hasn't been a problem for us so far. AFAIK service IPs work today; I would guess that this is covered by one of the conformance tests that is passing for us: https://k8s-testgrid.appspot.com/sig-windows#kubeadm-windows-gcp-k8s-stable

@gimlichael

gimlichael commented May 29, 2020

I found a solution to this, which was to add the --ip-masq arg to flanneld.exe. Is there a reason this wouldn't be the default, when the Linux default is to use it?

I too am having connection issues when assigning a service with type LoadBalancer. I tried your suggested fix of applying --ip-masq (wins cli process run --path /k/flannel/flanneld.exe --args "--ip-masq --kube-subnet-mgr --kubeconfig-file /k/flannel/kubeconfig.yml" --envs "POD_NAME=$env:POD_NAME POD_NAMESPACE=$env:POD_NAMESPACE"); however, I was unsuccessful in getting it to work.

From my Linux nodes, services work as expected; from Windows there is only connectivity from within the cluster; external IPs are not getting routed for some reason.

Any help is appreciated; I have been looking into this issue for days now.

Running K8s 1.18.3, Flannel 0.12, and custom images to support 1909. The network is host-gw/l2bridge, as I have never had success with vxlan.

@gimlichael

As a workaround, I set up an Ingress for one of the services; then it works fine (I guess because the Ingress is hosted on Linux, and connections inside the cluster work fine). However, the other ports I have opened up cannot be put behind the Ingress as they are non-HTTP protocols.

@gimlichael

As yet another workaround, for the non-HTTP protocols, I had to remove the LoadBalancer type and opt in to NodePort. Then I needed to reconfigure my router to translate a normal port to a node port.
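Roughly what that NodePort Service looks like (a sketch; the name, selector, and port numbers are placeholders for one of my non-HTTP services):

apiVersion: v1
kind: Service
metadata:
  name: my-tcp-service        # placeholder name
spec:
  type: NodePort
  selector:
    app: my-tcp-app           # placeholder selector
  ports:
  - protocol: TCP
    port: 5671                # in-cluster port
    targetPort: 5671
    nodePort: 30671           # must fall in 30000-32767; the router translates the normal port to this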

Do note that the two workarounds are only needed for Windows nodes; Linux nodes work as expected.

I really hope the SIG team will start to investigate these issues further.

Thanks.

@stef007a

What works for me, once all the pods are up and running on the Windows node, is the following:

Stop-Service kubelet
Start-Sleep -s 10
Stop-Service docker
Start-Service docker
Start-Service kubelet

Then the whole networking works: the LoadBalancer works, and ping from the pods works to all locations.
I'm still searching for a good solution. I get the same issue with the node setup from Microsoft's docs.

@ksubrmnn
Contributor

@JocelynBerrendonner FYI

@JocelynBerrendonner

Note that we added a couple of kube-proxy fixes for Windows hosts a couple of weeks ago, for issues that can lead to the symptoms described here:
kubernetes/kubernetes#91886
kubernetes/kubernetes#91706

Can you re-try with a Kube-Proxy version that has these fixes, please?

@sbangari and @Keith-Mange, FYI

@stef007a

@JocelynBerrendonner

Tested today with v1.18.4; no difference.

@JocelynBerrendonner

@Stefanbs23: Thank you for trying it! Unfortunately, v1.18.x doesn't seem to have these changes. That said, the changes should theoretically be in the next v1.19.x release.

@masaeedu

@JocelynBerrendonner Sorry, I'm fairly new to Kubernetes, so I'm probably misunderstanding how this works. Isn't it something about the routing (which afaik is governed by the CNI plugin) or the kube-proxy instance on the master node (which is Linux) that would be causing the problem? In other words, isn't the Windows kube-proxy instance only responsible for dealing with services whose pods are scheduled on Windows (which is not the case with coredns)?

@masaeedu

masaeedu commented Aug 10, 2020

Welp, empirically at least I'm wrong. I looked at kubectl get pods -o wide -n=kube-system and observed that the kube-proxy instance on Windows was in the state "CrashLoopBackoff". After I deleted the pod and it was rescheduled and entered the running state, name resolution magically started working inside Windows containers.

I guess I need to do more research on how network traffic flows through the system with the "Service" concept in Kubernetes.

@JocelynBerrendonner

@JocelynBerrendonner Sorry, I'm fairly new to Kubernetes, so I'm probably misunderstanding how this works. Isn't it something about the routing (which afaik is governed by the CNI plugin) or the kube-proxy instance on the master node (which is Linux) that would be causing the problem? In other words, isn't the Windows kube-proxy instance only responsible for dealing with services whose pods are scheduled on Windows (which is not the case with coredns)?

@masaeedu : this bug indeed tracks a Windows issue and has nothing to do with Linux. Each instance of Kube-Proxy is responsible for plumbing services connectivity on the node it runs on. The problem you describe seems unrelated to this bug. Let's open a different issue to track it!

@llyons

llyons commented Oct 27, 2020

I found a solution to this, which was to add the --ip-masq arg to flanneld.exe. Is there a reason this wouldn't be the default, when the Linux default is to use it?

I too am having connection issues when assigning a service with type LoadBalancer. I tried your suggested fix of applying --ip-masq (wins cli process run --path /k/flannel/flanneld.exe --args "--ip-masq --kube-subnet-mgr --kubeconfig-file /k/flannel/kubeconfig.yml" --envs "POD_NAME=$env:POD_NAME POD_NAMESPACE=$env:POD_NAMESPACE"); however, I was unsuccessful in getting it to work.

From my Linux nodes, services work as expected; from Windows there is only connectivity from within the cluster; external IPs are not getting routed for some reason.

Any help is appreciated; I have been looking into this issue for days now.

Running K8s 1.18.3, Flannel 0.12, and custom images to support 1909. The network is host-gw/l2bridge, as I have never had success with vxlan.

I am having this same issue: #103

Not sure how to resolve this; I have tried a number of things.

@Celthi

Celthi commented Dec 3, 2020

I'm having the same issue with the master node (Linux, v1.19.3) and worker node (Windows, v1.19.0).

@JocelynBerrendonner

@sbangari: heads up

@Celthi

Celthi commented Dec 5, 2020

I cannot reach the service in the Windows node from the Linux node.

[root@tec-l-014627 admin]# kubectl get pods -o wide
NAME                             READY   STATUS    RESTARTS   AGE   IP            NODE           NOMINATED NODE   READINESS GATES
curl-7ff5cb5cb7-sxvvn            1/1     Running   0          6s    10.244.0.86   tec-l-014627   <none>           <none>
win-webserver-7d6d8c7f79-vx24n   1/1     Running   0          86s   10.244.2.26   tec-w-013246   <none>           <none>

[root@tec-l-014627 admin]# kubectl exec -ti curl-7ff5cb5cb7-sxvvn -- /bin/sh
[ root@curl-7ff5cb5cb7-sxvvn:/ ]$ curl win-webserver
^C
[ root@curl-7ff5cb5cb7-sxvvn:/ ]$ curl 10.244.2.26
^C
[ root@curl-7ff5cb5cb7-sxvvn:/ ]$ curl 10.244.2.26
curl: (7) Failed to connect to 10.244.2.26 port 80: Connection timed out

[root@tec-l-014627 admin]# kubectl get svc
NAME            TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)        AGE
kubernetes      ClusterIP   10.96.0.1        <none>        443/TCP        3d13h
my-nginx        ClusterIP   10.105.250.101   <none>        80/TCP         2d1h
win-webserver   NodePort    10.98.131.225    <none>        80:31381/TCP   6s
[ root@curl-7ff5cb5cb7-sxvvn:/ ]$ curl 10.98.131.225
curl: (7) Failed to connect to 10.98.131.225 port 80: Connection timed out


From the Windows side (PowerShell), the same service IP responds:

PS C:\Users\admin> curl 10.98.131.225


StatusCode        : 200
StatusDescription : OK
Content           :
RawContent        : HTTP/1.1 200 OK
                    Transfer-Encoding: chunked
                    Content-Type: text/html
                    Date: Sat, 05 Dec 2020 04:31:19 GMT
                    Server: Microsoft-HTTPAPI/2.0


Forms             : {}
Headers           : {[Transfer-Encoding, chunked], [Content-Type, text/html], [Date, Sat, 05 Dec 2020 04:31:19 GMT],
                    [Server, Microsoft-HTTPAPI/2.0]}
Images            : {}
InputFields       : {}
Links             : {}
ParsedHtml        : mshtml.HTMLDocumentClass
RawContentLength  : 0


@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

k8s-ci-robot added the lifecycle/stale label on Mar 5, 2021
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten

k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label on Apr 4, 2021
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community.
/close


@joaoestevinho

joaoestevinho commented May 13, 2021

Having the same issue on a deployment with flannel vxlan, using Kubernetes 1.21 on both the control plane (Linux) and the node (Windows Server Core 2004).

I can send data between pods on the Windows node but have no connectivity from these pods to the outside or to any service IP hosted on either the Linux or Windows nodes.

@k8s-ci-robot
Contributor

@joaoestevinho: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to this:

/reopen

Having the same issue on a deployment with flannel vxlan, using Kubernetes 1.21 on both the control plane (Linux) and the node (Windows Server Core 2004).

I can send data between pods on the Windows node but have no connectivity from these pods to the outside or to any service IP hosted on either the Linux or Windows nodes.


@natejgardner

This seems to still be an active problem.
