Submariner doesn't work on Linode k8s clusters #2660

Closed
Jdelachi opened this issue Aug 25, 2023 · 16 comments
Labels: bug (Something isn't working), Calico, need-info

Comments


Jdelachi commented Aug 25, 2023

What happened:
I have tried to configure Submariner to establish connectivity between 2 Linode k8s clusters, but the connectivity is not successful.

What you expected to happen:
I expect to be able to establish connectivity between two Linode k8s clusters.

How to reproduce it (as minimally and precisely as possible):

  1. Create 2 Linode k8s clusters
  2. Install the Calico API server on both clusters (https://docs.tigera.io/calico/latest/operations/install-apiserver)
  3. On cluster-a: subctl deploy-broker --globalnet --globalnet-cidr-range=240.0.0.0/8
  4. On cluster-a: subctl join broker-info.subm --clusterid cluster-a --check-broker-certificate=false --clustercidr 10.2.0.0/16 --servicecidr 10.128.0.0/16
  5. On cluster-b: subctl join broker-info.subm --clusterid cluster-b --check-broker-certificate=false --clustercidr 10.2.0.0/16 --servicecidr 10.128.0.0/16
  6. Follow this official example to test connectivity: https://submariner.io/getting-started/quickstart/openshift/globalnet/
  7. I get a timeout when executing: curl nginx.default.svc.clusterset.local:8080

Anything else we need to know?:
In Linode the k8s cluster always has the same podCIDR and servicesCIDR.
podCIDR -> 10.2.0.0/16
servicesCIDR-> 10.128.0.0/16
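
Because both clusters come up with identical pod and service ranges, their CIDRs overlap, which is the situation Globalnet is meant to handle. A minimal sketch using Python's standard ipaddress module illustrates the overlap; the helper name cidrs_overlap and the constants are illustrative only and are not part of Submariner or subctl.

```python
import ipaddress

# CIDRs reported in this issue for every Linode (LKE) cluster.
POD_CIDR = "10.2.0.0/16"
SERVICE_CIDR = "10.128.0.0/16"


def cidrs_overlap(a: str, b: str) -> bool:
    """Return True if the two CIDR blocks share at least one address."""
    return ipaddress.ip_network(a).overlaps(ipaddress.ip_network(b))


# cluster-a and cluster-b use the same ranges, so each pair overlaps.
print(cidrs_overlap(POD_CIDR, POD_CIDR))          # True
print(cidrs_overlap(SERVICE_CIDR, SERVICE_CIDR))  # True

# Within one cluster, the pod and service ranges are disjoint.
print(cidrs_overlap(POD_CIDR, SERVICE_CIDR))      # False
```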

Environment:

  • Diagnose information (use subctl diagnose all):

Cluster "lke126869"
✓ Checking Submariner support for the Kubernetes version
✓ Kubernetes version "v1.26.7" is supported

✓ Globalnet deployment detected - checking if globalnet CIDRs overlap
✓ Clusters do not have overlapping globalnet CIDRs
✓ Checking DaemonSet "submariner-gateway"
✓ Checking DaemonSet "submariner-routeagent"
✓ Checking DaemonSet "submariner-globalnet"
✓ Checking DaemonSet "submariner-metrics-proxy"
✓ Checking Deployment "submariner-lighthouse-agent"
✓ Checking Deployment "submariner-lighthouse-coredns"
✓ Checking the status of all Submariner pods
✓ Checking if gateway metrics are accessible from non-gateway nodes
✓ The gateway metrics are accessible
✓ Checking if globalnet metrics are accessible from non-gateway nodes
✓ The globalnet metrics are accessible

✓ Checking Submariner support for the CNI network plugin
✓ The detected CNI network plugin ("calico") is supported
✓ Calico CNI detected, checking if the Submariner IPPool pre-requisites are configured
✓ Checking gateway connections
✓ All connections are established
✓ Checking Submariner support for the kube-proxy mode
✓ The kube-proxy mode is supported
✓ Checking the firewall configuration to determine if intra-cluster VXLAN traffic is allowed
✓ The firewall configuration allows intra-cluster VXLAN traffic
✓ Checking Globalnet configuration
✓ Globalnet is properly configured and functioning

✓ Checking if services have been exported properly
✓ All services have been exported properly

Cluster "lke126870"
✓ Checking Submariner support for the Kubernetes version
✓ Kubernetes version "v1.26.7" is supported

✓ Globalnet deployment detected - checking if globalnet CIDRs overlap
✓ Clusters do not have overlapping globalnet CIDRs
✓ Checking DaemonSet "submariner-gateway"
✓ Checking DaemonSet "submariner-routeagent"
✓ Checking DaemonSet "submariner-globalnet"
✓ Checking DaemonSet "submariner-metrics-proxy"
✓ Checking Deployment "submariner-lighthouse-agent"
✓ Checking Deployment "submariner-lighthouse-coredns"
✓ Checking the status of all Submariner pods
✓ Checking if gateway metrics are accessible from non-gateway nodes
✓ The gateway metrics are accessible
✓ Checking if globalnet metrics are accessible from non-gateway nodes
✓ The globalnet metrics are accessible

✓ Checking Submariner support for the CNI network plugin
✓ The detected CNI network plugin ("calico") is supported
✓ Calico CNI detected, checking if the Submariner IPPool pre-requisites are configured
✓ Checking gateway connections
✓ All connections are established
✓ Checking Submariner support for the kube-proxy mode
✓ The kube-proxy mode is supported
✗ Checking the firewall configuration to determine if intra-cluster VXLAN traffic is allowed
✗ The tcpdump output from the sniffer pod does not contain the expected remote endpoint IP 240.0.0.0. Please check that your firewall configuration allows UDP/4800 traffic. Actual pod output:
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on vx-submariner, link-type EN10MB (Ethernet), snapshot length 262144 bytes

0 packets captured
0 packets received by filter
0 packets dropped by kernel

✓ Checking Globalnet configuration
✓ Globalnet is properly configured and functioning

✓ Checking if services have been exported properly
✓ All services have been exported properly

  • Gather information (use subctl gather):
    submariner-20230825140946.zip

  • Cloud provider or hardware configuration:
    2 Linode LKE -> shared CPU, 4GB RAM, 2 Worker Nodes

  • Install tools:
    kubectl

  • Others:

Jdelachi added the bug label Aug 25, 2023
@Jdelachi
Author

Adding more info:

  • subctl version: v0.16.0-m3

@dfarrell07
Member

@sridhargaddam was this related to submariner-io/submariner-operator#2769?

@sridhargaddam
Member

> @sridhargaddam was this related to submariner-io/submariner-operator#2769?

I'm afraid no, this is a different issue.

@sridhargaddam
Member

A couple of observations after looking at the logs:

  1. I did not find any issues in the logs of either cluster.
  2. On all the cluster nodes, I see a WireGuard interface. Please check whether it is causing any issues for the inter-cluster traffic.
wg0: <POINTOPOINT,NOARP,UP,LOWER_UP> mtu 1420 qdisc noqueue state UNKNOWN group default qlen 1000
    link/none  promiscuity 0 minmtu 0 maxmtu 2147483552 
    wireguard numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535 
    inet 172.31.1.1/32 scope global wg0
       valid_lft forever preferred_lft forever
  3. I think the clusters are currently deployed with the following Calico IPPool config.
  ipipMode: Always
  natOutgoing: true
  nodeSelector: all()
  vxlanMode: Never

Instead of IP-in-IP mode, can you try VXLAN mode and see if it works?

  4. The CNI interface is properly detected in the logs (whereas in the logs that you shared on Slack, I was seeing some errors). I think this is because you manually specified --clustercidr as part of the join. This is fine.

  5. In the output of subctl diagnose ... on the lke126870 cluster, we can see the following error.

✗ The tcpdump output from the sniffer pod does not contain the expected remote endpoint IP 240.0.0.0. Please check that your firewall configuration allows UDP/4800 traffic. Actual pod output

This cluster has two nodes: a Gateway node (lke126870-188081-64e8b27e9db4) and a non-Gateway node (lke126870-188081-64e8b27ef7d6). The above error implies that the datapath is not working between the non-GW node and the GW node.
Can you manually deploy a pod on the non-GW node and try pinging 240.0.255.254 (which is the healthcheck IP of cluster-a)? While the ping is running, run tcpdump on the vx-submariner interface on the GW node of the same cluster. You should ideally see the ping traffic on this interface. If you do not see anything in the tcpdump output, it means some firewall configuration on the underlay is blocking the traffic.

@Jdelachi
Author

@sridhargaddam
(I have deployed 2 new k8s clusters because I deleted the previous ones.)
I have performed additional tests. As you pointed out, there is a connectivity issue between the non-GW node and the GW node:

  • (Cluster-b) If I curl nginx.default.svc.clusterset.local:8080 from a pod in the GW node, it works.
  • (Cluster-b) If I curl nginx.default.svc.clusterset.local:8080 from a pod in the No-GW node, it does NOT work.
  • If I remove 1 node from cluster-b and keep just a cluster of 1 node, it works.
  • I have configured the broker with the following command to check whether there was an issue with the CIDR in particular: subctl deploy-broker --globalnet --globalnet-cidr-range=70.0.0.0/8
    With this CIDR (70.0.0.0/8), subctl diagnose all reports no errors:

Cluster "lke127833"
✓ Checking Submariner support for the Kubernetes version
✓ Kubernetes version "v1.26.7" is supported

✓ Globalnet deployment detected - checking if globalnet CIDRs overlap
✓ Clusters do not have overlapping globalnet CIDRs
✓ Checking DaemonSet "submariner-gateway"
✓ Checking DaemonSet "submariner-routeagent"
✓ Checking DaemonSet "submariner-globalnet"
✓ Checking DaemonSet "submariner-metrics-proxy"
✓ Checking Deployment "submariner-lighthouse-agent"
✓ Checking Deployment "submariner-lighthouse-coredns"
✓ Checking the status of all Submariner pods
✓ Checking if gateway metrics are accessible from non-gateway nodes
✓ The gateway metrics are accessible
✓ Checking if globalnet metrics are accessible from non-gateway nodes
✓ The globalnet metrics are accessible

✓ Checking Submariner support for the CNI network plugin
✓ The detected CNI network plugin ("calico") is supported
✓ Calico CNI detected, checking if the Submariner IPPool pre-requisites are configured
✓ Checking gateway connections
✓ All connections are established
✓ Checking Submariner support for the kube-proxy mode
✓ The kube-proxy mode is supported
✓ Checking the firewall configuration to determine if intra-cluster VXLAN traffic is allowed
✓ The firewall configuration allows intra-cluster VXLAN traffic
✓ Checking Globalnet configuration
✓ Globalnet is properly configured and functioning

✓ Checking if services have been exported properly
✓ All services have been exported properly

Cluster "lke127834"
✓ Checking Submariner support for the Kubernetes version
✓ Kubernetes version "v1.26.7" is supported

✓ Globalnet deployment detected - checking if globalnet CIDRs overlap
✓ Clusters do not have overlapping globalnet CIDRs
✓ Checking DaemonSet "submariner-gateway"
✓ Checking DaemonSet "submariner-routeagent"
✓ Checking DaemonSet "submariner-globalnet"
✓ Checking DaemonSet "submariner-metrics-proxy"
✓ Checking Deployment "submariner-lighthouse-agent"
✓ Checking Deployment "submariner-lighthouse-coredns"
✓ Checking the status of all Submariner pods
✓ Checking if gateway metrics are accessible from non-gateway nodes
✓ The gateway metrics are accessible
✓ Checking if globalnet metrics are accessible from non-gateway nodes
✓ The globalnet metrics are accessible

✓ Checking Submariner support for the CNI network plugin
✓ The detected CNI network plugin ("calico") is supported
✓ Calico CNI detected, checking if the Submariner IPPool pre-requisites are configured
✓ Checking gateway connections
✓ All connections are established
✓ Checking Submariner support for the kube-proxy mode
✓ The kube-proxy mode is supported
✓ Checking the firewall configuration to determine if intra-cluster VXLAN traffic is allowed
✓ The firewall configuration allows intra-cluster VXLAN traffic
✓ Checking Globalnet configuration
✓ Globalnet is properly configured and functioning

✓ Checking if services have been exported properly
✓ All services have been exported properly

  • On the GW node, this is the network interface:

15: vx-submariner: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN group default
link/ether 8a:d3:45:25:25:90 brd ff:ff:ff:ff:ff:ff
inet 240.232.146.72/8 brd 240.255.255.255 scope global vx-submariner
valid_lft forever preferred_lft forever
inet6 fe80::88d3:45ff:fe25:2590/64 scope link
valid_lft forever preferred_lft forever

  • If I ping the IP 240.0.255.254 from a pod in a No-GW node, I get "Destination Host Unreachable":

bash-5.0# ping 240.0.255.254
PING 240.0.255.254 (240.0.255.254) 56(84) bytes of data.
From 240.232.146.73 icmp_seq=1 Destination Host Unreachable
From 240.232.146.73 icmp_seq=2 Destination Host Unreachable
From 240.232.146.73 icmp_seq=3 Destination Host Unreachable
From 240.232.146.73 icmp_seq=4 Destination Host Unreachable
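
One difference between the two Globalnet ranges tried above is that 240.0.0.0/8 lies inside the IANA reserved block 240.0.0.0/4, while 70.0.0.0/8 does not. Whether that explains the different subctl diagnose results is not established in this thread; the following sketch only shows how to check the property with Python's standard ipaddress module. The helper name is_reserved_range is hypothetical.

```python
import ipaddress


def is_reserved_range(cidr: str) -> bool:
    """Return True if the whole CIDR block is inside IETF-reserved space."""
    return ipaddress.ip_network(cidr).is_reserved


# The two Globalnet ranges tried in this issue.
for cidr in ("240.0.0.0/8", "70.0.0.0/8"):
    print(cidr, "reserved:", is_reserved_range(cidr))
# 240.0.0.0/8 reserved: True
# 70.0.0.0/8 reserved: False
```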

@Jdelachi
Author

Jdelachi commented Sep 4, 2023

As per the Slack discussion, it seems that Submariner doesn't support Calico as the CNI with IP-in-IP encapsulation mode.
At the moment, Submariner supports the Calico CNI only with VXLAN encapsulation.

@sridhargaddam Do you think this should be moved to an enhancement request (support Calico CNI with IP-in-IP encapsulation)?

yboaron self-assigned this Sep 5, 2023
@yboaron
Contributor

yboaron commented Sep 5, 2023

@Jdelachi As Sridhar mentioned, this issue is similar to #2489.

A. It would be helpful (it might give some more pointers) if you could test Submariner after changing the default IPPool to VxLAN:always.

B. As for IPinIP mode, we noticed that Submariner works on some platforms with Calico (like IBM ROKS) even when the IPPool encapsulation is set to IPinIP always, but yes, further debugging is needed here to understand where and why the packets are getting dropped.

@Jdelachi
Author

Jdelachi commented Sep 5, 2023

@yboaron I have tried to set IPPool to VxLAN:always but it breaks the cluster:

When I execute kubectl get svc:

E0831 20:22:48.064276 86865 memcache.go:287] couldn't get resource list for projectcalico.org/v3: the server is currently unable to handle the request
E0831 20:22:48.111709 86865 memcache.go:121] couldn't get resource list for projectcalico.org/v3: the server is currently unable to handle the request
E0831 20:22:48.165108 86865 memcache.go:121] couldn't get resource list for projectcalico.org/v3: the server is currently unable to handle the request
E0831 20:22:48.221098 86865 memcache.go:121] couldn't get resource list for projectcalico.org/v3: the server is currently unable to handle the request
NAME         TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
kubernetes   ClusterIP   10.128.0.1      <none>        443/TCP    68m
nginx        ClusterIP   10.128.190.65   <none>        8080/TCP   20s

When I execute subctl export service --namespace default nginx:

✗ Failed to export Service: the server could not find the requested resource

@yboaron
Contributor

yboaron commented Sep 5, 2023

OK, then set it back to IPinIP.

Meanwhile, as a workaround, you can try changing rp_filter to 2 for eth0 (see [1]) on all non-GW nodes and see if that helps.

[1]

$ sysctl -w net.ipv4.conf.eth0.rp_filter=2
net.ipv4.conf.eth0.rp_filter = 2
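
For context on the workaround above: rp_filter value 0 disables source validation, 1 is strict mode and 2 is loose mode, and the Linux kernel applies the maximum of the conf/all and per-interface values when validating packets on an interface. A minimal sketch of that rule follows; the function names are hypothetical, and read_rp_filter assumes a Linux node with the usual /proc/sys layout.

```python
from pathlib import Path

# Meaning of each rp_filter value according to the kernel's ip-sysctl docs.
RP_FILTER_MODES = {0: "no source validation", 1: "strict", 2: "loose"}


def effective_rp_filter(all_value: int, iface_value: int) -> int:
    """The kernel uses max(conf/all, conf/<iface>) for source validation."""
    return max(all_value, iface_value)


def read_rp_filter(iface: str, root: str = "/proc/sys/net/ipv4/conf") -> int:
    """Read the current rp_filter setting for an interface (Linux only)."""
    return int(Path(root, iface, "rp_filter").read_text().strip())


# With conf/all at 1 (strict) and eth0 set to 2, eth0 ends up in loose mode.
mode = effective_rp_filter(all_value=1, iface_value=2)
print(mode, RP_FILTER_MODES[mode])  # 2 loose
```

On a node, effective_rp_filter(read_rp_filter("all"), read_rp_filter("eth0")) would report the mode actually in force after running the sysctl command.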

skitt removed the needs-triage label Sep 5, 2023
@yboaron
Contributor

yboaron commented Sep 18, 2023

@Jdelachi Any update on this issue?

@Jdelachi
Author

Hi @yboaron, it didn't fix it; the behavior is the same.

@yboaron
Contributor

yboaron commented Sep 20, 2023

Thanks for the update, @Jdelachi.
A. Could you please upload the latest 'subctl gather' logs?
B. Is there any security group or firewall in your environment that might block traffic from the remote cluster?

tpantelis changed the title from "Submarine doesn't work on Linode k8s clusters" to "Submariner doesn't work on Linode k8s clusters" Sep 20, 2023
@Jdelachi
Author

A) I have attached the zip file with the content:
submariner-20230922080350.zip

B) There is no firewall, just the Calico CNI using IP-in-IP encapsulation, which enables BGP among the nodes.

@yboaron
Contributor

yboaron commented Oct 3, 2023

Thanks @Jdelachi,

  1. I didn't find any issues in the logs.
  2. It seems that:
    A. pod@GW_node_clusterA to Service/pod@GW_node_clusterB is OK
    B. pod@NON_GW_node_clusterA to Service/pod@NON_GW_node_clusterB fails

which suggests a datapath issue between the GW node and the non-GW node.

Could you please run test B while running tcpdump on all 4 nodes?

  3. As for the firewall, maybe there's a firewall rule at the infra level that blocks inter-cluster traffic?

For inter-cluster traffic, ClusterA, for example, must accept received packets whose source IP is from ClusterB's GN range (70.1.0.0/16) and whose destination IP is from ClusterA's pod CIDR range. Some infrastructures only allow traffic when both the source IP and the destination IP are in the local cluster's pod CIDR range.
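
The kind of infra-level rule described above can be modeled with a short sketch. It assumes the pod CIDR reported earlier in this issue (10.2.0.0/16) and the ClusterB GN range mentioned above (70.1.0.0/16); the function names and the sample addresses are hypothetical, and this is not how Linode or Submariner actually filter traffic.

```python
import ipaddress

LOCAL_POD_CIDR = ipaddress.ip_network("10.2.0.0/16")    # ClusterA pod CIDR
REMOTE_GLOBALNET = ipaddress.ip_network("70.1.0.0/16")  # ClusterB GN range


def is_inter_cluster_rx(src: str, dst: str) -> bool:
    """Packet arriving from the remote cluster for a local pod."""
    return (ipaddress.ip_address(src) in REMOTE_GLOBALNET
            and ipaddress.ip_address(dst) in LOCAL_POD_CIDR)


def strict_infra_allows(src: str, dst: str) -> bool:
    """Hypothetical infra rule: both ends must be in the local pod CIDR."""
    return (ipaddress.ip_address(src) in LOCAL_POD_CIDR
            and ipaddress.ip_address(dst) in LOCAL_POD_CIDR)


# Sample (made-up) addresses: a remote Globalnet source and a local pod.
src, dst = "70.1.0.5", "10.2.1.10"
print(is_inter_cluster_rx(src, dst))   # True: this is inter-cluster traffic
print(strict_infra_allows(src, dst))   # False: such an infra would drop it
```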

@dfarrell07
Member

If we get more debugging info or someone with cycles to focus on Calico they can find this with the label. For now, closing due to inactivity.

dfarrell07 closed this as not planned (won't fix, can't repro, duplicate, stale) Nov 14, 2023
@eremcan

eremcan commented Dec 21, 2023

I'm also stuck at the same place with the RKE1 engine. It looks like there is a bug or something. Please refer to:
https://github.com/submariner-io/submariner/issues/2841

PS: I don't use IP-in-IP mode, but the result is the same (RKE1 setup with the Canal CNI).

root@d4kcp-node02:/opt# kubectl get configmap canal-config -n kube-system -o yaml
apiVersion: v1
REDACTED
  masquerade: "true"
  net-conf.json: |
    {
      "Network": "10.42.0.0/16",
      "Backend": {
        "Type": "vxlan"
      }
    }
  typha_service_name: none
  veth_mtu: "1450"

Any update on this issue?
