Submariner doesn't work on Linode k8s clusters #2660

Jdelachi · 2023-08-25T14:31:33Z

What happened:
I tried to configure Submariner to establish connectivity between two Linode k8s clusters, but cross-cluster connectivity does not work.

What you expected to happen:
I expect to be able to establish connectivity between two Linode k8s clusters

How to reproduce it (as minimally and precisely as possible):

1. Create 2 Linode k8s clusters.
2. Install the Calico API server on both clusters (https://docs.tigera.io/calico/latest/operations/install-apiserver).
3. On cluster-a: subctl deploy-broker --globalnet --globalnet-cidr-range=240.0.0.0/8
4. On cluster-a: subctl join broker-info.subm --clusterid cluster-a --check-broker-certificate=false --clustercidr 10.2.0.0/16 --servicecidr 10.128.0.0/16
5. On cluster-b: subctl join broker-info.subm --clusterid cluster-b --check-broker-certificate=false --clustercidr 10.2.0.0/16 --servicecidr 10.128.0.0/16
6. Follow this official example to test connectivity: https://submariner.io/getting-started/quickstart/openshift/globalnet/
7. I get a timeout executing: curl nginx.default.svc.clusterset.local:8080

Anything else we need to know?:
On Linode, every k8s cluster is created with the same podCIDR and servicesCIDR:
podCIDR -> 10.2.0.0/16
servicesCIDR-> 10.128.0.0/16
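
Because both clusters report identical CIDRs, their pod and service ranges overlap, which is the situation Globalnet is meant to handle. A minimal Python sketch of that overlap check, using only the standard ipaddress module, is shown here. The dictionary layout and the function names (cidrs_overlap, conflicts) are illustrative assumptions, not part of subctl.

```python
import ipaddress

# CIDRs reported in this issue; both LKE clusters use the same values.
CLUSTER_A = {"pod": "10.2.0.0/16", "service": "10.128.0.0/16"}
CLUSTER_B = {"pod": "10.2.0.0/16", "service": "10.128.0.0/16"}


def cidrs_overlap(a, b):
    """Return True if the two CIDR strings share at least one address."""
    return ipaddress.ip_network(a).overlaps(ipaddress.ip_network(b))


def conflicts(cluster_a, cluster_b):
    """List (kind_a, kind_b) pairs whose CIDRs overlap across the clusters."""
    return [
        (kind_a, kind_b)
        for kind_a, cidr_a in cluster_a.items()
        for kind_b, cidr_b in cluster_b.items()
        if cidrs_overlap(cidr_a, cidr_b)
    ]


if __name__ == "__main__":
    for kind_a, kind_b in conflicts(CLUSTER_A, CLUSTER_B):
        print(f"cluster-a {kind_a} CIDR overlaps cluster-b {kind_b} CIDR")
```

Any non-empty result means the clusters cannot be connected with their native addresses and need a non-overlapping global range.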

Environment:

Diagnose information (use subctl diagnose all):

Cluster "lke126869"
✓ Checking Submariner support for the Kubernetes version
✓ Kubernetes version "v1.26.7" is supported

✓ Globalnet deployment detected - checking if globalnet CIDRs overlap
✓ Clusters do not have overlapping globalnet CIDRs
✓ Checking DaemonSet "submariner-gateway"
✓ Checking DaemonSet "submariner-routeagent"
✓ Checking DaemonSet "submariner-globalnet"
✓ Checking DaemonSet "submariner-metrics-proxy"
✓ Checking Deployment "submariner-lighthouse-agent"
✓ Checking Deployment "submariner-lighthouse-coredns"
✓ Checking the status of all Submariner pods
✓ Checking if gateway metrics are accessible from non-gateway nodes
✓ The gateway metrics are accessible
✓ Checking if globalnet metrics are accessible from non-gateway nodes
✓ The globalnet metrics are accessible

✓ Checking Submariner support for the CNI network plugin
✓ The detected CNI network plugin ("calico") is supported
✓ Calico CNI detected, checking if the Submariner IPPool pre-requisites are configured
✓ Checking gateway connections
✓ All connections are established
✓ Checking Submariner support for the kube-proxy mode
✓ The kube-proxy mode is supported
✓ Checking the firewall configuration to determine if intra-cluster VXLAN traffic is allowed
✓ The firewall configuration allows intra-cluster VXLAN traffic
✓ Checking Globalnet configuration
✓ Globalnet is properly configured and functioning

✓ Checking if services have been exported properly
✓ All services have been exported properly

Cluster "lke126870"
✓ Checking Submariner support for the Kubernetes version
✓ Kubernetes version "v1.26.7" is supported

✓ Globalnet deployment detected - checking if globalnet CIDRs overlap
✓ Clusters do not have overlapping globalnet CIDRs
✓ Checking DaemonSet "submariner-gateway"
✓ Checking DaemonSet "submariner-routeagent"
✓ Checking DaemonSet "submariner-globalnet"
✓ Checking DaemonSet "submariner-metrics-proxy"
✓ Checking Deployment "submariner-lighthouse-agent"
✓ Checking Deployment "submariner-lighthouse-coredns"
✓ Checking the status of all Submariner pods
✓ Checking if gateway metrics are accessible from non-gateway nodes
✓ The gateway metrics are accessible
✓ Checking if globalnet metrics are accessible from non-gateway nodes
✓ The globalnet metrics are accessible

✓ Checking Submariner support for the CNI network plugin
✓ The detected CNI network plugin ("calico") is supported
✓ Calico CNI detected, checking if the Submariner IPPool pre-requisites are configured
✓ Checking gateway connections
✓ All connections are established
✓ Checking Submariner support for the kube-proxy mode
✓ The kube-proxy mode is supported
✗ Checking the firewall configuration to determine if intra-cluster VXLAN traffic is allowed
✗ The tcpdump output from the sniffer pod does not contain the expected remote endpoint IP 240.0.0.0. Please check that your firewall configuration allows UDP/4800 traffic. Actual pod output:
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on vx-submariner, link-type EN10MB (Ethernet), snapshot length 262144 bytes

0 packets captured
0 packets received by filter
0 packets dropped by kernel

✓ Checking Globalnet configuration
✓ Globalnet is properly configured and functioning

✓ Checking if services have been exported properly
✓ All services have been exported properly

Gather information (use subctl gather):
submariner-20230825140946.zip
Cloud provider or hardware configuration:
2 Linode LKE -> shared CPU, 4GB RAM, 2 Worker Nodes
Install tools:
kubectl
Others:

Jdelachi · 2023-08-27T13:55:30Z

adding more info:

subctl version: v0.16.0-m3

dfarrell07 · 2023-08-29T13:21:57Z

@sridhargaddam was this related to submariner-io/submariner-operator#2769?

sridhargaddam · 2023-08-29T17:13:43Z

> @sridhargaddam was this related to submariner-io/submariner-operator#2769?

I'm afraid not, this is a different issue.

sridhargaddam · 2023-08-30T19:02:35Z

A couple of observations after looking at the logs:

I did not find any issues in the logs of either cluster.
On all the cluster nodes, I see a WireGuard interface. Please check whether it is interfering with the inter-cluster traffic.

wg0: <POINTOPOINT,NOARP,UP,LOWER_UP> mtu 1420 qdisc noqueue state UNKNOWN group default qlen 1000
    link/none  promiscuity 0 minmtu 0 maxmtu 2147483552 
    wireguard numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535 
    inet 172.31.1.1/32 scope global wg0
       valid_lft forever preferred_lft forever

I think the clusters are currently deployed with Calico IPPool config as follows.

  ipipMode: Always
  natOutgoing: true
  nodeSelector: all()
  vxlanMode: Never

Instead of ipipMode, can you try VXLAN mode and see if it works?

The CNI interface is properly detected in the logs (whereas in the logs that you shared on Slack, I was seeing some errors). I think this is because you manually specified --clustercidr as part of the join. This is fine.
In the output of subctl diagnose ... on the lke126870 cluster, we can see the following error.

✗ The tcpdump output from the sniffer pod does not contain the expected remote endpoint IP 240.0.0.0. Please check that your firewall configuration allows UDP/4800 traffic. Actual pod output

This cluster has two nodes: the Gateway node (lke126870-188081-64e8b27e9db4) and a non-Gateway node (lke126870-188081-64e8b27ef7d6). The above error implies that the datapath is not working between the non-GW node and the GW node.
Can you manually deploy a pod on the non-GW node and ping 240.0.255.254 (the health-check IP of cluster-a) from it? While the ping is running, run tcpdump on the vx-submariner interface on the GW node of the same cluster. Ideally you should see the ping traffic on this interface. If nothing shows up in tcpdump, some firewall configuration on the underlay is blocking the traffic.

Jdelachi · 2023-08-31T14:19:04Z

@sridhargaddam
(I have deployed 2 new k8s clusters because I deleted the previous ones.)
I have performed additional tests. As you pointed out, there is a connectivity issue between the non-GW node and the GW node:

(Cluster-b) If I curl nginx.default.svc.clusterset.local:8080 from a pod on the GW node, it works.
(Cluster-b) If I curl nginx.default.svc.clusterset.local:8080 from a pod on the non-GW node, it does NOT work.
If I remove 1 node from cluster-b so that it is a single-node cluster, it works.
To check whether the issue was specific to the CIDR, I configured the broker with the following command: subctl deploy-broker --globalnet --globalnet-cidr-range=70.0.0.0/8
With this CIDR, subctl diagnose all reports no errors:

Cluster "lke127833"
✓ Checking Submariner support for the Kubernetes version
✓ Kubernetes version "v1.26.7" is supported

✓ Globalnet deployment detected - checking if globalnet CIDRs overlap
✓ Clusters do not have overlapping globalnet CIDRs
✓ Checking DaemonSet "submariner-gateway"
✓ Checking DaemonSet "submariner-routeagent"
✓ Checking DaemonSet "submariner-globalnet"
✓ Checking DaemonSet "submariner-metrics-proxy"
✓ Checking Deployment "submariner-lighthouse-agent"
✓ Checking Deployment "submariner-lighthouse-coredns"
✓ Checking the status of all Submariner pods
✓ Checking if gateway metrics are accessible from non-gateway nodes
✓ The gateway metrics are accessible
✓ Checking if globalnet metrics are accessible from non-gateway nodes
✓ The globalnet metrics are accessible

✓ Checking Submariner support for the CNI network plugin
✓ The detected CNI network plugin ("calico") is supported
✓ Calico CNI detected, checking if the Submariner IPPool pre-requisites are configured
✓ Checking gateway connections
✓ All connections are established
✓ Checking Submariner support for the kube-proxy mode
✓ The kube-proxy mode is supported
✓ Checking the firewall configuration to determine if intra-cluster VXLAN traffic is allowed
✓ The firewall configuration allows intra-cluster VXLAN traffic
✓ Checking Globalnet configuration
✓ Globalnet is properly configured and functioning

✓ Checking if services have been exported properly
✓ All services have been exported properly

Cluster "lke127834"
✓ Checking Submariner support for the Kubernetes version
✓ Kubernetes version "v1.26.7" is supported

✓ Globalnet deployment detected - checking if globalnet CIDRs overlap
✓ Clusters do not have overlapping globalnet CIDRs
✓ Checking DaemonSet "submariner-gateway"
✓ Checking DaemonSet "submariner-routeagent"
✓ Checking DaemonSet "submariner-globalnet"
✓ Checking DaemonSet "submariner-metrics-proxy"
✓ Checking Deployment "submariner-lighthouse-agent"
✓ Checking Deployment "submariner-lighthouse-coredns"
✓ Checking the status of all Submariner pods
✓ Checking if gateway metrics are accessible from non-gateway nodes
✓ The gateway metrics are accessible
✓ Checking if globalnet metrics are accessible from non-gateway nodes
✓ The globalnet metrics are accessible

✓ Checking Submariner support for the CNI network plugin
✓ The detected CNI network plugin ("calico") is supported
✓ Calico CNI detected, checking if the Submariner IPPool pre-requisites are configured
✓ Checking gateway connections
✓ All connections are established
✓ Checking Submariner support for the kube-proxy mode
✓ The kube-proxy mode is supported
✓ Checking the firewall configuration to determine if intra-cluster VXLAN traffic is allowed
✓ The firewall configuration allows intra-cluster VXLAN traffic
✓ Checking Globalnet configuration
✓ Globalnet is properly configured and functioning

✓ Checking if services have been exported properly
✓ All services have been exported properly

On the GW node, this is the vx-submariner network interface:

15: vx-submariner: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UNKNOWN group default
link/ether 8a:d3:45:25:25:90 brd ff:ff:ff:ff:ff:ff
inet 240.232.146.72/8 brd 240.255.255.255 scope global vx-submariner
valid_lft forever preferred_lft forever
inet6 fe80::88d3:45ff:fe25:2590/64 scope link
valid_lft forever preferred_lft forever

If I ping the IP 240.0.255.254 from a pod on a non-GW node, I get "Destination Host Unreachable":

bash-5.0# ping 240.0.255.254
PING 240.0.255.254 (240.0.255.254) 56(84) bytes of data.
From 240.232.146.73 icmp_seq=1 Destination Host Unreachable
From 240.232.146.73 icmp_seq=2 Destination Host Unreachable
From 240.232.146.73 icmp_seq=3 Destination Host Unreachable
From 240.232.146.73 icmp_seq=4 Destination Host Unreachable
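
One detail worth checking in the output above is plain address arithmetic: the ping target 240.0.255.254 lies inside the /8 configured on the vx-submariner interface (240.232.146.72/8 on the GW node). The Python sketch below verifies only that containment with the standard ipaddress module. It assumes the non-GW node's interface (the 240.232.146.73 source of the ICMP errors) uses the same /8 prefix length, which the thread does not show, and it does not establish that this is the cause of the failure.

```python
import ipaddress

# Address shown on the GW node's vx-submariner interface in the output above.
GW_VX_INTERFACE = ipaddress.ip_interface("240.232.146.72/8")
# Source of the ICMP errors seen by the pod; the /8 prefix length is an assumption.
NON_GW_VX_INTERFACE = ipaddress.ip_interface("240.232.146.73/8")
# Ping target used in the test (the health-check IP of cluster-a).
TARGET = ipaddress.ip_address("240.0.255.254")


def is_on_link(target, interface):
    """Return True if target lies in the subnet directly attached to interface."""
    return target in interface.network


if __name__ == "__main__":
    for name, iface in (("GW", GW_VX_INTERFACE), ("non-GW", NON_GW_VX_INTERFACE)):
        on_link = is_on_link(TARGET, iface)
        print(f"{name}: is {TARGET} on-link for {iface.network}? {on_link}")
```

If the target is on-link for that interface, the node tries to resolve it directly on vx-submariner instead of routing it, which would be consistent with "Destination Host Unreachable" replies coming from the node's own 240.x address.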

Jdelachi · 2023-09-04T08:02:43Z

As per the Slack discussion, it seems that Submariner doesn't support Calico as the CNI with IP-in-IP encapsulation mode.
At the moment, Submariner supports the Calico CNI only with VXLAN encapsulation.

@sridhargaddam Do you think this should be moved to an enhancement request (support Calico CNI with IP-in-IP encapsulation)?

yboaron · 2023-09-05T08:02:51Z

@Jdelachi As Sridhar mentioned, this issue is similar to #2489.

A. It would be helpful (it might give some more pointers) if you could test Submariner after changing the default IPPool to VxLAN:always.

B. As for IPinIP mode, we noticed that Submariner works with Calico on some platforms (like IBM ROKS) even when the IPPool encapsulation is set to IPinIP always, but further debugging is needed here to understand where and why the packets are getting dropped.

Jdelachi · 2023-09-05T08:12:20Z

@yboaron I have tried to set IPPool to VxLAN:always but it breaks the cluster:

When I execute kubectl get svc:

E0831 20:22:48.064276 86865 memcache.go:287] couldn't get resource list for projectcalico.org/v3: the server is currently unable to handle the request
E0831 20:22:48.111709 86865 memcache.go:121] couldn't get resource list for projectcalico.org/v3: the server is currently unable to handle the request
E0831 20:22:48.165108 86865 memcache.go:121] couldn't get resource list for projectcalico.org/v3: the server is currently unable to handle the request
E0831 20:22:48.221098 86865 memcache.go:121] couldn't get resource list for projectcalico.org/v3: the server is currently unable to handle the request
NAME         TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
kubernetes   ClusterIP   10.128.0.1      <none>        443/TCP    68m
nginx        ClusterIP   10.128.190.65   <none>        8080/TCP   20s

When I execute subctl export service --namespace default nginx:

✗ Failed to export Service: the server could not find the requested resource

yboaron · 2023-09-05T08:31:41Z

OK, then set it back to IPinIP.

Meanwhile, as a workaround, you can try changing rp_filter to 2 for eth0 (see [1]) on all non-GW nodes and see if that helps.

[1]

$ sysctl -w net.ipv4.conf.eth0.rp_filter=2
net.ipv4.conf.eth0.rp_filter = 2
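
For context on the value being set: per the Linux kernel ip-sysctl documentation, rp_filter 0 disables source validation, 1 is strict mode, and 2 is loose mode, and the kernel uses the larger of the conf.all and per-interface values. The Python sketch below parses a line like the sysctl output above and reports the effective mode. The function names are illustrative, and the conf.all value of 1 is an assumed example, not something reported in this thread.

```python
# Meaning of each rp_filter value, per the Linux kernel ip-sysctl documentation.
RP_FILTER_MODES = {
    0: "off (no source validation)",
    1: "strict (reverse path must use the receiving interface)",
    2: "loose (source must be reachable via any interface)",
}


def parse_sysctl_line(line):
    """Split a 'key = value' line printed by sysctl into (key, int value)."""
    key, sep, value = line.partition("=")
    if not sep:
        raise ValueError(f"not a sysctl assignment: {line!r}")
    return key.strip(), int(value.strip())


def effective_rp_filter(all_value, iface_value):
    """The kernel applies the larger of conf.all and conf.<iface>."""
    return max(all_value, iface_value)


if __name__ == "__main__":
    key, value = parse_sysctl_line("net.ipv4.conf.eth0.rp_filter = 2")
    # conf.all = 1 is an assumed example value, not taken from the thread.
    mode = effective_rp_filter(1, value)
    print(f"{key} -> effective mode {mode}: {RP_FILTER_MODES[mode]}")
```

Because the maximum wins, setting eth0 to 2 relaxes the check even if conf.all is 1, whereas setting eth0 to 0 would not.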

yboaron · 2023-09-18T11:55:58Z

@Jdelachi Any update on this issue?

Jdelachi · 2023-09-18T18:34:17Z

Hi @yboaron, the rp_filter change didn't fix it; the behavior is the same.

yboaron · 2023-09-20T14:03:58Z

Thanks for the update @Jdelachi,
A. Could you please upload the latest 'subctl gather' logs?
B. Is there any SG/firewall in your environment that might block traffic from the remote cluster?

Jdelachi · 2023-09-22T08:09:49Z

A) I attach the zip file with the content
submariner-20230922080350.zip

B) There is no firewall, just the Calico CNI using IP-in-IP encapsulation, which enables BGP among the nodes.

yboaron · 2023-10-03T07:44:15Z

Thanks @Jdelachi,

I didn't find any issues in the logs.
It seems that:
A. pod@GW_node_clusterA to Service/pod@GW_node_clusterB is OK
B. pod@NON_GW_node_clusterA to Service/pod@NON_GW_node_clusterB fails

which suggests a datapath issue between the GW node and the non-GW node.

Could you please run test B while running tcpdump on all 4 nodes?

As for the firewall, maybe there is a firewall rule at the infra level that blocks inter-cluster traffic?

For inter-cluster traffic, ClusterA, for example, has to accept received (Rx) packets whose srcIP is an IP from the ClusterB GN (Globalnet) range (70.1.0.0/16) and whose destIP is an IP from the ClusterA pod CIDR range. Some infrastructures only allow traffic when both srcIP and destIP are in the local cluster's pod CIDR range.
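
The kind of infra-level rule described above can be modelled in a few lines of Python. This is a hypothetical model for illustration only: the function strict_infra_allows and the sample addresses are assumptions, and nothing in this thread confirms that Linode applies such a rule.

```python
import ipaddress

# Ranges mentioned in this thread.
CLUSTER_A_POD_CIDR = ipaddress.ip_network("10.2.0.0/16")
CLUSTER_B_GLOBALNET_CIDR = ipaddress.ip_network("70.1.0.0/16")


def strict_infra_allows(src, dst, local_pod_cidr):
    """Hypothetical infra rule: pass a packet only if both its source and
    destination addresses belong to the local cluster's pod CIDR."""
    return (
        ipaddress.ip_address(src) in local_pod_cidr
        and ipaddress.ip_address(dst) in local_pod_cidr
    )


if __name__ == "__main__":
    # Sample addresses are made up; only the ranges come from the thread.
    local = strict_infra_allows("10.2.1.5", "10.2.3.7", CLUSTER_A_POD_CIDR)
    remote = strict_infra_allows("70.1.0.10", "10.2.3.7", CLUSTER_A_POD_CIDR)
    print(f"pod-to-pod inside ClusterA allowed: {local}")
    print(f"ClusterB Globalnet source to ClusterA pod allowed: {remote}")
```

Under such a rule, intra-cluster pod traffic passes while packets arriving with a remote Globalnet source address are dropped between nodes, which would match a failure that only appears on non-GW nodes.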

dfarrell07 · 2023-11-14T14:11:16Z

If we get more debugging info or someone with cycles to focus on Calico they can find this with the label. For now, closing due to inactivity.

eremcan · 2023-12-21T06:56:14Z

I'm also stuck at the same place with the RKE1 engine. It looks like there is a bug. Please refer to:
https://github.com/submariner-io/submariner/issues/2841

PS: I don't use IP-in-IP mode, but the result is the same (RKE1 setup with Canal CNI).

root@d4kcp-node02:/opt# kubectl get configmap canal-config -n kube-system -o yaml
apiVersion: v1
REDACTED
  masquerade: "true"
  net-conf.json: |
    {
      "Network": "10.42.0.0/16",
      "Backend": {
        "Type": "vxlan"
      }
    }
  typha_service_name: none
  veth_mtu: "1450"

Any update on this issue?

Jdelachi added the bug label Aug 25, 2023

sridhargaddam added the Calico label Aug 25, 2023

dfarrell07 added the needs-triage label Aug 29, 2023

sridhargaddam mentioned this issue Sep 1, 2023

Submariner is not working for LKE #2489

Closed

yboaron self-assigned this Sep 5, 2023

skitt removed the needs-triage label Sep 5, 2023

tpantelis changed the title ~~Submarine doesn't work on Linode k8s clusters~~ Submariner doesn't work on Linode k8s clusters Sep 20, 2023

yboaron added the need-info label Oct 22, 2023

dfarrell07 closed this as not planned Nov 14, 2023

eremcan mentioned this issue Jan 10, 2024

Vx-Submariner issue On Rancher RKE1 Setup #2841

Closed
