I have inconsistencies on startup with FRR related to import route-targets on an EVPN VRF. Sometimes, the route-target import xxx:yyy is ignored (not shown in a show run) or not replacing the automatic one of *:<vni>. In both cases, all routes matching the assigned part of the RT are imported, breaking the isolation from the admin part of the RT.
This is related to adding a VXLAN SVI (vxlan + bridge + vrf) during FRR startup when the configuration is loaded, e.g. in a lab with containerlab.
This is brought up by the following containerlab snippet (there are actually 6 in the lab but are all the same, the lab is templatized):
name: evpn
mgmt:
  network: fixedips
  ipv4-subnet: 172.20.20.0/22
topology:
  nodes:
    sw01:
      kind: linux
      image: quay.io/frrouting/frr:10.1.1
      binds:
        - frr-daemons:/etc/frr/daemons
        - sw01.conf:/etc/frr/frr.conf
      mgmt-ipv4: 172.20.20.11
      exec:
        - ip link add vrf-1777 up type vrf table 1777
        - ip link add br-1777 up master vrf-1777 type bridge
        - ip link add vxlan-1777 up master br-1777 type vxlan id 1777 dstport 4789 local 172.20.20.11 nolearning
        - ip link set eth1 up master vrf-vpc-1777
        - ip addr add 169.254.0.0/31 dev eth1
the RR at 172.20.21.1 announces evpn type-5 routes with the 12876:1777 RT (192.168.2.2/32 via 172.20.23.243)
the ebgp sessions at 169.254.0.1 all announce the 192.168.1.1/32 prefix, re-announced as an evpn type-5 route with the 64699:1777 RT
Quite often (in the 10% range) the bring-up fails on the route-target import 12876:1777 part: either the line is missing outright, or it is ignored and the default 0:1777 RT is used instead, acting as a wildcard.
However, with the route-target import 12876:1777 line we expect only 192.168.2.2/32 to be imported, but it is not the case:
# sh ip route vrf vrf-1777
-- snip --
VRF vrf-1777:
C>* 169.254.0.0/31 is directly connected, eth1, 00:15:05
L>* 169.254.0.0/32 is directly connected, eth1, 00:15:05
B>* 192.168.1.1/32 [200/200] via 172.20.20.11, br-1777 onlink, weight 1, 00:15:00
* via 172.20.20.12, br-1777 onlink, weight 1, 00:15:00
B>* 192.168.2.2/32 [200/100] via 172.20.23.243, br-1777 onlink, weight 1, 00:15:00
expected output being:
sw01# sh ip route vrf vrf-1777
-- snip --
VRF vrf-vpc-1777:
C>* 169.254.0.0/31 is directly connected, eth1, 00:39:56
L>* 169.254.0.0/32 is directly connected, eth1, 00:39:56
B>* 192.168.1.1/32 [20/0] via 169.254.0.1, eth1, weight 1, 00:39:51
B>* 192.168.2.2/32 [200/200] via 172.20.23.243, br-1777 onlink, weight 1, 00:39:50
Inspecting bgp yields some... weird results:
# sh bgp l2vpn evpn vrf-import-rt
Route-target: 12876:1777
List of VRFs importing routes with this route-target:
vrf-1777
Route-target: 0:1777
List of VRFs importing routes with this route-target:
vrf-1777
0:1777 is the default one being created for auto-rt, which is supposed to be removed by the manual import rt. Correct frr instances show the following:
# sh bgp l2vpn evpn vrf-import-rt
Route-target: 12876:1777
List of VRFs importing routes with this route-target:
vrf-1777
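To make the effect of the leftover 0:1777 entry concrete, here is a minimal Python sketch of the import decision. It is a simplified model and not FRR's actual code: it assumes, as described above, that an import RT with admin part 0 matches on the assigned part (the VNI) only. The names `RouteTarget`, `parse_rt` and `imports` are made up for the illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RouteTarget:
    admin: int     # administrator part, e.g. the ASN 12876
    assigned: int  # assigned number, here the VNI 1777


def parse_rt(text: str) -> RouteTarget:
    admin, assigned = text.split(":")
    return RouteTarget(int(admin), int(assigned))


def imports(import_rts, route_rt, wildcard_admin=0):
    """Return True if a route carrying route_rt would be imported.

    Assumption: an import RT whose admin part equals wildcard_admin
    (displayed as 0:<vni>) matches on the assigned part only.
    """
    for rt in import_rts:
        if rt.admin == wildcard_admin and rt.assigned == route_rt.assigned:
            return True
        if rt == route_rt:
            return True
    return False


# The two type-5 routes from this report and the RT each one carries.
routes = {
    "192.168.2.2/32": parse_rt("12876:1777"),
    "192.168.1.1/32": parse_rt("64699:1777"),
}

healthy = [parse_rt("12876:1777")]
broken = [parse_rt("12876:1777"), parse_rt("0:1777")]

for name, rts in (("healthy", healthy), ("broken", broken)):
    imported = sorted(p for p, rt in routes.items() if imports(rts, rt))
    print(name, imported)
```

With only 12876:1777 in the import list, just 192.168.2.2/32 is accepted; adding 0:1777 also lets in 192.168.1.1/32, which matches the unexpected routing table shown earlier.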
After some discussions with Trey on slack, we tried to down-up the SVI on the kernel side:
ip link set vrf-1777 down
ip link set br-1777 down
ip link set vxlan-1777 down
ip link set vrf-1777 up
ip link set br-1777 up
ip link set vxlan-1777 up
This fixed the routing table shown by show ip route vrf vrf-1777, but sh bgp l2vpn evpn vrf-import-rt still gave the incorrect output. In other words, the routing table now matches my intent, but it no longer matches what bgpd reports as its import RTs. bgpd is now even more inconsistent.
Further discussion broke this even more. We tried to remove and re-add the import RT, which corrupted bgpd's state further:
sw01(config)# router bgp 12876 vrf vrf-1777
sw01(config-bgp)# address-family l2vpn evpn
sw01(config-router-af)# no route-target import 12876:1777
sw01(config-router-af)# do sh bgp l2vpn evpn vrf-import-rt
Route-target: 0:1777
List of VRFs importing routes with this route-target:
vrf-1777
vrf-1777
sw01(config-router-af)# route-target import 12876:1777
% RT specified already configured for this VRF: 12876:1777
sw01(config-router-af)# do sh bgp l2vpn evpn vrf-import-rt
Route-target: 0:1777
List of VRFs importing routes with this route-target:
vrf-1777
vrf-1777
(the double vrf-1777 outputs are not copy/paste artifacts, those were the actual vtysh outputs).
So in this state:
the same vrf is mentioned twice
12876:1777 is not there anymore but still there at the same time
and the routing table was "correct" for 12876:1777 but incorrect for 0:1777 (which is what bgpd tells us it filters on)
As for why this is very likely a race between netlink and vtysh: when I change the containerlab exec section from
exec:
- ip link add vrf-1777 up type vrf table 1777
- ip link add br-1777 up master vrf-1777 type bridge
- ip link add vxlan-1777 up master br-1777 type vxlan id 1777 dstport 4789 local 172.20.20.11 nolearning
to
exec:
- ip link add vrf-1777 type vrf table 1777
- ip link add br-1777 master vrf-1777 type bridge
- ip link add vxlan-1777 master br-1777 type vxlan id 1777 dstport 4789 local 172.20.20.11 nolearning
- ip link set vrf-1777 up
- ip link set br-1777 up
- ip link set vxlan-1777 up
the issue could not be reproduced in 20+ restarts of the lab (which has 6 nodes susceptible to the bug), while it happens every two to three restarts on average with the original scripts.
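Since the lab is templatized, the create-then-up ordering can be generated per node. The following is only a sketch in Python: the helper name `svi_exec_commands` is hypothetical, and the device naming (vrf-<vni>, br-<vni>, vxlan-<vni>) and the VRF table number equal to the VNI simply mirror the snippet above.

```python
def svi_exec_commands(vni: int, local_ip: str) -> list[str]:
    """Build the exec list for one SVI: create all devices first, then bring them up."""
    vrf, br, vx = f"vrf-{vni}", f"br-{vni}", f"vxlan-{vni}"
    create = [
        f"ip link add {vrf} type vrf table {vni}",
        f"ip link add {br} master {vrf} type bridge",
        f"ip link add {vx} master {br} type vxlan id {vni} "
        f"dstport 4789 local {local_ip} nolearning",
    ]
    # Bring the devices up only after all of them exist.
    up = [f"ip link set {dev} up" for dev in (vrf, br, vx)]
    return create + up


for cmd in svi_exec_commands(1777, "172.20.20.11"):
    print(f"- {cmd}")
```

The printed lines can be pasted under a node's exec: key; none of the ip link add commands carries the up keyword.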
I did not dig into the code as I don't have time for this right now, but I hope to.
Thanks!
Version
sw01# show version
FRRouting 10.1.1_git (sw01) on Linux(6.11.5-arch1-1).
Copyright 1996-2005 Kunihiro Ishiguro, et al.
configured with:
'--prefix=/usr' '--sysconfdir=/etc' '--localstatedir=/var' '--sbindir=/usr/lib/frr' '--libdir=/usr/lib' '--enable-rpki' '--enable-vtysh' '--enable-multipath=64' '--enable-vty-group=frrvty' '--enable-user=frr' '--enable-group=frr' '--enable-pcre2posix' '--enable-scripting' 'CC=gcc' 'CXX=g++'
How to reproduce
Start an FRR with the above config a bunch of times, creating the svi as frr loads its configuration file. The ebgp peer can be skipped as it does not matter for the issue.
I am using containerlab for convenience, and it may have just the right timing to trigger the issue. A lab with 6 nodes triggers the issue every two to three restarts of the lab (containerlab deploy --reconfigure).
As this is a timing issue, the hardware used is important. I'm running this on a Ryzen 4650U with an NVMe drive.
Expected behavior
I expect my import rt to be the only one used:
sw02# sh bgp l2vpn evpn vrf-import-rt
Route-target: 12876:1777
List of VRFs importing routes with this route-target:
vrf-1777
Actual behavior
My RT and the default "catch-all" RT are present:
sw01# sh bgp l2vpn evpn vrf-import-rt
Route-target: 12876:1777
List of VRFs importing routes with this route-target:
vrf-1777
Route-target: 0:1777
List of VRFs importing routes with this route-target:
vrf-1777
With no way of clearing the 0:1777 RT.
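When restarting the lab many times, it helps to spot affected nodes automatically. Below is a rough Python sketch that parses the text of sh bgp l2vpn evpn vrf-import-rt as pasted in this report and flags a leftover 0:<vni> entry or a VRF listed twice. The function names are hypothetical and the parser only assumes the output layout shown here; how the output is collected from each node (for example through vtysh -c) is left out.

```python
import re


def parse_vrf_import_rt(output: str) -> dict:
    """Map each route-target to the list of VRF names printed under it."""
    table = {}
    current = None
    for raw in output.splitlines():
        line = raw.strip()
        match = re.match(r"Route-target:\s*(\S+)$", line)
        if match:
            current = match.group(1)
            table.setdefault(current, [])
        elif current and line and not line.startswith("List of VRFs"):
            table[current].append(line)
    return table


def find_problems(table: dict, vni: int) -> list:
    """Report the two symptoms from this issue: stray auto RT, duplicated VRF."""
    problems = []
    if f"0:{vni}" in table:
        problems.append(f"auto RT 0:{vni} still present")
    for rt, vrfs in table.items():
        dupes = sorted({v for v in vrfs if vrfs.count(v) > 1})
        if dupes:
            problems.append(f"{rt} lists {', '.join(dupes)} more than once")
    return problems


SAMPLE = """\
Route-target: 12876:1777
List of VRFs importing routes with this route-target:
vrf-1777
Route-target: 0:1777
List of VRFs importing routes with this route-target:
vrf-1777
"""

print(find_problems(parse_vrf_import_rt(SAMPLE), 1777))
```

An empty list means the node shows only the configured import RT.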
Additional context
This is not the first time I've had consistency issues with FRR when bringing up the parts of an SVI as I create them (ip link add ... up ...). Creating them first and bringing them up afterwards is much more robust.
Checklist
I have searched the open issues for this bug.
I have not included sensitive information in this report.
Environment:
Docker image: quay.io/frrouting/frr:10.1.1
Docker version: 27.3.1, build ce1223035a
Kernel version: 6.11.5-arch1-1
Containerlab version: 0.56.0, commit: b593b206