Edge Router Session Sync
When many Edge Routers need to reconnect to a Ziti Edge controller at the same time, re-establishing those connections puts heavy strain on the controller. This scenario can be caused by network connectivity issues, controller restarts, or automation that starts many Edge Routers at once.
The following list enumerates items that occur when an Edge Router connects to an Edge enabled Controller:
- The Edge Router establishes a control plane connection to the Controller
- The Controller verifies the identity of the connection for Fabric control messaging
- The Controller begins to synchronize and update routing tables between the new router and existing routers
- The Controller verifies the Router for Edge Router status
- The Controller begins to stream all API Sessions and Sessions that are valid connections to that Edge Router
This article proposes a way to lessen the Controller's strain by changing how item 5 above operates. Item 5 attempts to stream ALL API Sessions and Sessions. On a large system this may be tens of thousands of items, which takes considerable time to send. It is also doubtful how many of those sessions are immediately necessary, as zero or more Edge connections may be present. In an outage scenario, any existing Edge clients have most likely reconnected to another Edge Router. In non-redundant deployments, they may continue attempting to connect to a single Edge Router.
The proposed flow is:
- On Edge Router to Controller connection, the Controller sends no API Sessions or Sessions.
- On connection, each Edge Router chooses a random delay of between 3 and 30 seconds, after which it begins requesting any outstanding API Sessions by sending the newest `CreatedAt` (DateTime) value it has seen, or null. For this document, the request will be called a `Session Sync Message`.
- The controller responds with a configurable number of API Sessions and their related Sessions. The configuration will be a controller-level value.
- The Edge Router receives the list and notes the new newest `CreatedAt` value.
- The Edge Router chooses a new timeout, up to 30s, randomly.
- The above steps repeat until the controller responds with 0 new API Sessions. At this point, the Edge Router considers itself 'synchronized' and receives regular delta updates.
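The loop described above can be sketched in Go as follows. This is only an illustration under stated assumptions, not the Ziti implementation: `APISession`, `Controller`, `HandleSync`, `SyncAll`, and `demo` are hypothetical names, the controller is an in-memory stand-in, related Sessions are omitted, and the sleep function is injected so the example finishes instantly.

```go
package main

import (
	"fmt"
	"math/rand"
	"sort"
	"time"
)

// APISession is a hypothetical, stripped-down API Session record.
type APISession struct {
	ID        string
	CreatedAt time.Time
}

// Controller is a hypothetical in-memory stand-in for the controller.
type Controller struct {
	sessions  []APISession // kept sorted by CreatedAt, oldest first
	batchSize int          // controller-level setting: max API Sessions per response
}

// HandleSync answers a Session Sync Message. A nil newest means the
// Edge Router has not seen any API Session yet.
func (c *Controller) HandleSync(newest *time.Time) []APISession {
	start := 0
	if newest != nil {
		// First session strictly newer than what the router already has.
		start = sort.Search(len(c.sessions), func(i int) bool {
			return c.sessions[i].CreatedAt.After(*newest)
		})
	}
	end := start + c.batchSize
	if end > len(c.sessions) {
		end = len(c.sessions)
	}
	return c.sessions[start:end]
}

// randomDelay picks a delay in the inclusive range [lo, hi].
func randomDelay(rng *rand.Rand, lo, hi time.Duration) time.Duration {
	return lo + time.Duration(rng.Int63n(int64(hi-lo)+1))
}

// SyncAll runs the Edge Router side of the loop until the controller
// returns an empty batch, and reports what was learned and in how many rounds.
func SyncAll(c *Controller, rng *rand.Rand, sleep func(time.Duration)) ([]APISession, int) {
	var known []APISession
	var newest *time.Time
	rounds := 0
	delay := randomDelay(rng, 3*time.Second, 30*time.Second) // initial 3-30 s delay
	for {
		sleep(delay)
		batch := c.HandleSync(newest)
		rounds++
		if len(batch) == 0 {
			return known, rounds // synchronized; delta updates take over
		}
		known = append(known, batch...)
		t := batch[len(batch)-1].CreatedAt // note the new newest CreatedAt
		newest = &t
		delay = randomDelay(rng, 0, 30*time.Second) // new random timeout, up to 30 s
	}
}

// demo builds a controller holding total sessions and syncs them in batches.
func demo(total, batchSize int) (int, int) {
	base := time.Date(2020, 1, 1, 0, 0, 0, 0, time.UTC)
	c := &Controller{batchSize: batchSize}
	for i := 0; i < total; i++ {
		c.sessions = append(c.sessions, APISession{
			ID:        fmt.Sprintf("api-session-%d", i),
			CreatedAt: base.Add(time.Duration(i) * time.Second),
		})
	}
	rng := rand.New(rand.NewSource(1))
	noSleep := func(time.Duration) {} // skip real waiting in the example
	known, rounds := SyncAll(c, rng, noSleep)
	return len(known), rounds
}

func main() {
	synced, rounds := demo(25, 10)
	// prints: synced 25 API Sessions in 4 sync rounds
	fmt.Printf("synced %d API Sessions in %d sync rounds\n", synced, rounds)
}
```

One limitation of this sketch: because it pages on `CreatedAt` alone, API Sessions that share an identical `CreatedAt` across a batch boundary would be skipped. A real implementation would likely need a tie-breaker, which the proposal does not specify.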
There are some caveats.
- The controller may, and under heavy load will, ignore any `Session Sync Message`. Any message outstanding for longer than 2 seconds will be ignored by both the Edge Router and Controller. As such, `Session Sync Message`s are considered the lowest priority in the system.
  - What exactly constitutes heavy load is up for debate.
- A higher priority `Session Check Message` must be introduced for scenarios where an Edge Router is handling an incoming connection for an API Session or Session it is not aware of. These messages should be handled as quickly as reasonably possible.
- `Session Check Message`s may be abused. Each Edge Router must not allow a session check on the same id to be performed faster than once every second. In addition, a configurable maximum number of outstanding `Session Check Message`s should be enforced. Clients attempting to connect where a `Session Check Message` is required may be ignored and experience a peer connection reset if the Edge Router is past its configurable maximum outstanding checks.
- Edge Routers will have to maintain a list of "dirty" or "unverified" sessions after reconnecting to a controller. Those sessions will have to be verified by the controller before the Edge Router times them out.
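The two abuse limits on `Session Check Message`s (at most one check per id per second, and a cap on outstanding checks) could be enforced on the Edge Router with something like the Go sketch below. `CheckLimiter`, `Begin`, `Done`, and the error values are hypothetical names chosen for illustration. The sketch assumes the caller passes in the current time and calls `Done` once the controller answers or the check times out.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

var (
	ErrTooSoon            = errors.New("session check for this id is being repeated too quickly")
	ErrTooManyOutstanding = errors.New("maximum outstanding session checks reached")
)

// CheckLimiter is a hypothetical Edge Router side guard for Session Check Messages.
type CheckLimiter struct {
	mu             sync.Mutex
	minInterval    time.Duration        // per-id spacing; one second in the proposal
	maxOutstanding int                  // configurable cap on in-flight checks
	lastCheck      map[string]time.Time // when a check was last started, per session id
	outstanding    int
}

func NewCheckLimiter(minInterval time.Duration, maxOutstanding int) *CheckLimiter {
	return &CheckLimiter{
		minInterval:    minInterval,
		maxOutstanding: maxOutstanding,
		lastCheck:      make(map[string]time.Time),
	}
}

// Begin reserves a slot for a check on id, or explains why the check is refused.
func (l *CheckLimiter) Begin(id string, now time.Time) error {
	l.mu.Lock()
	defer l.mu.Unlock()
	if last, ok := l.lastCheck[id]; ok && now.Sub(last) < l.minInterval {
		return ErrTooSoon
	}
	if l.outstanding >= l.maxOutstanding {
		return ErrTooManyOutstanding
	}
	l.lastCheck[id] = now
	l.outstanding++
	return nil
}

// Done releases the slot taken by a successful Begin.
func (l *CheckLimiter) Done() {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.outstanding > 0 {
		l.outstanding--
	}
}

func describe(err error) string {
	switch {
	case err == nil:
		return "accepted"
	case errors.Is(err, ErrTooSoon):
		return "rejected: too soon"
	default:
		return "rejected: too many outstanding"
	}
}

// demo walks through a short sequence of checks with a cap of two outstanding.
func demo() []string {
	l := NewCheckLimiter(time.Second, 2)
	t0 := time.Date(2020, 1, 1, 0, 0, 0, 0, time.UTC)
	var out []string
	out = append(out, describe(l.Begin("s1", t0)))                           // first check on s1
	out = append(out, describe(l.Begin("s1", t0.Add(500*time.Millisecond)))) // s1 again within one second
	out = append(out, describe(l.Begin("s2", t0.Add(500*time.Millisecond)))) // second outstanding check
	out = append(out, describe(l.Begin("s3", t0.Add(600*time.Millisecond)))) // cap of two is reached
	l.Done()                                                                 // the controller answered one check
	out = append(out, describe(l.Begin("s3", t0.Add(700*time.Millisecond)))) // a slot is free again
	return out
}

func main() {
	for i, result := range demo() {
		fmt.Printf("check %d: %s\n", i+1, result)
	}
}
```

In this sketch a refused check is not recorded in `lastCheck`, so a client rejected only because of the outstanding cap can be checked as soon as a slot frees up. The map also grows without bound here; a real router would presumably expire old entries.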
During normal operation, there should be no noticeable difference in how the Ziti Edge systems act.
During a controller outage:
- Until all `Session Sync Message`s are handled, re-connecting Edge clients will see a noticeable delay when connecting to an Edge Router "in recovery", as `Session Check Message`s are used to validate individual incoming connections
- Edge SDKs will encounter scenarios where their connections are rejected even though they are perfectly legal per the defined policies.
- if the edge router rejects an SDK connection: send a message notifying the SDK that the edge router is in a degraded/resync state
- need to have a configuration for how long a session is allowed to remain connected to the Edge Router when it lost connection to the controller
- a pluggable strategy for session syncing / controller defense
  - strategy in the controller to defend itself
  - strategy in the edge router to sync
- rate limit control channel from edge router and from SDK
- report on sending bad requests (SDK or ER)
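As a rough illustration of the "pluggable strategy" idea on the controller side, the Go sketch below defines a small interface and two interchangeable strategies. Everything here is hypothetical (`DefenseStrategy`, `SyncRequest`, `DropStale`, `ShedUnderLoad`, `All`). It reuses the 2 second staleness rule from the caveats, and because "heavy load" is left undefined in this document, the load test is injected as a function.

```go
package main

import (
	"fmt"
	"time"
)

// SyncRequest is a hypothetical envelope for a Session Sync Message
// as seen by the controller.
type SyncRequest struct {
	RouterID   string
	ReceivedAt time.Time
}

// DefenseStrategy is a hypothetical plug-in point: the controller asks it
// whether a given Session Sync Message should be answered at all.
type DefenseStrategy interface {
	Admit(req SyncRequest, now time.Time) bool
}

// DropStale ignores requests that have been outstanding longer than MaxAge.
type DropStale struct{ MaxAge time.Duration }

func (d DropStale) Admit(req SyncRequest, now time.Time) bool {
	return now.Sub(req.ReceivedAt) <= d.MaxAge
}

// ShedUnderLoad ignores every request while Busy reports true.
// What counts as busy is deliberately left to the caller.
type ShedUnderLoad struct{ Busy func() bool }

func (s ShedUnderLoad) Admit(SyncRequest, time.Time) bool {
	return !s.Busy()
}

// All admits a request only if every wrapped strategy admits it.
type All []DefenseStrategy

func (a All) Admit(req SyncRequest, now time.Time) bool {
	for _, s := range a {
		if !s.Admit(req, now) {
			return false
		}
	}
	return true
}

// demo reports whether a request of the given age is admitted
// while the controller is or is not busy.
func demo(ageMillis int, busy bool) bool {
	now := time.Date(2020, 1, 1, 0, 0, 0, 0, time.UTC)
	req := SyncRequest{
		RouterID:   "er-1",
		ReceivedAt: now.Add(-time.Duration(ageMillis) * time.Millisecond),
	}
	strategy := All{
		DropStale{MaxAge: 2 * time.Second},
		ShedUnderLoad{Busy: func() bool { return busy }},
	}
	return strategy.Admit(req, now)
}

func main() {
	fmt.Println("fresh request, idle controller:", demo(500, false))
	fmt.Println("stale request, idle controller:", demo(2500, false))
	fmt.Println("fresh request, busy controller:", demo(500, true))
}
```

An Edge Router side sync strategy could sit behind a similar interface, for example one that decides the next delay before sending a `Session Sync Message`.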