This repository has been archived by the owner on Dec 14, 2023. It is now read-only.

Edge Router Session Sync

Andrew edited this page Nov 12, 2020 · 15 revisions

Problem

When many Edge Routers need to reconnect at the same time, the controller comes under heavy strain as it re-establishes connections. This scenario can be encountered due to network connectivity issues, controller restarts, or automation that starts many Edge Routers at the same time.

The following steps occur when an Edge Router connects to an Edge-enabled Controller:

  1. The Edge Router establishes a control plane connection to the Controller
  2. The Controller verifies the identity of the connection for Fabric control messaging
  3. The Controller begins to synchronize and update routing tables between the new router and existing routers
  4. The Controller verifies the Router for Edge Router status
  5. The Controller begins to stream all API Sessions and Sessions that are valid connections to that Edge Router

This article proposes a way to lessen the Controller's strain by changing how item 5 above operates. Item 5 attempts to stream ALL API Sessions and Sessions. On a large system this may be tens of thousands of items, which takes considerable time to send. It is also doubtful how many of those sessions are immediately necessary, as zero or more Edge connections may be present. In an outage scenario, any existing Edge clients have most likely reconnected to another Edge Router. In non-redundant deployments, they may continue attempting to connect to a single Edge Router.

Proposed Solution

  • When an Edge Router connects to the Controller, the Controller sends no API Sessions or Sessions.
  • On connection, each Edge Router will choose a random delay of between 3 and 30 seconds, after which it will begin requesting any outstanding API Sessions by sending the newest CreatedAt (DateTime) value it has seen, or null if it has seen none. For this document, the request will be called a Session Sync Message.
  • The controller will respond with a configurable number of API Sessions and their related Sessions. The configuration will be a controller-level value.
  • The Edge Router will receive the list and record the newest CreatedAt value it contains.
  • The Edge Router will choose a new random delay of up to 30 seconds before sending the next Session Sync Message.
  • The above steps will repeat until the controller responds with 0 new API Sessions. At this point, the Edge Router will consider itself 'synchronized' and will receive regular delta updates.

There are some caveats.

  • Under heavy load, the controller may, and will, ignore Session Sync Messages. Any message outstanding for longer than 2 seconds will be ignored by both the Edge Router and the Controller. As such, Session Sync Messages are considered the lowest priority in the system.
    • What exactly constitutes heavy load is up for debate.
  • A higher priority Session Check Message must be introduced in scenarios where Edge Routers are handling an incoming connection for an API Session or Session it is not aware of. These messages should be handled as quickly as reasonably possible.
  • Session Check Messages may be abused. Each Edge Router must not allow a session check on the same id to be performed faster than once every second. In addition, a configurable maximum number of outstanding Session Check Messages should be enforced. Clients attempting to connect where a Session Check Message is required may be ignored and experience a peer connection reset if the Edge Router is past its configurable maximum outstanding checks.
  • Edge Routers will have to maintain a list of "dirty" or "unverified" sessions after reconnecting to a controller. Those sessions will have to be verified by the controller before the Edge Router times them out.

Impact

During normal operation, there should be no noticeable difference in how the Ziti Edge systems act.

During a controller outage:

  • until all Session Sync Messages are handled, re-connecting Edge clients will see a noticeable delay when connecting to an Edge Router "in recovery" as Session Check Messages are used to validate individual incoming connections
  • Edge SDKs will encounter scenarios where their connections are rejected even though they are perfectly legal per the defined policies.

Discussion Output

  • if the edge router rejects an SDK connection: send a message notifying the SDK that the edge router is in a degraded/resync state
  • need to have a configuration for how long a session is allowed to remain connected to the Edge Router when it lost connection to the controller
  • a pluggable strategy for session sync'ing / controller defense
    • strategy in the controller to defend itself
    • strategy in the edge router to sync
  • rate limit control channel from edge router and from SDK
    • report on sending bad requests (SDK or ER)