- Author(s): @markdroth
- Approver: @ejona86, @dfawley
- Status: Implemented
- Implemented in: C-core
- Last updated: 2024-02-28
- Discussion at: https://groups.google.com/g/grpc-io/c/C1XAnRc2Z7E
Our original design for xDS aggregate clusters incorrectly assumed that the intended behavior was for the LB policy to be configured at the aggregate cluster layer, not at the underlying cluster layers. This added a lot of complexity. It also forced us to impose a limitation that features like stateful session affinity and outlier detection would work only within a priority, rather than across all priorities in the cluster. This design describes how we will correct these problems.
When we originally implemented xDS aggregate cluster support in gRFC A37, we misunderstood how the feature actually worked. Our discussions of aggregate cluster behavior in envoyproxy/envoy#13134 assumed that the core idea of an aggregate cluster was that it concatenated the priorities from the underlying clusters into a single list, so that it could use a single LB policy defined at the aggregate cluster layer to choose a priority from that combined list. Note that in Envoy, priorities are normally handled by the LB policy, so this would have made sense as a way to use the normal logic for choosing a priority.
However, it turns out that aggregate clusters don't actually define the LB policy in the aggregate cluster; instead, the aggregate cluster uses a special cluster-provided LB policy that first chooses the underlying cluster and then delegates to the LB policy of the underlying cluster. This is covered in the Envoy documentation and can be seen in Envoy's implementation of the aggregate cluster LB policy.
This incorrectly-perceived requirement led us to a design in which each of a cluster's priorities could have its configuration come from a different underlying cluster. This in turn meant that features like outlier detection (gRFC A50) and stateful session affinity (gRFCs A55 and A60) were designed to work only within a specific priority rather than working across all priorities within a cluster. This limitation is causing problems for stateful session affinity and is likely a source of friction for outlier detection as well.
This design describes how we are going to change our aggregate cluster implementation to solve these problems. It also describes some changes to the stateful session affinity design to support it working across priorities.
- A37: xDS Aggregate and Logical DNS Clusters
- A50: gRPC xDS Outlier Detection Support
- A55: xDS-Based Stateful Session Affinity for Proxyless gRPC
- A60: xDS-Based Stateful Session Affinity for Weighted Clusters
- A56: Priority LB policy
- A61: IPv4 and IPv6 Dualstack Backend Support
- A74: xDS Config Tears (pending)
There are two main parts to this proposal: aggregate cluster changes and stateful session affinity changes.
Instead of representing an aggregate cluster as a single cluster whose priorities come from different underlying clusters, we will instead represent an aggregate cluster as an instance of a priority LB policy (gRFC A56) where each child is a cds LB policy for the underlying cluster.
This will allow us to move the outlier_detection, xds_cluster_impl, and xds_override_host LB policies up above the priority policy that chooses the priority within the cluster. This will result in outlier detection and stateful session affinity working across priorities within a cluster, as they should.
The architecture for a non-aggregate cluster will now look like this:
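```
cds LB policy
      |
      v
outlier_detection LB policy
      |
      v
xds_cluster_impl LB policy
      |
      v
xds_override_host LB policy
      |
      v
priority LB policy
      |
      v
(one child per priority, per gRFC A56)
```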
And the architecture for an aggregate cluster will now look like this:
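```
cds LB policy (aggregate cluster)
          |
          v
  priority LB policy
     /          \
    v            v
cds LB policy   cds LB policy
(underlying     (underlying
 cluster 1)      cluster 2)
    |            |
    v            v
(each with the same subtree as a
 non-aggregate cluster, starting
 at outlier_detection)
```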
One important consequence of this change is that we will no longer be getting the LB policy config from the aggregate cluster; instead, we will use the LB policy configured in the underlying cluster. However, to avoid causing backward-compatibility problems, we will not require the aggregate cluster to be configured with the CLUSTER_PROVIDED LB policy the way that Envoy does; instead, we will simply ignore the LB policy configured in the aggregate cluster. This seems unlikely to cause problems in practice, because Envoy's aggregate cluster functionality probably doesn't work with anything except the CLUSTER_PROVIDED LB policy anyway, so the LB policy in the aggregate cluster probably doesn't provide any useful information.
The one case where this may be a problem is if there are existing control planes being used with gRPC that set the LB policy differently in the aggregate cluster vs. the underlying clusters and then expect the LB policy in the aggregate cluster to be used. In order to provide a migration story for any such cases, we will support a temporary mechanism to tell gRPC to have the underlying clusters use the LB policy configured in the aggregate cluster. This will be enabled by setting the `GRPC_XDS_AGGREGATE_CLUSTER_BACKWARD_COMPAT` environment variable to `true`. This mechanism will be supported for a couple of gRPC releases but will be removed in the long run.
In order to make SSA work across priorities, it is necessary to handle the case where an endpoint moves between priorities. For example, if an endpoint was initially in priority 0 and is selected for a given session, we want that session to continue to send traffic to that endpoint, even if the endpoint moves to priority 1 and the priority policy continues to send non-SSA traffic to priority 0.
Currently, the xds_override_host policy (see gRFCs A55 and A60) expects the subchannels to be owned by its child policy in most cases. The only case where it directly takes ownership of subchannels is if they are in EDS health state DRAINING, since it filters those addresses out of the list it passes down to the child policy and therefore does not expect the child policy to create subchannels for those addresses. We unconditionally retain the subchannels for all endpoints in state DRAINING, even if they are not actually being used for a session, on the assumption that we probably already had a connection to the endpoint, and it's not that harmful to retain that connection until the endpoint finishes draining and is removed from the EDS update, which should be a relatively short amount of time.
However, now that the xds_override_host policy is being moved up above the priority policy, the xds_override_host policy may not see subchannels from its child policy for inactive priorities, because the priority policy lazily creates its child policies (see A56). Therefore, we need to generalize handling of subchannels that are not owned by the child policy such that it works for any endpoint, not just those in state DRAINING. And we need a mechanism to retain subchannels for only those endpoints that are actually being used in a session, since we don't want to hold on to connections for endpoints in a lower priority once they are no longer needed.
To accomplish this, we will change the xds_override_host LB policy to store a timestamp in each map entry indicating the last time the address was used for a session. We will then do a periodic sweep through the map to unref any subchannel that has not been used for a session in a sufficiently long period of time, which will be configured via the `idle_timeout` field in the CDS resource.
When validating the CDS resource, we will look at the `upstream_config` field:

- If the field is not present, we assume the default value for `idle_timeout`.
- If the field is present:
  - If the field does not contain an `envoy.extensions.upstreams.http.v3.HttpProtocolOptions` protobuf, we NACK.
  - Otherwise, we look at the `common_http_protocol_options` field:
    - If the field is not present, we assume the default value for `idle_timeout`.
    - If the field is present, we look at the `idle_timeout` field:
      - If the field is unset, we assume the default value for `idle_timeout`.
      - Otherwise, as with all `google.protobuf.Duration` fields, we check the following:
        - If the `seconds` field is outside of the range [0, 315576000000], we NACK.
        - If the `nanos` field is outside of the range [0, 999999999], we NACK.
        - Otherwise, we use the specified value.
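To make this flow concrete, here is a minimal sketch in Python, in the spirit of the picker pseudo-code below. It assumes generated protobuf bindings; `parse_idle_timeout`, `NackError`, and the `http_protocol_options_cls` parameter are illustrative names, not part of any gRPC implementation:

```python
class NackError(Exception):
    """Raised to reject (NACK) an invalid CDS resource."""

MAX_DURATION_SECONDS = 315576000000  # max for google.protobuf.Duration
DEFAULT_IDLE_TIMEOUT_SECONDS = 3600  # default of 1 hour, per this design

def parse_idle_timeout(cluster, http_protocol_options_cls):
    """Returns idle_timeout in seconds for a CDS Cluster proto, or raises NackError.

    http_protocol_options_cls is assumed to be the generated class for
    envoy.extensions.upstreams.http.v3.HttpProtocolOptions.
    """
    if not cluster.HasField("upstream_config"):
        return DEFAULT_IDLE_TIMEOUT_SECONDS
    any_proto = cluster.upstream_config.typed_config
    http_options = http_protocol_options_cls()
    # google.protobuf.Any.Unpack() returns False on a type mismatch.
    if not any_proto.Unpack(http_options):
        raise NackError("upstream_config does not contain HttpProtocolOptions")
    if not http_options.HasField("common_http_protocol_options"):
        return DEFAULT_IDLE_TIMEOUT_SECONDS
    common = http_options.common_http_protocol_options
    if not common.HasField("idle_timeout"):
        return DEFAULT_IDLE_TIMEOUT_SECONDS
    duration = common.idle_timeout
    # Standard google.protobuf.Duration range checks, as listed above.
    if not 0 <= duration.seconds <= MAX_DURATION_SECONDS:
        raise NackError("idle_timeout seconds out of range")
    if not 0 <= duration.nanos <= 999999999:
        raise NackError("idle_timeout nanos out of range")
    return duration.seconds + duration.nanos / 1e9
```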
The `idle_timeout` defaults to 1 hour.
The XdsClient will return the `idle_timeout` value as a member of the parsed CDS resource.
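For illustration, the parsed resource might carry the value roughly like this (a hypothetical sketch; each implementation's internal representation will differ):

```python
from dataclasses import dataclass

@dataclass
class ParsedCdsResource:
    """Hypothetical, abbreviated parsed form of a CDS resource."""
    cluster_name: str
    # Populated from common_http_protocol_options.idle_timeout;
    # defaults to 1 hour when the field is absent.
    idle_timeout_seconds: float = 3600.0
```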
We will make the following changes in the xds_override_host policy:

- The xds_override_host policy will take ownership of any subchannel that the child policy does not own, if needed for session affinity.
- Each entry in the subchannel map will contain a `last_used_time` field, which will indicate the last time at which the entry was used for a session. Note that this field will be updated only when the xds_override_host picker sees the attribute from the stateful session HTTP filter; if that filter is not present, then SSA is not being used for the request, and we do not need to retain ownership of the subchannel.
- The xds_override_host policy will have a timer that runs periodically to do a sweep over the subchannel map and unref any owned subchannel whose `last_used_time` is older than the `idle_timeout` value from the CDS resource. The timer can initially be set for the `idle_timeout` duration, and subsequent runs can dynamically determine when to run based on the `last_used_time` of the entries in the map, with a minimum of 5 seconds between runs to minimize overhead (see the sketch after this list).
- When the child policy unrefs a subchannel it owns, if the entry's `last_used_time` is newer than the `idle_timeout`, the xds_override_host policy assumes ownership of the subchannel instead of deleting it.
- In the picker, when we iterate through the addresses in the cookie, if we do not find any entries with a subchannel in states READY, IDLE, or CONNECTING but we do find entries that do not have any subchannel, we will trigger creation of a subchannel for one of those entries.
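Here is a minimal sketch of that sweep logic, in the same Python-style spirit as the picker pseudo-code below; the entry fields and the `schedule_sweep` helper are illustrative:

```python
MIN_SWEEP_INTERVAL_SECONDS = 5  # lower bound between sweeps, per above

def run_sweep(lb_policy, now):
    """Unrefs owned subchannels whose sessions have gone idle, then
    reschedules the sweep based on the entries that remain."""
    next_run = now + lb_policy.idle_timeout
    for entry in lb_policy.address_map.values():
        # Only subchannels owned by the xds_override_host policy itself
        # (i.e., not owned by the child policy) are subject to the sweep.
        if entry.subchannel is None or not entry.owned_by_us:
            continue
        expires_at = entry.last_used_time + lb_policy.idle_timeout
        if expires_at <= now:
            entry.subchannel = None  # drop our ref to the subchannel
        else:
            next_run = min(next_run, expires_at)
    # Never schedule runs closer together than the 5-second minimum.
    lb_policy.schedule_sweep(max(next_run, now + MIN_SWEEP_INTERVAL_SECONDS))
```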
The picker logic will now look like this (pseudo-code):
```
def Pick(pick_args):
  override_host_attribute = pick_args.call_attributes.get(attribute_key)
  if override_host_attribute is not None:
    entry_with_no_subchannel = None
    idle_subchannel = None
    found_connecting = False
    for address in override_host_attribute.cookie_address_list:
      entry = lb_policy.address_map[address]
      if (entry found AND
          entry.health_status is in policy_config.override_host_status):
        if entry.subchannel is set:
          if entry.subchannel.connectivity_state == READY:
            override_host_attribute.set_actual_address_list(entry.address_list)
            return entry.subchannel as pick result
          elif entry.subchannel.connectivity_state == IDLE:
            if idle_subchannel is None:
              idle_subchannel = entry.subchannel
          elif entry.subchannel.connectivity_state == CONNECTING:
            found_connecting = True
        elif entry_with_no_subchannel is None:
          entry_with_no_subchannel = entry
    # No READY subchannel found.  If we found an IDLE subchannel,
    # trigger a connection attempt and queue the pick until that attempt
    # completes.
    if idle_subchannel is not None:
      hop into control plane to trigger connection attempt for idle_subchannel
      return queue as pick result
    # No READY or IDLE subchannels.  If we found a CONNECTING
    # subchannel, queue the pick and wait for the connection attempt
    # to complete.
    if found_connecting:
      return queue as pick result
    # No READY, IDLE, or CONNECTING subchannels.  If we found an entry
    # with no subchannel, queue the pick and create a subchannel for
    # that entry.
    if entry_with_no_subchannel is not None:
      hop into control plane to create subchannel for entry_with_no_subchannel
      return queue as pick result
  # override_host_attribute not set or did not find a matching subchannel,
  # so delegate to the child picker.
  result = child_picker.Pick(pick_args)
  if result.type == PICK_COMPLETE and override_host_attribute is not None:
    entry = lb_policy.address_map[result.subchannel.address()]
    if entry found:
      override_host_attribute.set_actual_address_list(entry.address_list)
  return result
```
This design makes significant structural changes to the LB policy tree, which makes it infeasible to guard the new code-paths with an environment variable until they prove stable, as we would normally do.
We considered implementing nested aggregate clusters as nested cds LB policy instances, each with their own underlying priority policy. However, since the XdsDependencyManager already needs to resolve the aggregate cluster dependency graph to make sure the entire config is loaded, it made sense to continue to pass down the flattened list and deal with it all in a single layer, just as we were doing prior to this design. Also, having nested priority policies would have caused problems with the failover timers, where the timer in the upper priority policy would fire before the lower priority policy has had a chance to try all of its children.
C-core implementation:
- read connection idle_timeout field from CDS resource: grpc/grpc#35395
- change subchannel management in xds_override_host LB policy: grpc/grpc#35397
- aggregate cluster architecture change: grpc/grpc#35313
The aggregate cluster changes will be implemented in Java, Go, and Node in the future. The stateful session affinity changes will be implemented in those languages if/when we need to support stateful session affinity in those languages.
N/A