
Per-tenant replication (sharding) factor for index gateway #9574

Closed
chaudum wants to merge 3 commits from the chaudum/index-gw-per-tenant-rf branch

Conversation

chaudum
Contributor

@chaudum chaudum commented May 31, 2023

What this PR does / why we need it:

This PR introduces a per-tenant sharding (replication) factor for the index gateway.

At the moment we use a high replication factor in the index gateway ring so that load from large tenants can be distributed across a larger subset of index gateways. However, since the replication factor is a global setting in the gateway ring, small tenants also get a high replication factor, which causes a lot of data to be pre-fetched when the index gateways start up.

This PR adds a per-tenant runtime setting that specifies a "sharding" factor between 0 and 1, which maps the per-tenant RF into the range from the ring RF up to the total number of instances.

Example:

ring.RF = 3
gateway instances = 15
sharding factor = 0.75
=> 3 + (15 - 3) * 0.75 = 12
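
A minimal sketch of this computation (hypothetical helper names, not the vendored dskit code):

```go
package main

import "fmt"

// effectiveReplicationFactor interpolates between the ring's replication factor
// and the total number of gateway instances, using a per-tenant sharding factor
// in the range [0, 1]. Illustrative only; the vendored dskit change may differ.
func effectiveReplicationFactor(ringRF, instances int, factor float64) int {
	if factor <= 0 || instances <= ringRF {
		return ringRF
	}
	if factor >= 1 {
		return instances
	}
	return ringRF + int(float64(instances-ringRF)*factor)
}

func main() {
	// Example from above: ring RF 3, 15 instances, factor 0.75 => 12.
	fmt.Println(effectiveReplicationFactor(3, 15, 0.75))
}
```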

Using a factor instead of a fixed replication factor makes it easier to scale index gateways horizontally, because the setting does not need to be adjusted when more instances are added to the ring.

Special notes for your reviewer:

This PR vendors grafana/dskit#304 which has to be merged first.

Checklist

  • Reviewed the CONTRIBUTING.md guide (required)
  • Documentation added
  • Tests updated
  • CHANGELOG.md updated
  • Changes that require user attention or interaction to upgrade are documented in docs/sources/upgrading/_index.md
  • For Helm chart changes bump the Helm chart version in production/helm/loki/Chart.yaml and update production/helm/loki/CHANGELOG.md and production/helm/loki/README.md. Example PR

@github-actions github-actions bot added the type/docs label May 31, 2023
@chaudum chaudum force-pushed the chaudum/index-gw-per-tenant-rf branch from 0cec265 to 3e500d9 Compare May 31, 2023 12:20
@chaudum chaudum marked this pull request as ready for review May 31, 2023 12:23
@chaudum chaudum requested review from JStickler and a team as code owners May 31, 2023 12:23
@owen-d
Member

owen-d commented May 31, 2023

This has a fundamental issue: it links the size of a tenant to the size of the cluster. For instance, if tenant a has a factor of 50% in a cluster with 10 instances, it gets 5 instances to run its queries against. However, if another large tenant b joins that cell, growing it from 10 to 100 instances, tenant a now has access to 50 instances. That looks unnecessary, so we reduce its factor to 5% to get back to the original target (0.05 * 100 = 5). Then tenant b winds down its usage and we scale the cell back down to 10, which limits tenant a to 0 or 1 node, well below its needs.

Ideally, we'd link the size of a tenant to the resources it needs, e.g. 10 QPS = 1 instance, 100 QPS = 10 instances, and so on (not a function of the cell's size). That way, we can scale the size of a tenant independently from the size of the cluster the tenant is part of.
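
A sketch of what such a resource-based rule could look like (the 10 QPS-per-instance budget is a made-up number for illustration, not an actual Loki setting):

```go
package main

import (
	"fmt"
	"math"
)

// instancesForQPS derives a target number of index gateway instances from a
// tenant's query rate, independent of how many instances the cell has overall.
// The per-instance budget is purely illustrative.
func instancesForQPS(qps float64) int {
	const qpsPerInstance = 10.0
	n := int(math.Ceil(qps / qpsPerInstance))
	if n < 1 {
		n = 1
	}
	return n
}

func main() {
	fmt.Println(instancesForQPS(10))  // 1 instance
	fmt.Println(instancesForQPS(100)) // 10 instances
}
```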

Pulls changes from grafana/dskit#304

Signed-off-by: Christian Haudum <[email protected]>
Signed-off-by: Christian Haudum <[email protected]>
@chaudum chaudum force-pushed the chaudum/index-gw-per-tenant-rf branch from 0de26f7 to 7187350 Compare June 1, 2023 06:53
@chaudum
Contributor Author

chaudum commented Jun 1, 2023

Ideally, we'd link the size of a tenant to the resources it needs, e.g. 10 QPS = 1 instance, 100 QPS = 10 instances, and so on (not a function of the cell's size). That way, we can scale the size of a tenant independently from the size of the cluster the tenant is part of.

Since the replication factor is used not only on the client side to determine request targets, but also on the server side to determine whether a tenant belongs to an instance (and therefore whether to pre-fetch its data), we would need to communicate the QPS (or another metric) between clients and servers.

The flow could look like this:

  • Each client tracks the request rate for each tenant.
  • The client sends the gathered metric to the "leader" instance of the ring (determined by a specific key).
  • The leader instance aggregates the request rate from all clients by tenant.
  • The leader responds to the client with the summed rate of all clients.
  • The client uses the response data to determine the RF for each tenant.
  • Server instances periodically (and on startup) request the aggregated rate from the leader instance to determine the RF for each tenant and therefore whether the instance "owns" the tenant.
  • While the leader instance is the source of truth for the rate metric, all instances keep a local copy of the last response for the case of a leader change.

Alternatively, instead of the clients tracking the rate for each tenant, the server instances could track it; they would then report to the leader, and the clients would only fetch the aggregated data.
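
A rough sketch of the aggregation side of such a flow (hypothetical types; the actual ring membership, leader election, and transport are omitted):

```go
package main

import (
	"fmt"
	"sync"
)

// rateLeader is the ring member that collects per-tenant request rates reported
// by the clients and answers with the per-tenant sum across all clients.
// Followers would keep a local copy of the last aggregate for a leader change.
type rateLeader struct {
	mu       sync.Mutex
	byClient map[string]map[string]float64 // client ID -> tenant ID -> QPS
}

// report stores one client's latest per-tenant rates and returns the aggregated
// rate per tenant, summed over all reporting clients.
func (l *rateLeader) report(clientID string, rates map[string]float64) map[string]float64 {
	l.mu.Lock()
	defer l.mu.Unlock()

	if l.byClient == nil {
		l.byClient = make(map[string]map[string]float64)
	}
	l.byClient[clientID] = rates

	agg := make(map[string]float64)
	for _, tenants := range l.byClient {
		for tenant, qps := range tenants {
			agg[tenant] += qps
		}
	}
	return agg
}

func main() {
	l := &rateLeader{}
	l.report("client-a", map[string]float64{"tenant-1": 40})
	agg := l.report("client-b", map[string]float64{"tenant-1": 60})
	fmt.Println(agg["tenant-1"]) // 100; clients and servers would derive the per-tenant RF from this
}
```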

@chaudum chaudum marked this pull request as draft June 1, 2023 08:02
@chaudum
Contributor Author

chaudum commented Jun 2, 2023

Phlare and Mimir do shuffle sharding in their store gateways, based on user and block:
https://github.com/grafana/phlare/blob/main/pkg/storegateway/block_filter.go#L167-L177

There is a per-tenant setting for the shard size. This is similar to the implementation in this PR, but with an absolute number, rather than a factor.
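
For contrast, a toy sketch of selecting a fixed-size per-tenant subset of instances (a simplified stand-in for ring-based shuffle sharding, not the Phlare/Mimir code):

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// tenantShard deterministically picks shardSize instances for a tenant by
// ranking instances with a tenant-seeded hash. Real shuffle sharding in
// grafana/dskit works on ring tokens and zones; this only illustrates the
// idea of an absolute per-tenant shard size.
func tenantShard(tenant string, instances []string, shardSize int) []string {
	if shardSize <= 0 || shardSize >= len(instances) {
		return instances
	}
	type scoredInstance struct {
		addr  string
		score uint32
	}
	scored := make([]scoredInstance, 0, len(instances))
	for _, addr := range instances {
		h := fnv.New32a()
		h.Write([]byte(tenant + "/" + addr))
		scored = append(scored, scoredInstance{addr: addr, score: h.Sum32()})
	}
	sort.Slice(scored, func(i, j int) bool { return scored[i].score < scored[j].score })

	shard := make([]string, 0, shardSize)
	for _, s := range scored[:shardSize] {
		shard = append(shard, s.addr)
	}
	return shard
}

func main() {
	instances := []string{"gw-1", "gw-2", "gw-3", "gw-4", "gw-5"}
	fmt.Println(tenantShard("tenant-a", instances, 2))
}
```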

@chaudum
Contributor Author

chaudum commented Jun 20, 2023

Superseded by #9710

@chaudum chaudum closed this Jun 20, 2023