
Per-tenant replication (sharding) factor for index gateway #9574

Closed
chaudum wants to merge 3 commits from the chaudum/index-gw-per-tenant-rf branch

Conversation

chaudum
Contributor

@chaudum chaudum commented May 31, 2023

What this PR does / why we need it:

This PR introduces a per-tenant sharding (replication) factor for the index gateway.

At the moment we use a high replication factor in the index gateway ring so that load from large tenants can be distributed across a larger subset of index gateways. However, since the replication factor is a global setting in the gateway ring, small tenants also get a high replication factor, which causes a lot of data to be pre-fetched when the index gateways start up.

This PR adds a per-tenant runtime setting that specifies a "sharding" factor between 0 and 1, which maps the per-tenant RF into the range from the ring RF up to the total number of instances.

Example:

ring.RF = 3
gateway instances = 15
sharding factor = 0.75
=> 3 + (15 - 3) * 0.75 = 12
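
A minimal sketch of this computation (hypothetical helper names, not the vendored dskit code):

```go
package main

import "fmt"

// effectiveReplicationFactor interpolates between the ring's replication factor
// and the total number of gateway instances, using a per-tenant sharding factor
// in the range [0, 1]. Illustrative only; the vendored dskit change may differ.
func effectiveReplicationFactor(ringRF, instances int, factor float64) int {
	if factor <= 0 || instances <= ringRF {
		return ringRF
	}
	if factor >= 1 {
		return instances
	}
	return ringRF + int(float64(instances-ringRF)*factor)
}

func main() {
	// Example from above: ring RF 3, 15 instances, factor 0.75 => 12.
	fmt.Println(effectiveReplicationFactor(3, 15, 0.75))
}
```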

Using a factor instead of a fixed replication factor makes it easier to scale index gateways horizontally, because the setting does not need to be adjusted when more instances are added to the ring.

Special notes for your reviewer:

This PR vendors grafana/dskit#304 which has to be merged first.

Checklist

  • Reviewed the CONTRIBUTING.md guide (required)
  • Documentation added
  • Tests updated
  • CHANGELOG.md updated
  • Changes that require user attention or interaction to upgrade are documented in docs/sources/upgrading/_index.md
  • For Helm chart changes bump the Helm chart version in production/helm/loki/Chart.yaml and update production/helm/loki/CHANGELOG.md and production/helm/loki/README.md. Example PR

@github-actions github-actions bot added the type/docs label May 31, 2023
@chaudum chaudum force-pushed the chaudum/index-gw-per-tenant-rf branch from 0cec265 to 3e500d9 Compare May 31, 2023 12:20
@chaudum chaudum marked this pull request as ready for review May 31, 2023 12:23
@chaudum chaudum requested review from JStickler and a team as code owners May 31, 2023 12:23
@owen-d
Member

owen-d commented May 31, 2023

This has a fundamental issue: it links the size of a tenant to the size of the cluster. For instance, if tenant a has a factor of 50% in a cluster with 10 instances, it gets 5 instances to run its queries against. However, if another large tenant b joins that cell, growing it from 10 to 100 instances, tenant a now has access to 50 instances. That looks unnecessary, so we reduce its factor to 5% to get back to the original target (0.05 * 100 = 5). Then tenant b winds down its usage and we scale the cell back down to 10, which limits tenant a to 0 or 1 node, well below its needs.

Ideally, we'd link the size of a tenant to the resources it needs, e.g. 10 QPS = 1 instance, 100 QPS = 10 instances, and so on (not a function of the cell's size). That way, we can scale the size of a tenant independently from the size of the cluster the tenant is part of.
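
A sketch of what such a resource-based rule could look like (the 10 QPS-per-instance budget is a made-up number for illustration, not an actual Loki setting):

```go
package main

import (
	"fmt"
	"math"
)

// instancesForQPS derives a target number of index gateway instances from a
// tenant's query rate, independent of how many instances the cell has overall.
// The per-instance budget is purely illustrative.
func instancesForQPS(qps float64) int {
	const qpsPerInstance = 10.0
	n := int(math.Ceil(qps / qpsPerInstance))
	if n < 1 {
		n = 1
	}
	return n
}

func main() {
	fmt.Println(instancesForQPS(10))  // 1 instance
	fmt.Println(instancesForQPS(100)) // 10 instances
}
```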

Pulls changes from grafana/dskit#304

Signed-off-by: Christian Haudum <[email protected]>
Signed-off-by: Christian Haudum <[email protected]>
@chaudum chaudum force-pushed the chaudum/index-gw-per-tenant-rf branch from 0de26f7 to 7187350 Compare June 1, 2023 06:53
@chaudum
Contributor Author

chaudum commented Jun 1, 2023

Ideally, we'd link the size of a tenant to the resources it needs, e.g. 10 QPS = 1 instance, 100 QPS = 10 instances, and so on (not a function of the cell's size). That way, we can scale the size of a tenant independently from the size of the cluster the tenant is part of.

Since the replication factor is used not only on the client side to determine request targets, but also on the server side to determine whether a tenant belongs to an instance (and therefore whether to pre-fetch its data), we would need to communicate the QPS (or another metric) between clients and servers.

The flow could look like this:

  • Each client tracks the request rate for each tenant.
  • The client sends the gathered metric to the "leader" instance of the ring (determined by a specific key).
  • The leader instance aggregates the request rate from all clients by tenant.
  • The leader responds to the client with the summed rate of all clients.
  • The client uses the response data to determine the RF for each tenant.
  • Server instances periodically (and on startup) request the aggregated rate from the leader instance to determine the RF for each tenant and therefore whether the instance "owns" the tenant.
  • While the leader instance is the source of truth for the rate metric, all instances keep a local copy of the last response for the case of a leader change.

Alternatively, instead of the clients tracking the rate for each tenant, the server instances could track it; they would then report to the leader, and the clients would only fetch the aggregated data.
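
A rough sketch of the aggregation side of such a flow (hypothetical types; the actual ring membership, leader election, and transport are omitted):

```go
package main

import (
	"fmt"
	"sync"
)

// rateLeader is the ring member that collects per-tenant request rates reported
// by the clients and answers with the per-tenant sum across all clients.
// Followers would keep a local copy of the last aggregate for a leader change.
type rateLeader struct {
	mu       sync.Mutex
	byClient map[string]map[string]float64 // client ID -> tenant ID -> QPS
}

// report stores one client's latest per-tenant rates and returns the aggregated
// rate per tenant, summed over all reporting clients.
func (l *rateLeader) report(clientID string, rates map[string]float64) map[string]float64 {
	l.mu.Lock()
	defer l.mu.Unlock()

	if l.byClient == nil {
		l.byClient = make(map[string]map[string]float64)
	}
	l.byClient[clientID] = rates

	agg := make(map[string]float64)
	for _, tenants := range l.byClient {
		for tenant, qps := range tenants {
			agg[tenant] += qps
		}
	}
	return agg
}

func main() {
	l := &rateLeader{}
	l.report("client-a", map[string]float64{"tenant-1": 40})
	agg := l.report("client-b", map[string]float64{"tenant-1": 60})
	fmt.Println(agg["tenant-1"]) // 100; clients and servers would derive the per-tenant RF from this
}
```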

@chaudum chaudum marked this pull request as draft June 1, 2023 08:02
@chaudum
Contributor Author

chaudum commented Jun 2, 2023

Phlare and Mimir do shuffle sharding in their store gateways, based on user and block:
https://github.com/grafana/phlare/blob/main/pkg/storegateway/block_filter.go#L167-L177

There is a per-tenant setting for the shard size. This is similar to the implementation in this PR, but with an absolute number, rather than a factor.
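
For contrast, a toy sketch of selecting a fixed-size per-tenant subset of instances (a simplified stand-in for ring-based shuffle sharding, not the Phlare/Mimir code):

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// tenantShard deterministically picks shardSize instances for a tenant by
// ranking instances with a tenant-seeded hash. Real shuffle sharding in
// grafana/dskit works on ring tokens and zones; this only illustrates the
// idea of an absolute per-tenant shard size.
func tenantShard(tenant string, instances []string, shardSize int) []string {
	if shardSize <= 0 || shardSize >= len(instances) {
		return instances
	}
	type scoredInstance struct {
		addr  string
		score uint32
	}
	scored := make([]scoredInstance, 0, len(instances))
	for _, addr := range instances {
		h := fnv.New32a()
		h.Write([]byte(tenant + "/" + addr))
		scored = append(scored, scoredInstance{addr: addr, score: h.Sum32()})
	}
	sort.Slice(scored, func(i, j int) bool { return scored[i].score < scored[j].score })

	shard := make([]string, 0, shardSize)
	for _, s := range scored[:shardSize] {
		shard = append(shard, s.addr)
	}
	return shard
}

func main() {
	instances := []string{"gw-1", "gw-2", "gw-3", "gw-4", "gw-5"}
	fmt.Println(tenantShard("tenant-a", instances, 2))
}
```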

@chaudum
Contributor Author

chaudum commented Jun 20, 2023

Superseded by #9710

@chaudum chaudum closed this Jun 20, 2023