Mimir Read Latency Errors (MimirCacheRequestErrors & MimirRequestLatency) #7687
Replies: 6 comments
-
Internally at Grafana, we use
To me this seems to indicate that timeout is too short and so you're continuing to hit it at the p99 and will until you set it to something larger than however long the operations are taking. I notice that you have TLS enabled for the cache connections. The default values are picked with plaintext connections in mind, assuming that creating a new connection is basically "free". With TLS, you'll likely need to:
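For illustration only, here is a minimal sketch of the memcached client options involved, using the `blocks_storage.bucket_store` layout that appears later in this thread. The option names exist in recent Mimir releases, but the values are placeholders rather than recommendations:

```yaml
# Hedged sketch, not a recommendation: with TLS, connection setup is more
# expensive, so the connect timeout and the idle connection pool usually need
# to be more generous than the plaintext-oriented defaults.
blocks_storage:
  bucket_store:
    chunks_cache:
      memcached:
        timeout: 750ms              # per-operation timeout (placeholder)
        connect_timeout: 1s         # TLS handshakes take longer than plaintext (placeholder)
        max_idle_connections: 150   # reuse connections instead of re-dialing (placeholder)
```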
-
Much appreciated for the response @56quarters! I made the following adjustments based on your suggestions (starting at 12:15 in the graphs below):

```yaml
blocks_storage:
  bucket_store:
    chunks_cache:
      memcached:
        timeout: 750ms
        connect_timeout: 1s
    index_cache:
      memcached:
        timeout: 750ms
        connect_timeout: 1s
```

This seems to have helped a bit, more so for the index-cache. We don't currently have a CPU limit set on our chunks-cache; you can see from the graphs that some pods use upward of 2-3 CPUs under heavy load. Still, MimirCacheRequestErrors is triggering often, predominantly for getmulti operations. The getmulti p99 still seems to be hitting an upper bound, perhaps indicating the timeout needs to be increased further to accommodate the longer read I/O operations?
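Since the getmulti p99 appears to be hitting a ceiling, the batching-related client options may also be worth a look. A hedged sketch (option names as exposed by recent Mimir releases, values are placeholders, shown for the chunks cache only):

```yaml
blocks_storage:
  bucket_store:
    chunks_cache:
      memcached:
        # Split very large multi-key reads into smaller batches so one slow
        # batch does not push the whole operation past the timeout (placeholder).
        max_get_multi_batch_size: 100
        # Upper bound on concurrent getmulti batches per client (placeholder).
        max_get_multi_concurrency: 100
```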
In addition to the timeout configs, increasing the store-gateway CPU resource request from 1 to 2 (starting at 15:00 in the graphs below) doesn't seem to have had much impact. With or without the additional CPU, the store-gateway logs frequently display the following:
What are the possible implications of increasing
On occasion I do see MimirSchedulerQueriesStuck getting triggered, an issue I'm assuming we can easily solve by scaling up our querier replicas?
-
Hi @56quarters, I closed this issue out by accident, but I still haven't come to a resolution. Any chance you might be able to take a look at my inquiries in bold above? Thanks!
-
That'd be my guess. To troubleshoot this I'd keep adjusting the read timeout and connection timeout up until almost all requests succeed. Then we can adjust it back to something reasonable based on looking at how long requests and connections take at steady-state.
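One way to watch that steady-state behaviour is to record the per-operation latency quantiles. A hedged sketch of a Prometheus recording rule; the underlying metric name varies by Mimir version, so adjust it to whatever your pods actually expose:

```yaml
groups:
  - name: mimir-cache-latency
    rules:
      # p99 cache operation latency per cache and operation. Assumes the
      # thanos_memcached_operation_duration_seconds histogram, which may be
      # named thanos_cache_operation_duration_seconds in newer releases.
      - record: cache:operation_duration_seconds:p99
        expr: |
          histogram_quantile(0.99,
            sum by (name, operation, le) (
              rate(thanos_memcached_operation_duration_seconds_bucket[5m])
            )
          )
```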
I'd leave this alone for now because it's a symptom of things being slow. It shouldn't be required once we've got things working more reliably.
Another symptom of "things are slow". I'd leave this for now until we get caching sorted out.
-
Thanks once again for your feedback @56quarters. After troubleshooting a bit I landed on the following config changes:

```yaml
chunks_cache:
  memcached:
    timeout: 7s
    connect_timeout: 4s
index_cache:
  memcached:
    timeout: 7s
    connect_timeout: 4s
```

With these configs the MimirCacheRequestErrors have all but disappeared from our index/chunks caches, and the p99 plateau we were seeing before no longer appears in the graphs. However, it's not clear to me whether these values are unreasonable, or whether setting them so high will negatively affect overall query latency. Also, despite having resolved the index/chunks cache timeouts, I'm still seeing the following error in store-gateways pretty frequently:
The store-gateway, as well as all other components on the read path, seems to be just fine in terms of CPU & memory resources, but it still seems like Mimir is struggling to handle our longer range queries (e.g. 7d, 14d, 30d). During these longer-range queries we see the "the async queue is full" error, and the MimirRequestLatency & MimirSchedulerQueriesStuck alerts transition into the pending state. Any insights you might be able to offer on this would be much appreciated @56quarters!
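For reference, my assumption is that this error comes from the memcached client's asynchronous write buffer (used for cache sets); if so, I believe these are the client options that control it in recent Mimir versions. A hedged sketch with placeholder values:

```yaml
blocks_storage:
  bucket_store:
    chunks_cache:
      memcached:
        # Number of workers draining the async write queue (placeholder).
        max_async_concurrency: 50
        # Size of the async write queue; sets start failing with a
        # "queue is full"-style error once this fills up (placeholder).
        max_async_buffer_size: 25000
```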
-
Hi @56quarters, wondering if you might have any feedback on the above? ^
-
Describe the bug
I'm trying to understand how best to improve our Mimir read request latency and have been unable to make much progress on resolving the following error alerts for long range queries (ex. 30 days).
Details below (any additional details available upon request):
MimirCacheRequestErrors
The cache index-cache used by Mimir grafana-mimir/devops-prod is experiencing 7.79% errors for getmulti operation.
The cache chunks-cache used by Mimir grafana-mimir is experiencing 97.13% errors for getmulti operation.
The cache chunks-cache used by Mimir grafana-mimir is experiencing 66.55% errors for set operation.
Scaling the memcached cluster doesn't seem to resolve the timeouts. The timeouts are reduced if I increase the store-gateway's memcached client timeout; however, memcached get/set latency seems to scale proportionally with this configuration, so adjusting it to something larger than 450ms seems unreasonable.
MimirRequestLatency
querier prometheus_api_v1_query_range is experiencing 24.69s 99th percentile latency.
store-gateway /gatewaypb.StoreGateway/Series is experiencing 46.06s 99th percentile latency.
It's unclear to me whether fixing the memcached timeouts will help resolve this alert. I can sometimes see the Queue length panel suggesting we have queries waiting in the queue due to busy queriers, and have considered scaling the queriers up, but I want to be sure this doesn't cause further issues downstream (e.g. higher load on store-gateways, ingesters, and/or the query-frontend).
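For context, the change I'm considering would look roughly like this in our Helm values. A hedged sketch against the mimir-distributed chart; the replica count and resources are placeholders:

```yaml
querier:
  # Scale out queriers so fewer queries sit in the scheduler queue (placeholder count).
  replicas: 6
  resources:
    requests:
      cpu: "1"
      memory: 4Gi
```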
To Reproduce
Steps to reproduce the behavior:
Expected behavior
I expect the MimirCacheRequestErrors & MimirRequestLatency not to trigger.
Environment
Additional Context
values.yaml
mimir.yaml