Mimir Read Latency Errors (MimirCacheRequestErrors & MimirRequestLatency) #7687
Replies: 6 comments
-
Internally at Grafana, we use
To me this seems to indicate that timeout is too short and so you're continuing to hit it at the p99 and will until you set it to something larger than however long the operations are taking. I notice that you have TLS enabled for the cache connections. The default values are picked with plaintext connections in mind, assuming that creating a new connection is basically "free". With TLS, you'll likely need to:
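For illustration only, here is a minimal sketch of the memcached client options involved, using the `blocks_storage.bucket_store` layout that appears later in this thread. The option names exist in recent Mimir releases, but the values are placeholders rather than recommendations:

```yaml
# Hedged sketch, not a recommendation: with TLS, connection setup is more
# expensive, so the connect timeout and the idle connection pool usually need
# to be more generous than the plaintext-oriented defaults.
blocks_storage:
  bucket_store:
    chunks_cache:
      memcached:
        timeout: 750ms              # per-operation timeout (placeholder)
        connect_timeout: 1s         # TLS handshakes take longer than plaintext (placeholder)
        max_idle_connections: 150   # reuse connections instead of re-dialing (placeholder)
```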
-
Much appreciated for the response @56quarters! I made the following adjustments based on your suggestions (starting at 12:15 in the graphs below):

```yaml
blocks_storage:
  bucket_store:
    chunks_cache:
      memcached:
        timeout: 750ms
        connect_timeout: 1s
    index_cache:
      memcached:
        timeout: 750ms
        connect_timeout: 1s
```

This seems to have helped a bit, more so for the index-cache. We don't currently have a CPU limit set on our chunks-cache; you can see from the graphs that some pods use upward of 2-3 CPUs under heavy load. Still, MimirCacheRequestErrors is triggering often, predominantly for getmulti operations. The getmulti p99 still seems to be hitting an upper bound, perhaps indicating the timeout needs to be increased further to accommodate the longer read I/O operations?
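Since the getmulti p99 appears to be hitting a ceiling, the batching-related client options may also be worth a look. A hedged sketch (option names as exposed by recent Mimir releases, values are placeholders, shown for the chunks cache only):

```yaml
blocks_storage:
  bucket_store:
    chunks_cache:
      memcached:
        # Split very large multi-key reads into smaller batches so one slow
        # batch does not push the whole operation past the timeout (placeholder).
        max_get_multi_batch_size: 100
        # Upper bound on concurrent getmulti batches per client (placeholder).
        max_get_multi_concurrency: 100
```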
In addition to the timeout configs, increasing the store-gateway CPU resource request from 1 to 2 (starting at 15:00 in the graphs below) doesn't seem to have had much impact. With or without the additional CPU, the store-gateway logs frequently display the following:
What are the possible implications of increasing
On occasion I do see MimirSchedulerQueriesStuck getting triggered, an issue I'm assuming we can easily solve by scaling up our querier replicas?
-
Hi @56quarters, I closed this issue out by accident, but I still haven't come to a resolution. Any chance you might be able to take a look at my inquiries in bold above? Thanks!
-
That'd be my guess. To troubleshoot this I'd keep adjusting the read timeout and connection timeout up until almost all requests succeed. Then we can adjust it back to something reasonable based on looking at how long requests and connections take at steady-state.
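One way to watch that steady-state behaviour is to record the per-operation latency quantiles. A hedged sketch of a Prometheus recording rule; the underlying metric name varies by Mimir version, so adjust it to whatever your pods actually expose:

```yaml
groups:
  - name: mimir-cache-latency
    rules:
      # p99 cache operation latency per cache and operation. Assumes the
      # thanos_memcached_operation_duration_seconds histogram, which may be
      # named thanos_cache_operation_duration_seconds in newer releases.
      - record: cache:operation_duration_seconds:p99
        expr: |
          histogram_quantile(0.99,
            sum by (name, operation, le) (
              rate(thanos_memcached_operation_duration_seconds_bucket[5m])
            )
          )
```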
I'd leave this alone for now because it's a symptom of things being slow. It shouldn't be required once we've got things working more reliably.
Another symptom of "things are slow". I'd leave this for now until we get caching sorted out.
-
Thanks once again for your feedback @56quarters. After troubleshooting a bit I landed on the following config changes:

```yaml
chunks_cache:
  memcached:
    timeout: 7s
    connect_timeout: 4s
index_cache:
  memcached:
    timeout: 7s
    connect_timeout: 4s
```

With these configs the MimirCacheRequestErrors have all but disappeared from our index/chunks caches, and the p99 plateau we were seeing before no longer appears in the graphs. However, it's not clear to me whether these values are unreasonable, or whether setting them so high will negatively affect overall query latency. Also, despite having resolved the index/chunks cache timeouts, I'm still seeing the following error in store-gateways pretty frequently:
The store-gateway, as well as all other components on the read path, seems to be just fine in terms of CPU & memory resources, but it still seems like Mimir is struggling to handle our longer range queries (e.g. 7d, 14d, 30d). During these longer-range queries we see the "the async queue is full" error, and the MimirRequestLatency & MimirSchedulerQueriesStuck alerts transition into the pending state. Any insights you might be able to offer on this would be much appreciated @56quarters!
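For reference, my assumption is that this error comes from the memcached client's asynchronous write buffer (used for cache sets); if so, I believe these are the client options that control it in recent Mimir versions. A hedged sketch with placeholder values:

```yaml
blocks_storage:
  bucket_store:
    chunks_cache:
      memcached:
        # Number of workers draining the async write queue (placeholder).
        max_async_concurrency: 50
        # Size of the async write queue; sets start failing with a
        # "queue is full"-style error once this fills up (placeholder).
        max_async_buffer_size: 25000
```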
-
Hi @56quarters, wondering if you might have any feedback on the above? ^
-
Describe the bug
I'm trying to understand how best to improve our Mimir read request latency and have been unable to make much progress on resolving the following error alerts for long range queries (ex. 30 days).
Details below (any additional details available upon request):
MimirCacheRequestErrors
The cache index-cache used by Mimir grafana-mimir/devops-prod is experiencing 7.79% errors for getmulti operation.
The cache chunks-cache used by Mimir grafana-mimir is experiencing 97.13% errors for getmulti operation.
The cache chunks-cache used by Mimir grafana-mimir is experiencing 66.55% errors for set operation.
Scaling the memcached cluster doesn't seem to resolve the timeouts. The timeouts are reduced if I increase the store-gateway's memcached client timeout; however, memcached get/set latency seems to scale proportionally with this configuration, so adjusting it to something larger than 450ms seems unreasonable.
MimirRequestLatency
querier prometheus_api_v1_query_range is experiencing 24.69s 99th percentile latency.
store-gateway /gatewaypb.StoreGateway/Series is experiencing 46.06s 99th percentile latency.
It's unclear to me whether fixing the memcached timeouts will help resolve this alert. I can sometimes see the Queue length panel suggesting we have queries waiting in the queue due to busy queriers, and have considered scaling the queriers up, but I want to be sure this doesn't cause further issues downstream (e.g. higher load on store-gateways, ingesters, and/or the query-frontend).
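For context, the change I'm considering would look roughly like this in our Helm values. A hedged sketch against the mimir-distributed chart; the replica count and resources are placeholders:

```yaml
querier:
  # Scale out queriers so fewer queries sit in the scheduler queue (placeholder count).
  replicas: 6
  resources:
    requests:
      cpu: "1"
      memory: 4Gi
```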
To Reproduce
Steps to reproduce the behavior:
Expected behavior
I expect the MimirCacheRequestErrors & MimirRequestLatency not to trigger.
Environment
Additional Context
values.yaml
mimir.yaml