Large Cluster State Renders Cluster Unresponsive #5446

Open
esatterwhite opened this issue Sep 24, 2024 · 1 comment
esatterwhite commented Sep 24, 2024

Describe the bug
Crashing nodes accumulate in the cluster state as dead nodes for a long while. As the state grows, it takes longer and more memory to synchronize across the cluster, particularly when a new node is added (or restarts). We have seen initial sync times upwards of 20 seconds when the cluster is under stress and the cluster state is in the 5-10 MB range. In many cases this causes kubernetes to kill the pod, triggering another restart, which adds yet another dead node plus a new node to the state.

This constant start-up loop also puts a lot of pressure on the control plane, which can become slow to respond, at which point the cluster becomes mostly unresponsive.
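To make the feedback loop concrete, here is a rough, purely illustrative model (this is not Quickwit code; the per-node entry size, sync throughput, and liveness timeout are assumed numbers, and only the shape of the growth matters):

// Toy model of the loop described above: every crash round leaves dead entries behind,
// the gossip state grows, initial sync slows down, and once sync takes longer than the
// liveness timeout the pod gets killed again, adding yet more dead entries.
fn main() {
    let bytes_per_node_entry: f64 = 10_000.0; // assumed ~10 KB of gossip state per node entry
    let sync_bytes_per_sec: f64 = 500_000.0;  // assumed effective sync throughput under stress
    let liveness_timeout_secs: f64 = 20.0;    // assumed threshold before kubernetes kills the pod

    let live_nodes: u64 = 34; // 3 metastore + 30 indexers + 1 control plane
    let indexers: u64 = 30;   // indexers crash-looping together while postgres is down
    let mut dead_nodes: u64 = 0;

    for round in 1..=40u32 {
        let state_bytes = (live_nodes + dead_nodes) as f64 * bytes_per_node_entry;
        let initial_sync_secs = state_bytes / sync_bytes_per_sec;
        println!(
            "round {round:>2}: dead={dead_nodes:>4} state={:.2} MB initial sync={:.1}s",
            state_bytes / 1e6,
            initial_sync_secs,
        );
        if initial_sync_secs > liveness_timeout_secs {
            println!("         sync now exceeds the liveness timeout: pods get killed again");
        }
        dead_nodes += indexers; // each crash round leaves the old node ids behind as dead entries
    }
}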

We have seen this problem in situations where the database server (postgres) is unavailable or unresponsive for a period of time. With it down, everything in the write path crashes repeatedly, which leads to the growing-state problem. The only course of action seems to be to stop everything / scale all components to 0 so that there is no starting cluster state.

Steps to reproduce (if applicable)
Steps to reproduce the behavior:
Our ingestion setup is 3 metastore nodes, 1 postgres primary, 30 indexer pods, and 1 control plane node. We typically do 500 MB/s of ingest across 3K indexes.

After running for some time, kill the postgres instance and let it be down for 10 minutes to allow the cluster state to accumulate a few hundred dead nodes. Then restart postgres and wait for ingestion to become healthy again.

Expected behavior
If it isn't really feasible to prune the dead-node list or bound the size of the state because of the details of scuttlebutt, the cluster should be much more tolerant of the database being down. Crashing may not be the best course of action, as it ultimately prevents the cluster from ever becoming healthy again.
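For illustration only, the kind of behavior we mean for the write path when the metastore is unreachable, sketched with made-up function names rather than Quickwit's actual internals, is to retry with backoff instead of exiting:

// Hypothetical sketch: back off and retry instead of exiting the process while the
// metastore (postgres) is down. `connect_metastore` is a stand-in, not a real API.
use std::time::Duration;

#[derive(Debug)]
struct MetastoreUnavailable;

fn connect_metastore() -> Result<(), MetastoreUnavailable> {
    // Stand-in: pretend postgres is still down.
    Err(MetastoreUnavailable)
}

fn main() {
    let mut backoff = Duration::from_secs(1);
    let max_backoff = Duration::from_secs(60);
    for attempt in 1..=5u32 {
        match connect_metastore() {
            Ok(()) => {
                println!("metastore reachable again, resuming normal startup");
                return;
            }
            Err(err) => {
                // Staying alive avoids adding yet another dead entry to the gossip
                // state on every failed start, which is what feeds the loop above.
                eprintln!("attempt {attempt}: metastore unavailable ({err:?}), retrying in {backoff:?}");
                std::thread::sleep(backoff);
                backoff = (backoff * 2).min(max_backoff);
            }
        }
    }
    eprintln!("metastore still unavailable, staying up in a degraded state instead of crashing");
}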

Configuration:

// build info
{
  "build": {
    "build_date": "2024-09-19T16:35:19Z",
    "build_profile": "release",
    "build_target": "aarch64-unknown-linux-gnu",
    "cargo_pkg_version": "0.8.0",
    "commit_date": "2024-09-19T15:34:03Z",
    "commit_hash": "f3dd8f27129a418a0d30642f8e7da2235d37cfaf",
    "commit_short_hash": "f3dd8f2",
    "commit_tags": [
      "qw-airmail-20240919"
    ],
    "version": "0.8.0-nightly"
  },
  "runtime": {
    "num_cpus": 16,
    "num_threads_blocking": 14,
    "num_threads_non_blocking": 2
  }
}
-- node.yaml
version: '0.8'
listen_address: 0.0.0.0
grpc:
  max_message_size: 100MiB # default 20 MiB
metastore:
  postgres:
    max_connections: 50
searcher:
  fast_field_cache_capacity: 4G # default 1G
  aggregation_memory_limit: 1000M # default 500M
  partial_request_cache_capacity: 64M # default 64M
  max_num_concurrent_split_searches: 20  # default 100
  split_cache:
    max_num_bytes: 7500G
    max_num_splits: 1000000
    num_concurrent_downloads: 6
indexer:
  enable_cooperative_indexing: true
  cpu_capacity: 1000m
  merge_concurrency: 3
ingest_api:
  max_queue_disk_usage: 8GiB # default 2GiB
  max_queue_memory_usage: 4GiB # default 1GiB
  shard_throughput_limit: 5mb # default 5mb
storage:
  s3:
    region: us-east-1
    endpoint: https://s3.amazonaws.com
esatterwhite added the bug label on Sep 24, 2024
esatterwhite (Collaborator, Author) commented:

related to #5135
