Hello OpenMessaging maintainers - thank you for your work on this project!
I'm running some very large benchmarks (20k consumers) spread out over 8 very large machines to simulate a large scale test. As the test nears completion (I suspect when results are aggregated) there are numerous consumer errors. These result in the WorkloadGenerator timing out.
Even with smaller tests, I see timeouts due to consumer errors at the end of the tests.
For example, I can see the aggregate high-level stats are OK:
There are no errors, no backlog, and steady throughput at the rate I've specified. However, the test times out and the consumer logs show errors. Note that with high-throughput tests and fewer consumers I do not see the issue.
Example Test Setup
Driver:
```yaml
name: MyTestDriver
driverClass: io.openmessaging.benchmark.driver.kafka.KafkaBenchmarkDriver

# Kafka client-specific configuration
replicationFactor: 3
reset: true

topicConfig: |
  min.insync.replicas=2

commonConfig: |
  bootstrap.servers=myserver:9092
  # add additional time for large scale tests.
  request.timeout.ms=480000
  socket.connection.setup.timeout.ms = 30000
  socket.connection.setup.timeout.max.ms = 60000

producerConfig: |
  acks=all
  linger.ms=1
  batch.size=1048576

consumerConfig: |
  auto.offset.reset=earliest
  enable.auto.commit=false
  max.partition.fetch.bytes=10485760
  # try to fix the issue of consumer not consuming messages
  max.poll.records = 50
  max.poll.interval.ms = 300000
  #session.timeout.ms = 90000
  #group.min.session.timeout.ms=6000
  #group.max.session.timeout.ms=92000
```
Workload:
```yaml
name: services-1

# overprovision partitions to rule out contention
partitionsPerTopic: 15
messageSize: 1024
payloadFile: "payload/payload-1Kb.data"
subscriptionsPerTopic: 1
consumerPerSubscription: 12
producersPerTopic: 1
producerRate: 9000
consumerBacklogSizeGB: 0
warmupDurationMinutes: 1
testDurationMinutes: 3
```
Consumer Errors
Some of the consumer errors include:
```
15:24:15.049 [pool-116-thread-1] ERROR ConsumerCoordinator - [Consumer clientId=consumer-sub-000-Pp8hAjg-115, groupId=sub-000-Pp8hAjg] Offset commit with offsets {test-topic-0000022-tQSNfvA-0=OffsetAndMetadata{offset=2082, leaderEpoch=null, metadata=''}} failed
org.apache.kafka.clients.consumer.RetriableCommitFailedException: Offset commit failed with a retriable exception. You should retry committing the latest consumed offsets.
Caused by: org.apache.kafka.common.errors.DisconnectException

15:24:14.921 [pool-13-thread-1] ERROR ConsumerCoordinator - [Consumer clientId=consumer-sub-000-trBz_PI-12, groupId=sub-000-trBz_PI] Offset commit with offsets {test-topic-0000141-ZzmWoJA-0=OffsetAndMetadata{offset=2080, leaderEpoch=null, metadata=''}} failed
org.apache.kafka.clients.consumer.RetriableCommitFailedException: Offset commit failed with a retriable exception. You should retry committing the latest consumed offsets.
Caused by: org.apache.kafka.common.errors.NotCoordinatorException: This is not the correct coordinator.
```
I've attached the output and logs of a test that exhibits these symptoms (benchmark-output.txt and sanitized-benchmark-worker.log).
Would you have any suggestions for running the benchmarks with extremely large numbers of consumers? Happy to provide more information if you need it.
Edit: Perhaps all of the rebalancing is blocking some requests around shutdown/closing consumers?
Thanks!
ColinSullivan1 changed the title from "Errors with Large Consumer Counts Prevent Test Completion (Kafka Driver)" to "Large Number of Consumers Prevent Test Completion (Kafka Driver)" on Aug 27, 2024.
FYI, I tried setting the internal property internal.leave.group.on.close=false, which didn't seem to make a difference on the system I was benchmarking.
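For anyone wanting to reproduce that experiment, the property went into the driver's consumerConfig block alongside the other settings (a sketch, not my exact file):

```yaml
consumerConfig: |
  auto.offset.reset=earliest
  enable.auto.commit=false
  internal.leave.group.on.close=false
```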
Adding some instrumentation showed that the Kafka driver's close API was taking up to 800 milliseconds per consumer to complete. Waiting for all consumers to close would have taken quite a long time (20,000 consumers at ~0.8 s each is roughly 4.5 hours if fully serialized).
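The instrumentation was nothing fancy — roughly along these lines (the `timedClose` helper is illustrative, not the driver's actual code; it works for any `AutoCloseable`, including a `KafkaConsumer`):

```java
import java.util.concurrent.TimeUnit;

public class CloseTimer {
    /** Calls close() on the resource and returns how long it took, in milliseconds. */
    static long timedClose(AutoCloseable c) throws Exception {
        long start = System.nanoTime();
        c.close();
        return TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
    }
}
```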
One more data point: as an experiment, I commented out consumer.close() in the Kafka driver, and this test made it to completion.
I'm wondering if gathering stats and writing the output file before closing the consumers would work, followed by invoking the consumer.close() APIs in an executor to parallelize the work that occurs during close.
Another option might be to use the close API that accepts a Duration, but with 20k consumers I'm not sure that would help enough on its own.
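To make the parallel-close idea concrete, here's a rough sketch of what I have in mind (the `closeAll` helper and the use of plain `AutoCloseable` are illustrative, not the driver's actual API; for a `KafkaConsumer` the inner call could also be `close(Duration)` to bound each individual close):

```java
import java.time.Duration;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ParallelClose {
    /**
     * Close every resource on a fixed-size thread pool instead of sequentially.
     * With 20k consumers at ~800 ms each, a serial loop takes hours; a pool of
     * N threads divides that by roughly N, and the overall timeout bounds the
     * total wait so a few stuck consumers can't hang the benchmark shutdown.
     */
    static void closeAll(List<? extends AutoCloseable> consumers,
                         int threads, Duration overallTimeout)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (AutoCloseable c : consumers) {
            pool.submit(() -> {
                try {
                    c.close();
                } catch (Exception e) {
                    // A failed close during shutdown is not fatal for the benchmark.
                    System.err.println("close failed: " + e);
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(overallTimeout.toMillis(), TimeUnit.MILLISECONDS);
    }
}
```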
I notice the worker.stopAll() API is called in WorkloadGenerator.run() before the method exits (and before the results file is generated). This prevents the results from being generated, as worker.stopAll() takes a very long time to complete and the benchmark times out. Note that worker.stopAll() is also called later during the workload shutdown.
Removing this line allows me to generate the results file. The benchmark still times out when the subsequent worker.stopAll() call is made, but at least I can get results.
Is worker.stopAll() necessary in WorkloadGenerator.run()?