Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

KAFKA-16143: New JMX metrics for AsyncKafkaConsumer #17199

Open
wants to merge 1 commit into
base: trunk
Choose a base branch
from

Conversation

FrankYang0529
Copy link
Member

Add following metrics to AsyncKafkaConsumer:

  • time-between-network-thread-poll-avg
  • time-between-network-thread-poll-max
  • application-event-queue-size
  • application-event-queue-time-avg
  • application-event-queue-time-max
  • application-event-queue-processing-time-avg
  • application-event-queue-processing-time-max
  • unsent-requests-queue-size
  • unsent-requests-queue-time-avg
  • unsent-requests-queue-time-max
  • background-event-queue-size
  • background-event-queue-time-avg
  • background-event-queue-time-max
  • background-event-queue-processing-time-avg
  • background-event-queue-processing-time-max

Committer Checklist (excluded from commit message)

  • Verify design and implementation
  • Verify test coverage and CI build status
  • Verify documentation (including upgrade notes)

@lianetm lianetm added consumer KIP-848 The Next Generation of the Consumer Rebalance Protocol ctr Consumer Threading Refactor (KIP-848) labels Sep 16, 2024
Copy link
Collaborator

@kirktrue kirktrue left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @FrankYang0529 for working on this!

I am intentionally not going to look at this too in-depth because it's a draft at the moment. However, it appears that the KafkaConsumerMetrics object is being used in both the application thread and background thread. From a quick glance, KafkaConsumerMetrics doesn't look thread-safe 🤔

If that's the case, we'll need to figure out how we can avoid potential synchronization issues.

Thanks!

@kirktrue
Copy link
Collaborator

kirktrue commented Oct 8, 2024

@FrankYang0529—I wanted to check in and see if you had any questions or needed anything to progress on this. Thanks!

@FrankYang0529
Copy link
Member Author

Hi @kirktrue, sorry for late. I will make this PR ready today.

@FrankYang0529
Copy link
Member Author

If that's the case, we'll need to figure out how we can avoid potential synchronization issues.

Hi @kirktrue, good catch. There is synchronized in Sensor#recordInternal. In this case, do we still need a lock in KafkaConsumerMetrics?

private void recordInternal(double value, long timeMs, boolean checkQuotas) {
this.lastRecordTime = timeMs;
synchronized (this) {
synchronized (metricLock()) {
// increment all the stats
for (StatAndConfig statAndConfig : this.stats) {
statAndConfig.stat.record(statAndConfig.config(), value, timeMs);
}
}
if (checkQuotas)
checkQuotas(timeMs);
}
for (Sensor parent : parents)
parent.record(value, timeMs, checkQuotas);
}

I wanted to check in and see if you had any questions or needed anything to progress on this.

The PR is almost ready. I will wait to see CI result. Thanks.

@FrankYang0529 FrankYang0529 marked this pull request as ready for review October 9, 2024 16:30
Copy link
Collaborator

@kirktrue kirktrue left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR @FrankYang0529!

This looks pretty comprehensive. I haven't had time to dive as deep as I would like, but I left a first pass of comments.

Thanks!

Comment on lines 1756 to 1914
kafkaConsumerMetrics.recordBackgroundEventQueueTime(time.milliseconds() - event.addedToQueueMs());
long startMs = time.milliseconds();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the metric determine how long it takes to process the background event queue, or how long it takes to process background events? If it's the latter, we want to update the metric inside each loop, but if it's the former we should update once outside the loop.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, this metric determine how long a background event is taking to be dequeued. In BackgroundEventHandler#add, we run BackgroundEvent#setAddedToQueueMs. When we start to process it, we can use current time - event.addedToQueueMs to know how long an event is in the queue.

@@ -139,6 +155,9 @@ void runOnce() {
processApplicationEvents();

final long currentTimeMs = time.milliseconds();
final long timeSinceLastPollMs = lastPollTimeMs != 0L ? currentTimeMs - lastPollTimeMs : currentTimeMs;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the first invocation of runOnce(), timeSinceLastPollMs will be something like 1728954137284. Is that correct?

Comment on lines 94 to 102
public ConsumerNetworkThread(LogContext logContext,
Time time,
BlockingQueue<ApplicationEvent> applicationEventQueue,
CompletableEventReaper applicationEventReaper,
Supplier<ApplicationEventProcessor> applicationEventProcessorSupplier,
Supplier<NetworkClientDelegate> networkClientDelegateSupplier,
Supplier<RequestManagers> requestManagersSupplier) {
this(logContext, time, applicationEventQueue, applicationEventReaper, applicationEventProcessorSupplier,
networkClientDelegateSupplier, requestManagersSupplier, Optional.empty());
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to remove this constructor? Can't we call the existing constructor with Optional.empty()?

Comment on lines 94 to 101
public NetworkClientDelegate(
final Time time,
final ConsumerConfig config,
final LogContext logContext,
final KafkaClient client,
final Metadata metadata,
final BackgroundEventHandler backgroundEventHandler) {
this(time, config, logContext, client, metadata, backgroundEventHandler, Optional.empty());
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to remove this constructor? Can't callers invoke the existing constructor with Optional.empty()?

Comment on lines 65 to 78
public void setAddedToQueueMs(long addedToQueueMs) {
this.addedToQueueMs = addedToQueueMs;
}

public long addedToQueueMs() {
return addedToQueueMs;
}

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
public void setAddedToQueueMs(long addedToQueueMs) {
this.addedToQueueMs = addedToQueueMs;
}
public long addedToQueueMs() {
return addedToQueueMs;
}
public void setEnqueuedMs(long enqueuedMs) {
enqueuedMs = enqueuedMs;
}
public long enqueuedMs() {
return enqueuedMs;
}

@@ -73,7 +90,9 @@ public ApplicationEventHandler(final LogContext logContext,
*/
public void add(final ApplicationEvent event) {
Objects.requireNonNull(event, "ApplicationEvent provided to add must be non-null");
event.setAddedToQueueMs(System.currentTimeMillis());
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs to get the current time in milliseconds from the Time object that was passed in to the constructor.

Comment on lines 338 to 335
void setAddedToQueueMs(final long addedToQueueMs) {
this.addedToQueueMs = addedToQueueMs;
}

long addedToQueueMs() {
return addedToQueueMs;
}

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know it's a bit nit-picky, but can we change it to:

Suggested change
void setAddedToQueueMs(final long addedToQueueMs) {
this.addedToQueueMs = addedToQueueMs;
}
long addedToQueueMs() {
return addedToQueueMs;
}
void setEnqueuedMs(final long enqueuedMs) {
this.enqueuedMs = enqueuedMs;
}
long enqueuedMs() {
return enqueuedMs;
}

Comment on lines 55 to 66
public void setAddedToQueueMs(long addedToQueueMs) {
this.addedToQueueMs = addedToQueueMs;
}

public long addedToQueueMs() {
return addedToQueueMs;
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
public void setAddedToQueueMs(long addedToQueueMs) {
this.addedToQueueMs = addedToQueueMs;
}
public long addedToQueueMs() {
return addedToQueueMs;
}
public void setEnqueuedMs(long enqueuedMs) {
this. enqueuedMs = enqueuedMs;
}
public long enqueuedMs() {
return enqueuedMs;
}

Comment on lines 43 to 44
public BackgroundEventHandler(final Queue<BackgroundEvent> backgroundEventQueue) {
this(backgroundEventQueue, Optional.empty());
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same with the other changes. I'd rather make the callers pass in Optional.empty().

@@ -42,6 +51,8 @@ public BackgroundEventHandler(final Queue<BackgroundEvent> backgroundEventQueue)
*/
public void add(BackgroundEvent event) {
Objects.requireNonNull(event, "BackgroundEvent provided to add must be non-null");
event.setAddedToQueueMs(System.currentTimeMillis());
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment here: we need to update the constructor to provide a Time object and then use that here.

@FrankYang0529
Copy link
Member Author

Hi @kirktrue, I address all comments. Could you take a look when you have time? Thank you.

Copy link
Collaborator

@kirktrue kirktrue left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @FrankYang0529!

Overall, I think it's headed in the right direction. I have mostly minor requests, but the one that's slightly larger is class hierarchy around the KafkaConsumerMetrics. If you want to punt on that, file a Jira and I'll take care of it later.

I didn't get to the unit tests yet, though 😢

Thanks!

Comment on lines +65 to +73
public void setEnqueuedMs(long enqueuedMs) {
this.enqueuedMs = enqueuedMs;
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like that we're introducing mutability into all the events. I understand that this changes a lot less code than having the constructor take the creation time or something.

Can you add a comment to the enqueued variable that states that because of its mutability that it should not be used in hashCode() or equals() or toStringBase() (or something along those lines)?

@@ -37,6 +37,7 @@ public enum Type {
* {@link #equals(Object)} and can be used in log messages when debugging.
*/
private final Uuid id;
private long enqueuedMs;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as per above with ApplicationEvent.

kafkaConsumerMetrics.ifPresent(metrics -> metrics.recordApplicationEventQueueTime(time.milliseconds() - event.enqueuedMs()));
long startMs = time.milliseconds();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we swap the ordering of these two lines to avoid the extra call to time.milliseconds()?

Suggested change
kafkaConsumerMetrics.ifPresent(metrics -> metrics.recordApplicationEventQueueTime(time.milliseconds() - event.enqueuedMs()));
long startMs = time.milliseconds();
long startMs = time.milliseconds();
kafkaConsumerMetrics.ifPresent(metrics -> metrics.recordApplicationEventQueueTime(startMs - event.enqueuedMs()));

@@ -162,15 +171,20 @@ void runOnce() {
private void processApplicationEvents() {
LinkedList<ApplicationEvent> events = new LinkedList<>();
applicationEventQueue.drainTo(events);
kafkaConsumerMetrics.ifPresent(metrics -> metrics.recordApplicationEventQueueSize(applicationEventQueue.size()));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once we drain applicationEventQueue, it'll be empty, right? If, so why not just do:

Suggested change
kafkaConsumerMetrics.ifPresent(metrics -> metrics.recordApplicationEventQueueSize(applicationEventQueue.size()));
kafkaConsumerMetrics.ifPresent(metrics -> metrics.recordApplicationEventQueueSize(0));

private ApplicationEventProcessor applicationEventProcessor;
private NetworkClientDelegate networkClientDelegate;
private RequestManagers requestManagers;
private volatile boolean running;
private final IdempotentCloser closer = new IdempotentCloser();
private volatile Duration closeTimeout = Duration.ofMillis(DEFAULT_CLOSE_TIMEOUT_MS);
private volatile long cachedMaximumTimeToWait = MAX_POLL_TIMEOUT_MS;
private long lastPollTimeMs = 0L;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is only ever written to and read from the same thread, right?

@@ -42,6 +47,8 @@ public BackgroundEventHandler(final Queue<BackgroundEvent> backgroundEventQueue)
*/
public void add(BackgroundEvent event) {
Objects.requireNonNull(event, "BackgroundEvent provided to add must be non-null");
event.setEnqueuedMs(System.currentTimeMillis());
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to add a Time variable to this class so that we can do this instead:

Suggested change
event.setEnqueuedMs(System.currentTimeMillis());
event.setEnqueuedMs(time.milliseconds());

import java.util.concurrent.TimeUnit;

import static org.apache.kafka.clients.consumer.internals.ConsumerUtils.CONSUMER_METRICS_SUFFIX;

public class KafkaConsumerMetrics implements AutoCloseable {
private final GroupProtocol groupProtocol;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be possible to create a subclass of KafkaConsumerMetrics instead of adding group protocol-specific bits on here? I see that the ShareConsumerImpl uses a custom KafkaShareConsumerMetrics that at first glance is entirely a subset of KafkaConsumerMetrics, so there's some refactoring that could be done here.

Comment on lines 56 to 64
private Sensor timeBetweenNetworkThreadPollSensor;
private Sensor applicationEventQueueSizeSensor;
private Sensor applicationEventQueueTimeSensor;
private Sensor applicationEventQueueProcessingTimeSensor;
private Sensor backgroundEventQueueSizeSensor;
private Sensor backgroundEventQueueTimeSensor;
private Sensor backgroundEventQueueProcessingTimeSensor;
private Sensor unsentRequestsQueueSizeSensor;
private Sensor unsentRequestsQueueTimeSensor;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we make these final too? It'll be a bit ugly in the else block of the constructor to mark them all as null. But that goes away if we make this a subclass.

Comment on lines 53 to 78
metricGroupName, "The number of seconds since the last poll() invocation.");
metricGroupName, "The number of seconds since the last poll() invocation.");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we remove these whitespace changes? They seem extraneous.

Comment on lines 262 to 275

if (groupProtocol == GroupProtocol.CONSUMER) {
Arrays.asList(
timeBetweenNetworkThreadPollSensor.name(),
applicationEventQueueSizeSensor.name(),
applicationEventQueueTimeSensor.name(),
applicationEventQueueProcessingTimeSensor.name(),
backgroundEventQueueSizeSensor.name(),
backgroundEventQueueTimeSensor.name(),
backgroundEventQueueProcessingTimeSensor.name(),
unsentRequestsQueueSizeSensor.name(),
unsentRequestsQueueTimeSensor.name()
).forEach(metrics::removeSensor);
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is another place that having a subclass would help.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
clients consumer ctr Consumer Threading Refactor (KIP-848) KIP-848 The Next Generation of the Consumer Rebalance Protocol
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants