
fix: message count read in forwarder #1030

Merged (5 commits) on Sep 14, 2023

Conversation

veds-g (Contributor) commented Sep 11, 2023

Fixes #1026

readMessagesCount in forward.go was counting both data messages and control messages (which show up in low-throughput pipelines with multiple partitions), leading to a higher processing rate being calculated and shown for the sink vertex.

The count of messages read is now based only on data messages.
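For illustration, a minimal sketch of the counting change, assuming hypothetical ISB types (MessageKind, ReadMessage, and countDataMessages are stand-ins, not the actual forward.go identifiers):

```go
// Sketch: count only data messages toward the read-rate metric.
// MessageKind, KindData, KindCtrl, and ReadMessage are hypothetical
// stand-ins for the real ISB types used by the forwarder.
package main

import "fmt"

type MessageKind int

const (
	KindData MessageKind = iota // user payload
	KindCtrl                    // control message (e.g., watermark barrier)
)

type ReadMessage struct {
	Kind MessageKind
}

// countDataMessages returns the number of data messages in a read batch,
// excluding control messages so the processing rate is not inflated.
func countDataMessages(batch []ReadMessage) int {
	n := 0
	for _, m := range batch {
		if m.Kind == KindData {
			n++
		}
	}
	return n
}

func main() {
	batch := []ReadMessage{{KindData}, {KindCtrl}, {KindData}}
	fmt.Println(countDataMessages(batch)) // prints 2, not 3
}
```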

veds-g (Contributor, Author) commented Sep 11, 2023

A higher initial rate was seen for the map vertex with multiple partitions.

The same pipeline from the issue, with higher throughput, showed correct processing rates.

vigith (Member) left a comment

Should we separately track control messages?

vigith (Member) commented Sep 11, 2023

Perhaps let's publish total read metrics and also data messages; the difference should be ctrl msgs.

KeranYang (Member) commented

Should we enable multi-partition for the rater e2e test pipeline so that we catch similar issues in the future? https://github.com/numaproj/numaflow/blob/main/test/e2e/functional_test.go#L129

KeranYang (Member) commented

> Perhaps let's publish total read metrics and also data messages; the difference should be ctrl msgs.

I think the ctrl message is an implementation detail of Numaflow, so we probably shouldn't expose it to users... Now that we exclude the ctrl messages from the read total metric, it's accurate from the user's standpoint that the pipeline read N data messages. An issue with the current approach could be that the pipeline gets throttled because of high load and the user can't see the real load from the metric because ctrl msgs are excluded, but that won't happen because ctrl msgs are a very small portion of a high-throughput pipeline.

vigith (Member) commented Sep 11, 2023

> An issue with the current approach could be that the pipeline gets throttled because of high load and the user can't see the real load from the metric because ctrl msgs are excluded, but that won't happen because ctrl msgs are a very small portion of a high-throughput pipeline.

We should expose the metric even though it is not useful to users. It is quite important for us to debug if we are seeing higher than normal ctrl messages. We already expose a lot of metrics that are not relevant to users, e.g., WAL count, etc.

vigith (Member) commented Sep 13, 2023

> Perhaps let's publish total read metrics and also data messages; the difference should be ctrl msgs.

@veds-g why not just expose total read messages rather than just ctrl messages?

veds-g (Contributor, Author) commented Sep 13, 2023

> @veds-g why not just expose total read messages rather than just ctrl messages?

No particular reason; I thought we wanted control messages tracked separately. Should I go for the total instead?

vigith (Member) commented Sep 13, 2023

> No particular reason; I thought we wanted control messages tracked separately. Should I go for the total instead?

Go for the total because it represents everything.
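A rough sketch of what publishing both counters could look like with the Prometheus Go client; the metric names, labels, and the ObserveRead helper below are illustrative assumptions, not Numaflow's actual metrics code:

```go
// Sketch, assuming the prometheus Go client: expose both a total-read
// counter and a data-only counter; control messages are the difference.
// Metric and label names here are illustrative, not Numaflow's real ones.
package metrics

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

var (
	// Total messages read, including control messages (useful for debugging).
	readMessagesTotal = promauto.NewCounterVec(prometheus.CounterOpts{
		Name: "forwarder_read_total",
		Help: "Total number of messages read (data + control).",
	}, []string{"vertex", "partition"})

	// Data messages only; the processing rate should be based on this.
	readDataMessagesTotal = promauto.NewCounterVec(prometheus.CounterOpts{
		Name: "forwarder_data_read_total",
		Help: "Total number of data messages read.",
	}, []string{"vertex", "partition"})
)

// ObserveRead records a read batch of `total` messages, `dataCount` of
// which are data messages; control messages = total - dataCount.
func ObserveRead(vertex, partition string, total, dataCount int) {
	readMessagesTotal.WithLabelValues(vertex, partition).Add(float64(total))
	readDataMessagesTotal.WithLabelValues(vertex, partition).Add(float64(dataCount))
}
```

With both counters exposed, the rate calculation can use the data-only counter while operators can still derive control-message volume as the difference between the two.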

vigith (Member) left a comment

LGTM. @KeranYang please review

KeranYang (Member) left a comment

LGTM. Thank you for fixing this issue. There is one comment I want to address, but it shouldn't be a blocker for merging this PR. I will work on it myself when I have time.

vigith merged commit 4369d65 into numaproj:main on Sep 14, 2023
17 checks passed
dpadhiar pushed a commit to dpadhiar/numaflow that referenced this pull request Sep 14, 2023
whynowy pushed a commit that referenced this pull request Sep 15, 2023
Successfully merging this pull request may close these issues:
Raptor generates wrong processing rate when multi-partition isb enabled