Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Johnny/dp timeout #1720

Merged
merged 1 commit into from
Oct 21, 2024
Merged

Johnny/dp timeout #1720

merged 1 commit into from
Oct 21, 2024

Conversation

psFried
Copy link
Member

@psFried psFried commented Oct 18, 2024

Wraps connector and gazette client dials in timeouts, since the connect_timeout grpc options are insufficient to cover all cases that might block indefinitely. This is intended to fix issues with agents that stop making progress.


This change is Reviewable

@psFried psFried requested a review from jshearer October 18, 2024 17:48
Copy link
Contributor

@jshearer jshearer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Interesting, this smells very similar to the issues I was seeing with Dekaf where after some period of time gRPC requests would just hang, and my temporary work-around was to just dial a whole new TCP connection for each session to get around it.

let result = tokio::time::timeout(Duration::from_secs(10), async {
match ep.uri().scheme_str() {
Some("unix") => ep
.connect_with_connector(tower::util::service_fn(|uri: tonic::transport::Uri| {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hah, I didn't even know this was supposed to be timing out at all -- I added a timeout in Dekaf for this as well.

Copy link
Member

@jgraettinger jgraettinger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@psFried psFried added the change:emergency This change is an emergency label Oct 18, 2024
Ensure we apply top-level timeouts to actions undertaken within a data-plane:
- Staring a connector proxy environment.
- Awaiting a connector unary response.
- Activating built catalog specifications into a data-plane
- Deleting built specifications from a data-plane.

We want to preserve a firm boundary that data-planes are falliable and
control-plane operations must protect themselves from an unavailability
or undesired behavior of a data-plane.
@jgraettinger jgraettinger added change:unplanned This change is unplanned, useful for things like docs and removed change:emergency This change is an emergency labels Oct 21, 2024
@jgraettinger jgraettinger merged commit b9cfdcc into master Oct 21, 2024
4 checks passed
@jgraettinger jgraettinger deleted the johnny/dp-timeout branch October 21, 2024 16:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
change:unplanned This change is unplanned, useful for things like docs
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants