Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Event stream reconnection fixes #6777

Merged

Conversation

adriansmares
Copy link
Contributor

@adriansmares adriansmares commented Dec 17, 2023

Summary

References #6776
References #6752

Changes

  • Add context error propagation (to be used with Implement Custom Error Codes for WebSocket API #6752 , if we can get that work properly).
  • Add server side websocket pings.
    • They keep the connection open on low traffic streams in order to avoid having periodic reconnections (every 10 minutes depending on the browser).
  • Kill event streams when a subscription request from an invalid caller (i.e. token expired) arrives.
    • This ensures that the client doesn’t need to manually kill connections after refreshing.
  • Add conversion methods for the websocket stream error types, along with dedicated error types.
  • Catch TokenErrors properly during connection.
  • Always retry connections if the events start failure occurs.
    • This is extremely relevant if the error is not a timeout (i.e. if the endpoint is unavailable).
  • Fix stream closed validation logic (no return and bad condition).
  • Simplify event filter logic, as close cannot throw.
  • Failing to start events now marks the stream as interrupted.
  • An event stream error is simply informative now, without any logic. Stream closure drives reconnection.
  • Decouple subscriptions, from instances, from URLs in the WebSocket control functions.
    • This ensures that we don’t mix up currently active connections and subscriptions.
  • Clarify stream implementation details and guarantees.
  • Force event listeners to be pre-provided while establishing a connection, in order to avoid race conditions (events being emitted before the listeners are attached).
  • Extend the logic used when combining multiple streams to take into account that one stream should externally ‘look’ as if it was one stream with respect to the emitted events. Also ensure that we properly clean up failed attempts.
  • Connections are no longer closed when the last subscription is closed.
    • This is very important for UX in non eu1 regions, where the connection setup can take up to 700 ms, which translates to 700 ms extra on every page load for a gateway or application view.

Testing

Tested over the weekend in TTSCE, on both eu1 and nam1, using both low and high traffic streams. Streams were kept open for over 12 hours in order to monitor token refresh, resubscription and connection stability.

Many page loads were involved, but I recommend locally testing using the following two changes:

  1. What if the Console always rejects your subscribe attempt, immediately ?
    Comment out the following line:

    c.RegisterWeb(events.New(c))

    The backend will now instantly reject your connection attempts. The Console should still attempt to reconnect in such situations. It can be the case that this happens during upgrades or downtimes.

  2. What if connections take really long to establish, but only sometimes ?
    Add the following line:

	time.Sleep(random.Jitter(20*time.Second, 0.6))

Just at the first of this function:

func (h *eventsHandler) handleEvents(w http.ResponseWriter, r *http.Request) {

The backend will now act inherently slowed down - sometimes your connections will take 4 seconds to establish, sometimes 16 seconds. The Console must be able to establish a connection with the backend, at some point, even when the backend is so erratic.

I ran the tests above and I recommend that we use them going forward when we make changes to the event streams. I do not recommend dedicated release testing for this PR - it fixes some very specific race conditions which cannot be manually reproduced with any form of ease.

Regressions

I've edited the logics quite a lot here, without having a huge amount of experience with this. While my manual testing has been quite thorough, I do need a second pair of eyes on these changes.

Checklist

  • Scope: The referenced issue is addressed, there are no unrelated changes.
  • Compatibility: The changes are backwards compatible with existing API, storage, configuration and CLI, according to the compatibility commitments in README.md for the chosen target branch.
  • Documentation: Relevant documentation is added or updated.
  • The steps/process to test this feature are clearly explained including testing for regressions.
  • Changelog: Significant features, behavior changes, deprecations and fixes are added to CHANGELOG.md.
  • Commits: Commit messages follow guidelines in CONTRIBUTING.md, there are no fixup commits left.

@adriansmares adriansmares added the bug Something isn't working label Dec 17, 2023
@adriansmares adriansmares added this to the v3.29.0 milestone Dec 17, 2023
@adriansmares adriansmares self-assigned this Dec 17, 2023
@github-actions github-actions bot added c/console This is related to the Console ui/web This is related to a web interface labels Dec 17, 2023
@adriansmares adriansmares force-pushed the fix/event-stream-reconnect-5-adrian branch from d7acadc to db1c28f Compare December 17, 2023 12:23
pkg/console/internal/events/events.go Show resolved Hide resolved

return createUnknownErrorEvent({ error: errorString })
export const createSyntheticEventFromError = error => {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did not find a good way to import the actual types in order to do instanceof from the SDK, so I used the name instead. I am open to suggestions if we can do that.

Copy link
Contributor

@kschiffer kschiffer Dec 18, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could create all relevant error classes in JS and then create a mapping to do sth like:

class ConnectionError extends Error {
  constructor(err) {
    super(err)
    this.name = err.name
    this.code = err.code
    this.message = err.message
    // Add any other property specific to the error type
    
  }
}

const classMap = {
  "ConnectionError": ConnectionError,}
// In the error ingestion the error would then be instantiated like this
catch (err) {
  if (isContractualError(err)) { // some form of check that the error is a well-known error
    throw new classMap[err.name]
  }
  throw new Error("An unknown error occurred")
}
// Then in the err util it could be checked like this
if (error instanceof classMap["ConnectionError"]) {
  
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My point was that if we do that, we don't guarantee that the type caught is really the intended type - if two files in the JS SDK (accidentally) decide to define a ConnectionError, and we use name to identify name, we may mix them up, while the type would correctly be identified via instanceof.

I think this is more goldplating at the moment, and if we end up with such a situation we should just create an explicit errors package in the SDK in order to avoid duplication or conflicts.

pkg/webui/console/lib/events/utils.js Show resolved Hide resolved
sdk/js/src/api/stream/shared.js Show resolved Hide resolved
@adriansmares adriansmares marked this pull request as ready for review December 17, 2023 12:50
@adriansmares adriansmares requested a review from a team as a code owner December 17, 2023 12:50
Copy link
Contributor

@kschiffer kschiffer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice. I have some small suggestions for improvement. I really did not expect this refactor to become this complex, but there's indeed many new edge cases that needed to be addressed. Thanks for helping out with this.

pkg/webui/console/lib/events/utils.js Show resolved Hide resolved
pkg/webui/console/store/middleware/logics/events.js Outdated Show resolved Hide resolved
sdk/js/src/api/stream/subscribeToWebSocketStreams.js Outdated Show resolved Hide resolved
@adriansmares adriansmares force-pushed the fix/event-stream-reconnect-5-adrian branch from f91b421 to 1b3b204 Compare December 18, 2023 10:33
@adriansmares adriansmares merged commit e798ae7 into fix/event-stream-reconnect-4 Dec 18, 2023
14 checks passed
@adriansmares adriansmares deleted the fix/event-stream-reconnect-5-adrian branch December 18, 2023 10:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working c/console This is related to the Console ui/web This is related to a web interface
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants