Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SRVKS-1171] [RELEASE-1.12] Graceful shutdown #130

Conversation

skonto
Copy link

@skonto skonto commented Apr 29, 2024

norbjd and others added 2 commits April 29, 2024 14:36
…sions#1203)

* Fix gateway's preStop hook: curl does not exist (anymore?) in envoy image

* Before stopping the gateway, wait until requests are finished on all public listeners (and exit anyway if it exceeds terminationGracePeriodSeconds)

* Drain listeners with appropriate endpoint

* Simpler drain + sleep

* Remove PARENT_SHUTDOWN_TIME_SECONDS and terminationGracePeriodSeconds

* Use a perl script (no need to open the admin HTTP interface!)

* Use bash instead of perl in preStop hook

* Review @skonto comments: use socket address for admin cluster

* [WIP] add graceful shutdown test and tweak CI to just run that test

* [WIP] Fix gracefulshutdown_test.go

* [WIP] try to fix race condition and lint

* [WIP] use initialTimeout + debug

* [WIP] fix gracefulshutdown_test.go logic

* [WIP] refacto and add some comments to clarify

* [WIP] fix lint

* [WIP] reintroduce kind-e2e-upgrade.yaml

* [WIP] add test case when request takes a little longer than the drain time

* [WIP] fix compilation issue

* [WIP] FIx compilation issue (again)

* [WIP] hopefully fix data race

* [WIP] refacto and hopefully fix race condition (use sync.Map)

* [WIP] fix compilation issue

* [WIP] Handle EOF

* [WIP] check gateway pod has been removed + manual debugging

* [WIP] debugging

* [WIP] more debugging

* [WIP] more debugging

* [WIP] increase livenessProbe failure threshold as I'm not sure it should return EOF

* [WIP] remove debugging related stuff

* Revert all unnecessary changes made for testing

* Revert unnecessary change (livenessProbe)

* Scale to 1 replica

* Typo

* Run gracefulshutdown test first (speed up feedback loop)

* Add a comment for terminationGracePeriodSeconds

* Don't update deployment twice

Patch env and terminationGracePeriodSeconds at the same time

* Fix bad patch

* Run gracefulshutdown test at the end

- avoids conflicts with other tests
- change gracefulshutdown test to delete all gateway pods

* Fix gracefulshutdown test

* Fix gracefulshutdown test

* Lint
@skonto
Copy link
Author

skonto commented Apr 29, 2024

/hold I will test downstream at the S-O side.

@codecov-commenter
Copy link

codecov-commenter commented Apr 29, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 62.31%. Comparing base (d7abd37) to head (c228fcc).

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
@@                Coverage Diff                @@
##           release-v1.12     #130      +/-   ##
=================================================
+ Coverage          60.63%   62.31%   +1.67%     
=================================================
  Files                 24       24              
  Lines               2002     1632     -370     
=================================================
- Hits                1214     1017     -197     
+ Misses               726      553     -173     
  Partials              62       62              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@skonto skonto changed the title [SRVKS-1171] [RELEASE-1.12] Graceful shutdown 1.12 [SRVKS-1171] [RELEASE-1.12] Graceful shutdown Apr 29, 2024
Copy link
Member

@ReToCode ReToCode left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
/approve

Copy link

openshift-ci bot commented Apr 29, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ReToCode, skonto

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ReToCode
Copy link
Member

/cherry-pick release-v1.14

@openshift-cherrypick-robot

@ReToCode: once the present PR merges, I will cherry-pick it on top of release-v1.14 in a new PR and assign it to you.

In response to this:

/cherry-pick release-v1.14

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@@ -526,7 +542,9 @@ spec:
memory: 200Mi
limits:
cpu: "1"
memory: 500Mi
memory: 800Mi
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bringing this from the upstream. @ReToCode do we need to update any docs if we bring this in?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so, we have not documented limits, just what is a typical use with a certain amount of service. Maybe as a release note to raise awareness?

@skonto
Copy link
Author

skonto commented Apr 30, 2024

/unhold

Tests passed.

@openshift-merge-bot openshift-merge-bot bot merged commit 553b97a into openshift-knative:release-v1.12 Apr 30, 2024
18 checks passed
@openshift-cherrypick-robot

@ReToCode: #130 failed to apply on top of branch "release-v1.14":

Applying: Gracefully drain connections when stopping the gateway (#1203)
Using index info to reconstruct a base tree...
M	config/300-gateway.yaml
Falling back to patching base and 3-way merge...
Auto-merging config/300-gateway.yaml
CONFLICT (content): Merge conflict in config/300-gateway.yaml
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0001 Gracefully drain connections when stopping the gateway (#1203)
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

In response to this:

/cherry-pick release-v1.14

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

openshift-merge-bot bot pushed a commit that referenced this pull request May 14, 2024
* [SRVKS-1171] [RELEASE-1.12] Graceful shutdown (#130)

* Gracefully drain connections when stopping the gateway (knative-extensions#1203)

* Fix gateway's preStop hook: curl does not exist (anymore?) in envoy image

* Before stopping the gateway, wait until requests are finished on all public listeners (and exit anyway if it exceeds terminationGracePeriodSeconds)

* Drain listeners with appropriate endpoint

* Simpler drain + sleep

* Remove PARENT_SHUTDOWN_TIME_SECONDS and terminationGracePeriodSeconds

* Use a perl script (no need to open the admin HTTP interface!)

* Use bash instead of perl in preStop hook

* Review @skonto comments: use socket address for admin cluster

* [WIP] add graceful shutdown test and tweak CI to just run that test

* [WIP] Fix gracefulshutdown_test.go

* [WIP] try to fix race condition and lint

* [WIP] use initialTimeout + debug

* [WIP] fix gracefulshutdown_test.go logic

* [WIP] refacto and add some comments to clarify

* [WIP] fix lint

* [WIP] reintroduce kind-e2e-upgrade.yaml

* [WIP] add test case when request takes a little longer than the drain time

* [WIP] fix compilation issue

* [WIP] FIx compilation issue (again)

* [WIP] hopefully fix data race

* [WIP] refacto and hopefully fix race condition (use sync.Map)

* [WIP] fix compilation issue

* [WIP] Handle EOF

* [WIP] check gateway pod has been removed + manual debugging

* [WIP] debugging

* [WIP] more debugging

* [WIP] more debugging

* [WIP] increase livenessProbe failure threshold as I'm not sure it should return EOF

* [WIP] remove debugging related stuff

* Revert all unnecessary changes made for testing

* Revert unnecessary change (livenessProbe)

* Scale to 1 replica

* Typo

* Run gracefulshutdown test first (speed up feedback loop)

* Add a comment for terminationGracePeriodSeconds

* Don't update deployment twice

Patch env and terminationGracePeriodSeconds at the same time

* Fix bad patch

* Run gracefulshutdown test at the end

- avoids conflicts with other tests
- change gracefulshutdown test to delete all gateway pods

* Fix gracefulshutdown test

* Fix gracefulshutdown test

* Lint

* run hack/update-deps.sh

* update openshift files

---------

Co-authored-by: norbjd <[email protected]>

* update deps

---------

Co-authored-by: norbjd <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants