
#3227: Wait for the keepalive period to end before stopping the acceptance of new requests #3236

Open
michdan wants to merge 2 commits into master from 3227_wait_for_keepalive_before_restart
Conversation

michdan commented Jul 4, 2024

This change prevents potential disruptions in request handling.
Currently, when a worker is restarted because it hit the max_requests limit and the keepalive period is greater than zero, there is a gap between the moment the worker stops accepting new requests and the moment a new worker starts. This happens because the process must wait for the lesser of the graceful timeout and the keepalive period, even when there are no pending requests: open connections remain active for the duration of the keepalive period.
See also: #3227
cc: @benoitc @pajod
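
For illustration, a minimal gunicorn configuration in which this gap shows up; the values are arbitrary examples, not recommendations:

# gunicorn.conf.py -- illustrative values only
worker_class = "gevent"
workers = 2
max_requests = 1000        # restart each worker after roughly 1000 requests
max_requests_jitter = 50
keepalive = 75             # idle keep-alive connections stay open for up to 75 s
graceful_timeout = 30      # upper bound on how long a stopping worker waits

With these settings, a worker that exceeds its max_requests budget stops accepting new connections, but its replacement may not start for up to min(graceful_timeout, keepalive) = 30 seconds while idle keep-alive sockets age out.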

michdan changed the title from "3227: Wait for the keepalive period to end before stopping the acceptance of new requests" to "#3227: Wait for the keepalive period to end before stopping the acceptance of new requests" on Jul 4, 2024
…ance of new requests. This will prevent potential disruptions in request handling.
michdan force-pushed the 3227_wait_for_keepalive_before_restart branch from 29b7866 to 34ce875 on July 8, 2024 at 12:11
michdan (Author) commented Jul 8, 2024

To test such functionality, consider using an approach inspired by Apache httpd: https://github.com/apache/httpd/blob/trunk/test/modules/core/test_002_restarts.py. This approach involves making assertions based on log output. While it would require implementing some additional tools, this method seems to simplify testing complex scenarios such as multiprocessing and greenlets.
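
To make that concrete, here is a rough sketch of such a log-assertion test. The gunicorn_server fixture and the ERROR_LOG path are hypothetical helpers that would start gunicorn with the relevant --max-requests/--keep-alive settings and capture its error log; the asserted phrases are the messages gunicorn logs when a worker auto-restarts and its replacement boots.

import re
import time
import urllib.request

ERROR_LOG = "/tmp/gunicorn-test-error.log"  # hypothetical location managed by the fixture

def test_restart_after_max_requests(gunicorn_server):
    # Send enough requests to push a worker past its max_requests budget.
    for _ in range(20):
        with urllib.request.urlopen("http://127.0.0.1:8000/") as resp:
            assert resp.status == 200
    time.sleep(2)  # give the worker and arbiter time to log the restart

    with open(ERROR_LOG, encoding="utf-8") as fh:
        log = fh.read()
    # Assert on log output rather than on timing, as the httpd tests do.
    assert re.search(r"Autorestarting worker", log)
    assert re.search(r"Booting worker with pid", log)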

# Retrieve all active greenlets and repeatedly check, until the keepalive period ends,
# if any of those greenlets are still active. If none are active, exit the loop.

greenlets = {id(greenlet) for server in servers for greenlet in server.pool.greenlets}
michdan (Author) commented on the lines above:
When this line is executed, self.alive is already False, which means that all keepalive connections are either closed or are being handled by the existing greenlets. No new keepalive connections will be created until the worker restarts.
Therefore, if the greenlets found in this line have finished, we do not need to wait any longer, as we are assured that there are no more active keepalive connections.

pajod (Contributor) commented Jul 9, 2024

I am not yet confident I understand all the possible implications, but whether this is changed further or not, it definitely needs documentation (it already needed documentation before, but after the change the behaviour can be even more unexpected). Maybe a time table that lists the state transitions from hitting max_requests + jitter, then awaiting keepalive, then awaiting graceful_timeout, then app initialization - and what sort of responsiveness a) existing and b) new clients should expect in each of the worker states? Those expectations could then be translated into (likely more verbose, less readable) test cases.

greenlets = {id(greenlet) for server in servers for greenlet in server.pool.greenlets}
ts = time.time()

while time.time() - ts <= self.cfg.keepalive:
pajod (Contributor) commented on the lines above:
obligatory "only do that with a monotonic clock" comment

michdan (Author) replied Jul 12, 2024:
Thanks for pointing this out. I changed it to use a monotonic clock. By the way, shouldn't it be changed in other places as well? For example, I can see: https://github.com/benoitc/gunicorn/blob/master/gunicorn/workers/ggevent.py#L98
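
For reference, a minimal sketch of the monotonic variant, pulled out into a standalone helper; the helper name and the polling interval are illustrative assumptions, not what the patch itself uses:

import time
import gevent

def wait_for_keepalive_greenlets(servers, keepalive_timeout, poll_interval=0.1):
    # Snapshot the greenlets that were serving connections when the worker stopped accepting.
    snapshot = {id(g) for server in servers for g in server.pool.greenlets}
    deadline = time.monotonic() + keepalive_timeout
    while time.monotonic() < deadline:
        active = {id(g) for server in servers for g in server.pool.greenlets}
        if not (snapshot & active):
            # None of the snapshotted greenlets is left, so no keepalive connection can remain.
            break
        gevent.sleep(poll_interval)  # yield so the remaining connections can make progress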

pajod (Contributor) replied:
@michdan Absolutely correct, there are other places. But until I can improve my tests to actually cover such unimportant corner cases, my focus is on improving it only where I am touching the timeout behaviour anyway (as of now: here, plus in #3157).

benoitc (Owner) left a comment:

I think only the first part of this change (ensuring we don't process keepalive during closing) is OK.

I disagree with the second part. Can you update your patch to only take care of the first one?

@@ -87,6 +87,10 @@ def run(self):
            self.notify()
            gevent.sleep(1.0)

        # Wait for pending keepalive connections to complete before stopping the acceptance of new requests
        if self.cfg.keepalive:
benoitc (Owner) commented on the lines above:
I disagree there; we shouldn't care about waiting for keepalive connections when closing. Close should be done as fast as possible.

@@ -36,7 +36,8 @@ def handle(self, listener, client, addr):
            parser = http.RequestParser(self.cfg, client, addr)
            try:
                listener_name = listener.getsockname()
                if not self.cfg.keepalive:
                    # do not allow keepalive if the worker is about to be restarted
benoitc (Owner) commented on the lines above:
this change is OK
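
For context, a hedged sketch of what that accepted first part might look like inside the async worker's handle(). The excerpt above does not show the new condition, so using self.alive to detect "about to be restarted" is an assumption rather than what the patch necessarily does; the body simply falls through to the existing single-request (non-keepalive) path.

                if not self.cfg.keepalive or not self.alive:
                    # do not allow keepalive if the worker is about to be restarted
                    req = next(parser)
                    self.handle_request(listener_name, req, client, addr)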
