TaskStatus.started() being a no-op if cancelled can cause deadlocks #2895

oremanj · 2023-12-01T18:06:18Z

TaskStatus.started() starts with the logic:

        # If the old nursery is cancelled, then quietly quit now; the child
        # will eventually exit on its own, and we don't want to risk moving
        # children that might have propagating Cancelled exceptions into
        # a place with no cancelled cancel scopes to catch them.
        assert self._old_nursery._cancel_status is not None
        if self._old_nursery._cancel_status.effectively_cancelled:
            return

However, this logic does not hold if the function calling started() then proceeds to do its work in a shielded cancel scope. This describes trio-asyncio, and caused the deadlock reported in python-trio/trio-asyncio#115.

trio-asyncio can work around this by shielding the call to start(), but maybe Trio could also reconsider the logic in started()? I think at minimum it's safe to move the task if the new nursery is also effectively cancelled. It should also be safe to move the task if no Cancelled exception has actually been raised yet, but that's hard to track and maybe not worth it?

The text was updated successfully, but these errors were encountered:

njsmith · 2023-12-02T07:37:31Z

Maybe started() should raise an exception, RuntimeError or whatever, if the call to `start` has ever been in the cancelled state?

…

On Fri, Dec 1, 2023, 10:06 Joshua Oreman ***@***.***> wrote: TaskStatus.started() starts with the logic: # If the old nursery is cancelled, then quietly quit now; the child # will eventually exit on its own, and we don't want to risk moving # children that might have propagating Cancelled exceptions into # a place with no cancelled cancel scopes to catch them. assert self._old_nursery._cancel_status is not None if self._old_nursery._cancel_status.effectively_cancelled: return However, this logic does not hold if the function calling started() then proceeds to do its work in a shielded cancel scope. This describes trio-asyncio, and caused the deadlock reported in python-trio/trio-asyncio#115 <python-trio/trio-asyncio#115>. trio-asyncio can work around this by shielding the call to start(), but maybe Trio could also reconsider the logic in started()? I think at minimum it's safe to move the task if the new nursery is also effectively cancelled. It should also be safe to move the task if no Cancelled exception has actually been raised yet, but that's hard to track and maybe not worth it? — Reply to this email directly, view it on GitHub <#2895>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAEU42HCD6HJ6OOQFBXNQVDYHIMCPAVCNFSM6AAAAABADGRH7KVHI2DSMVQWIX3LMV43ASLTON2WKOZSGAZDCMZVGYZDKMQ> . You are receiving this because you are subscribed to this thread.Message ID: ***@***.***>

gschaffner · 2023-12-02T14:22:51Z

i noticed this too after hitting that trio-asyncio bug a few months ago. at the time, i wrote up a report and a corresponding patchset to propose, but i got busy and never got around to cleaning up the patchset and posting both. here's that old report (it is partially repetitive of what you wrote above, but i share it still because it notes some other symptoms alongside the deadlock and compares some alternatives for fixing this):

task_status.started doesn't reparent a task if it is effectively cancelled at the time that it calls started. this causes some problems:

deadlocks:

@asynccontextmanager
async def open_foo():
    async def serve(*, task_status):
        send_channel, receive_channel = trio.open_memory_channel(100)
        with receive_channel:
            task_status.started(send_channel)
            with CancelScope(shield=True):
                # Shield to ensure that even if we get a cancel request, we keep
                # processing until the queue is empty and closed.
                #
                # This is important because we need to guarantee that everything added
                # to the queue before send_channel is closed will get processed
                # (including things added to the queue by cleanup code). (Indeed, this
                # behavior is service-nursery-like.)
                async for item in receive_channel:
                    print(f"{item=!r}")

    async with trio.open_nursery() as nursery:
        send_channel = await nursery.start(serve)
        with send_channel:
            # In real code we'd yield an object with methods that add items to
            # the queue.
            yield


async def main():
    with CancelScope() as cancel_scope:
        cancel_scope.cancel()
        async with open_foo():
            print("this should get printed and then the program should exit")

(in real code using trio-asyncio, i encountered this deadlock occurring. the example above is a minimal reproducer for trio_asyncio shielding cancellation causes errors to be silenced. trio-asyncio#115 without all of the trio-asyncio stuff.)

in serve, when started returns it indicates to serve that start has been unblocked, meaning that the context manager open_foo() is guaranteed to close send_channel. with this knowledge, serve knows that it's safe for it to shield until both send_channel is closed and the channel buffer is empty.

the bug is that the started() call returns without reparenting the serve task. that is, started() succeeds and tells serve that start() has been unblocked, but start() has not actually been unblocked.

thus start() and serve get stuck in a deadlock waiting for each other. this
deadlock is also immune to KI/SIGINT; you have to hit it with SIGTERM/etc. instead.

if a task is effectively cancelled when it calls started() but becomes not effectively cancelled before cancellation is delivered to it, trio can mistakenly deliver cancellation to it. e.g.:

async def foo(*, task_status):
    task_status.started()
    # This task is no longer effectively cancelled.
    await trio.lowlevel.checkpoint()
    print("this should get printed")


async def main():
    async with trio.open_nursery() as nursery:
        with CancelScope() as cancel_scope:
            cancel_scope.cancel()
            await nursery.start(foo)

exceptions raised after task_status.started() can come out of the wrong place. e.g.:

async def foo(*, task_status):
    eventual_parent_nursery = trio.lowlevel.current_task().eventual_parent_nursery
    task_status.started()
    raise RuntimeError(
        f"I should come out of {eventual_parent_nursery!r}.__aexit__, not start()"
    )


async def main():
    async with trio.open_nursery() as nursery:
        nursery.cancel_scope.cancel()
        try:
            await nursery.start(foo)
        except RuntimeError:
            assert False

Task.eventual_parent_nursery can violate its documentation and Task.parent_nursery can violate the documentation of start. e.g.:

async def foo(eventual_parent_nursery, *, task_status):
    task_status.started()
    assert trio.lowlevel.current_task().parent_nursery is eventual_parent_nursery
    assert trio.lowlevel.current_task().eventual_parent_nursery is None


async def main():
    async with trio.open_nursery() as nursery:
        nursery.cancel_scope.cancel()
        await nursery.start(foo, nursery)

Cancellation blocks task_status.started #2544 is also in part due to this. its started call returns without reparenting t, so start blocks until t finishes completely. then, after t finishes, start raises Cancelled from its tmp_nursery.__aexit__.

DECEMBER EDIT: the interal nursery of start() should no longer raise Cancelled from its tmp_nursery.__aexit__ (Nursery: don't act as a checkpoint when not running anything #1696). so one of the two bugs causing Cancellation blocks task_status.started #2544 has been fixed, and the remaining problem causing it is that started() is being a no-op.

a comment in task_status.started discusses why it has the current logic:

trio/trio/_core/_run.py

Lines 791 to 796 in 4f17d2b

    
           # If the old nursery is cancelled, then quietly quit now; the child 
        
           # will eventually exit on its own, and we don't want to risk moving 
        
           # children that might have propagating Cancelled exceptions into 
        
           # a place with no cancelled cancel scopes to catch them. 
        
           if self._old_nursery._cancel_status.effectively_cancelled: 
        
               return

proposal: make task_status.started() never be a no-op return. it should either reparent and return, or fail to reparent and raise. there should be no third option where it returns success but secretly failed.

there are some different possible approaches to doing that:

let task_status.started succeed when called under an effectively cancelled scope (started() isn't a checkpoint, after all, so it has no right to raise Cancelled).

to replace

trio/trio/_core/_run.py

Lines 791 to 796 in 4f17d2b

    
           # If the old nursery is cancelled, then quietly quit now; the child 
        
           # will eventually exit on its own, and we don't want to risk moving 
        
           # children that might have propagating Cancelled exceptions into 
        
           # a place with no cancelled cancel scopes to catch them. 
        
           if self._old_nursery._cancel_status.effectively_cancelled: 
        
               return

, i.e. to prevent teleporting an in-flight Cancelled to somewhere in the tree without the correct cancel scope to catch it, have task_status.started raise RuntimeError if it's called while handling Cancelled(s).

don't let task_status.started succeed when called under an effectively cancelled scope. have it raise (Cancelled? RuntimeError? idk).

this is more limiting on users than the prior option. Trio would probably need to make started() raise if the current scope has ever been effectively cancelled (not just if it is currently effectively cancelled) because there could be Cancelleds in sys.exc_info() waiting to raise up.

at the moment i am hesitant about this second option because it feels to me that if the following is working as intended
```
async def foo():
    pass

with CancelScope() as cs:
    cs.cancel()
    async with trio.open_nursery() as nursery:
        nursery.start_soon(foo)
    print("this is printed because no cancel points were passed in foo")

# and the below behaves identically in terms of cancel points (ref:
# https://github.com/python-trio/trio/issues/1457#issuecomment-681091005)
with CancelScope() as cs:
    cs.cancel()
    await foo()
    print("this is printed because no cancel points were passed in foo")
```
then the following should also work
```
async def foo(*, task_status = trio.TASK_STATUS_IGNORED):
    task_status.started()
    await trio.lowlevel.checkpoint()

with CancelScope() as cs:
    cs.cancel()
    async with trio.open_nursery() as nursery:
        await nursery.start(foo)
        print("this is printed because no cancel points were passed in foo's startup")
```
the first option would let this work but the second option would have this raise something instead.

having the guarantee that [await start(afn) won't visit a cancel point if afn's startup code doesn't] seems useful when implementing objects with behavior similar to service nurseries. in such situations, like_a_service_nursery().__aenter__ needs to start the background task(s) unconditionally (even if __aenter__ (and thus started()) is in a cancelled scope) and not cancel them until the async with like_a_service_nursery() block's body exits, because there may be a shield in the body!

here's a proposed patchset for the first option: master...3e87b5c

DECEMBER EDIT: i've rebased that patchset onto master, adapted it for #1696 and cleaned it up. i will open a PR to propose it.

oremanj mentioned this issue Dec 1, 2023

trio_asyncio shielding cancellation causes errors to be silenced. python-trio/trio-asyncio#115

Closed

gschaffner linked a pull request Dec 2, 2023 that will close this issue

Fix TaskStatus.started() being a no-op when in an effectively cancelled scope #2896

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

TaskStatus.started() being a no-op if cancelled can cause deadlocks #2895

TaskStatus.started() being a no-op if cancelled can cause deadlocks #2895

oremanj commented Dec 1, 2023

njsmith commented Dec 2, 2023 via email

gschaffner commented Dec 2, 2023

TaskStatus.started() being a no-op if cancelled can cause deadlocks #2895

TaskStatus.started() being a no-op if cancelled can cause deadlocks #2895

Comments

oremanj commented Dec 1, 2023

njsmith commented Dec 2, 2023 via email

gschaffner commented Dec 2, 2023