Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(api): Pause when pick_up_tip() errors in a Python protocol #14753

Merged
merged 25 commits into from
Apr 9, 2024

Conversation

SyntaxColoring
Copy link
Contributor

@SyntaxColoring SyntaxColoring commented Mar 28, 2024

Overview

Closes EXEC-346.

Test Plan

Setup

Enable the hidden error recovery feature flag by manually issuing a POST /settings request with {"id": "enableErrorRecoveryExperiments", "value": true}.

Test

Upload and run this protocol.

from opentrons import protocol_api

requirements = {"robotType": "Flex", "apiLevel": "2.17"}


def run(protocol: protocol_api.ProtocolContext) -> None:
    protocol.load_trash_bin("A3")

    tip_rack = protocol.load_labware("opentrons_flex_96_tiprack_50ul", "C2")
    reservoir = protocol.load_labware(
        "opentrons_96_wellplate_200ul_pcr_full_skirt", "C3"
    )
    pipette = protocol.load_instrument(
        "flex_1channel_50", mount="left", tip_racks=[tip_rack]
    )

    for source, dest in zip(reservoir.columns()[0], reservoir.columns()[1]):
        pipette.pick_up_tip()
        pipette.move_to(source.top())
        pipette.move_to(dest.top())
        pipette.drop_tip()
  • When it hits a missing spot in the tip rack, it should pause. From there, you should be able to issue a stop or resume-from-recovery action to POST /runs/{id}/actions.
  • If you do resume-from-recovery, the pick_up_tip() calls later on in the protocol should continue from the next wells in the tip rack. It should not repeatedly try to pick up tips from the same well.

Cleanup

POST /settings with {"id": "enableErrorRecoveryExperiments", "value": false}.

Changelog

  • We currently have some machinery to send a command to ProtocolEngine and block until the command finishes executing. The main part of this PR is adding variants of that to also wait for the error recovery to finish. We reimplement ProtocolContext.pick_up_tip(), specifically, with that new variant. Other commands will happen later.
  • The tip handler in ProtocolEngine is modified so that when a pickUpTip fails recoverably, it marks the requested tips as used. This is one way of getting the next PAPI pick_up_tip() to automatically move on to the next tip. See discussion below.

Review requests

See my review comments.

Risk assessment

Low-risk when the enableErrorRecoveryExperiments feature flag is off. Medium-risk if it's on, since this starts venturing further into territory where Protocol Engine's understanding of the protocol history and state is different from PAPI's understanding of the protocol history and state.

Copy link

codecov bot commented Mar 28, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 67.24%. Comparing base (65885b2) to head (6b79625).
Report is 24 commits behind head on edge.

❗ Current head 6b79625 differs from pull request most recent head cf1e936. Consider uploading reports for the commit cf1e936 to get more accurate results

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             edge   #14753      +/-   ##
==========================================
- Coverage   67.24%   67.24%   -0.01%     
==========================================
  Files        2495     2495              
  Lines       71254    71253       -1     
  Branches     8937     8937              
==========================================
- Hits        47918    47917       -1     
  Misses      21235    21235              
  Partials     2101     2101              
Flag Coverage Δ
g-code-testing 92.43% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Coverage Δ
...i/src/opentrons/protocol_engine/actions/actions.py 100.00% <ø> (ø)
...c/opentrons/protocol_engine/clients/sync_client.py 100.00% <ø> (ø)
...rc/opentrons/protocol_engine/clients/transports.py 100.00% <ø> (ø)
...rons/protocol_engine/execution/command_executor.py 100.00% <ø> (ø)
...pentrons/protocol_engine/execution/queue_worker.py 100.00% <ø> (ø)
...i/src/opentrons/protocol_engine/protocol_engine.py 100.00% <ø> (ø)
...pi/src/opentrons/protocol_engine/state/commands.py 99.18% <ø> (ø)
api/src/opentrons/protocol_engine/state/state.py 100.00% <ø> (ø)
...opentrons/protocol_runner/legacy_command_mapper.py 98.21% <ø> (ø)

... and 4 files with indirect coverage changes

@DerekMaggio
Copy link
Contributor

@SyntaxColoring, #14748 contains some error recovery protocols if you need any test cases.

Comment on lines 245 to 247
async def add_and_execute_command_wait_for_recovery(
self, request: commands.CommandCreate
) -> None:
Copy link
Contributor Author

@SyntaxColoring SyntaxColoring Apr 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm open to ideas for better names for this.

Or maybe we should make it a wait_for_recovery: bool argument on the existing add_and_execute() method.

@SyntaxColoring SyntaxColoring changed the title feat(api): Pause-on-error for Python protocols feat(api): Pause when pick_up_tip() errors in a Python protocol Apr 1, 2024
api/src/opentrons/protocol_engine/state/commands.py Outdated Show resolved Hide resolved
api/src/opentrons/protocol_api/core/engine/instrument.py Outdated Show resolved Hide resolved

# FIX BEFORE MERGE: We should probably only set_last_location() if the
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I disagree - I think there are three answers each of which is honestly reasonable, but one which is the best (and also hardest). As a preamble, we should probably push last-location down to the engine.

  1. the best and hardest is, base the decision of setting last location on when the error occurred (during which action) and what the error is. For instance, if you got a stall while moving to the tiprack, we should clear the last location since position is now indeterminate; but if we failed our tip-state consistency check after the pickup, we should set the last location to the rack.
  2. the simplest and most robust is, always clear the last location on any error
  3. and a slightly more optimistic version is to always set it on any error since the most likely causes of failure will in fact leave the gantry in the specified location

Copy link
Contributor Author

@SyntaxColoring SyntaxColoring Apr 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After looking at this closer, I think it's appropriate to call this set_last_location() unconditionally. See this new comment:

self._engine_client.pick_up_tip_wait_for_recovery(
pipette_id=self._pipette_id,
labware_id=labware_id,
well_name=well_name,
well_location=well_location,
)
# Set the "last location" unconditionally, even if the command failed
# and was recovered from and we don't know if the pipette is physically here.
# This isn't used for path planning, but rather for implicit destination
# selection like in `pipette.aspirate(location=None)`.
self._protocol_core.set_last_location(location=location, mount=self.get_mount())

Comment on lines +311 to +312
# FIXME(mm, 2024-04-02): As of https://github.com/Opentrons/opentrons/pull/14726,
# this action is only legal if the command is running, not queued.
Copy link
Contributor Author

@SyntaxColoring SyntaxColoring Apr 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is my fault; in #14726, I neglected to account for what this estop() method was doing. I'm going to have to fix this in another PR. EXEC-382

Copy link
Contributor Author

@SyntaxColoring SyntaxColoring Apr 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a lot of this estop() method is due for a rethink. Like, I don't really get why it needs to be so different from stop(), and why it needs to be messing with things like FailCommandActions itself.

@SyntaxColoring SyntaxColoring marked this pull request as ready for review April 8, 2024 15:57
@SyntaxColoring SyntaxColoring requested a review from a team as a code owner April 8, 2024 15:57
Copy link
Member

@sfoster1 sfoster1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good overall, but a couple points:

  • I'm a bit concerned about the "last command to fail" logic in the command store. Can we instead save the command id when it fails recoverably and we enter AWAITING_RECOVERY and make that state more explicit?
  • Something feels slightly off about the checks in ProtocolEngine.add_and_execute_command_wait_for_recovery since they check two different pieces of state in tight sequence

# which remains `queued` because of the pause.
# 3. The engine is stopped. The returned command will be `queued`
# and won't have a result.
raise ProtocolCommandFailedError(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's make this its own unique exception type and code so that we can programmatically differentiate it from other errors

Copy link
Contributor Author

@SyntaxColoring SyntaxColoring Apr 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 97b0ada.

api/src/opentrons/protocol_engine/protocol_engine.py Outdated Show resolved Hide resolved
@@ -708,6 +708,15 @@ def get_all_commands_final(self) -> bool:

return no_command_running and no_command_to_execute

def get_recovery_in_progress_for_command(self, command_id: str) -> bool:
"""Return whether the given command failed and its error recovery is in progress."""
command_is_most_recent_to_fail = (
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a little gross to me, could we save the command we're currently recovering from explicitly? like, what happens if a fixit command fails, wouldn't it instantly make this False?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed in a880096.

api/src/opentrons/protocol_engine/state/tips.py Outdated Show resolved Hide resolved
queued_command = self.add_command(request)
await self.wait_for_command(command_id=queued_command.id)
completed_command = self._state_store.commands.get(queued_command.id)
await self._state_store.wait_for_not(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it kind of feels like we want to gate this on a synchronous call to whether we're now in recovery for the command, but this in general feels a little race condition-y because of the separation between command state and run state. Specifically,

  • wait_for_command returns on the asyncio spin after a FailCommandAction or a SucceedCommandAction for this command (we can neglect the queued-and-stopping part for now)
  • but the get_recovery_in_progress_for_command predicate is based on the queue status being awaiting-recovery

Do we guarantee mechanically that the queue status will be set before the asyncio spin after the command terminal action is dispatched? Are we sure this won't occasionally race and return early?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, if I understand your concern correctly:

When await self.wait_for_command(command_id=queued_command.id) returns, we are guaranteed that the action that finalized the command has already been fully processed, and that get_recovery_in_progress_for_command() will see its results on the state.

When we handle an action, we send it to each of the substores in a loop. Only after that's done do we notify subscribers like this one.

@@ -167,6 +167,7 @@ async def execute(self, command_id: str) -> None:
FailCommandAction(
error=error,
command_id=running_command.id,
running_command=running_command,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess I am wondering why we need this in the action? dont we have the failed command stored in PE already?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a hack, we probably don't need it. See this comment:

# This is a quick hack so FailCommandAction handlers can get the params of the
# command that failed. We probably want this to be a new "failure details"
# object instead, similar to how succeeded commands can send a "private result"
# to Protocol Engine internals.

Copy link
Contributor

@TamarZanzouri TamarZanzouri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

overall looks good, added a question and a bit confused about the wait_for_not although I get what its doing but the logic around it is a bit confusing to me. besides that looks great!

Comment on lines +181 to +183
recovery_target_command_id: Optional[str]
"""If we're currently recovering from a command failure, which command it was."""

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding this as an orthogonal attribute is quick and dirty. Ideally, this ID would not exist at all when QueueStatus is anything other than AWAITING_RECOVERY. We could do that by retructuring this state so it's closer to a union of dataclasses instead of a dataclass of unions.

Copy link
Member

@sfoster1 sfoster1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, thank you for the changes!

@SyntaxColoring SyntaxColoring merged commit 1819b8c into edge Apr 9, 2024
20 checks passed
@SyntaxColoring SyntaxColoring deleted the papi_pause_on_error branch April 9, 2024 15:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants