[ Cherry-Pick ] Fix stuck rebuilds and stuck nexus subsystems #1721

When the rebuild backend is dropped, we must also drain the async channel. This covers a corner case where a message is sent at the same time as the backend is being dropped, in which case the message would hang. This does not hang forever in production, where timeouts eventually cancel the future and allow the drop, but it can still lead to timeouts and confusion.

Signed-off-by: Tiago Castro <[email protected]>
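The drain-on-drop idea can be sketched as follows. This is a minimal illustration only, not the actual change: `Backend`, `Request` and `send_then_drop` are hypothetical names, and a `std::sync::mpsc` channel stands in for the async channel so that the sketch stays self-contained.

```rust
use std::sync::mpsc::{channel, Receiver, Sender};

/// Hypothetical request type: each message carries its own reply channel.
pub struct Request {
    pub reply: Sender<Result<u64, &'static str>>,
}

/// Hypothetical stand-in for the rebuild backend (assumption, not the real type).
pub struct Backend {
    requests: Receiver<Request>,
}

impl Backend {
    /// Creates the backend together with the sender used to reach it.
    pub fn new() -> (Sender<Request>, Backend) {
        let (tx, rx) = channel();
        (tx, Backend { requests: rx })
    }
}

impl Drop for Backend {
    fn drop(&mut self) {
        // Drain every message that raced with the drop and answer it
        // explicitly, so that no caller is left waiting for a reply.
        while let Ok(request) = self.requests.try_recv() {
            // The caller may already be gone; ignoring the send error is fine.
            let _ = request.reply.send(Err("backend dropped"));
        }
    }
}

/// Queues a request, drops the backend, and returns what the caller observes.
pub fn send_then_drop() -> Result<u64, &'static str> {
    let (tx, backend) = Backend::new();
    let (reply_tx, reply_rx) = channel();
    tx.send(Request { reply: reply_tx })
        .expect("the receiver is still alive at this point");
    drop(backend);
    reply_rx.recv().expect("the drain always delivers a reply")
}
```

In this sketch the queued request is answered with an error during the drop instead of being left unanswered, which is the behaviour the commit message describes wanting for the real async channel.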

This seems to have been mistakenly added as ms. In practice this caused no harm, as the value is not currently overridden by the helm chart.

Signed-off-by: Tiago Castro <[email protected]>

When we are pausing the nexus, all IO must be flushed before the subsystem pause completes. If we can't flush the IO, the pause is stuck forever.

The issue we have seen is that when IOs are stuck, nothing can fail them and allow the pause to complete. One way this can happen is when the controller has failed, as it seems that in this case the IO queues are not being polled.

A first fix is to piggyback on the adminq polling failure and use it to drive the removal of the failed child devices from the nexus per-core channels. A better approach might be needed in the future to time out the IOs even when no completions are processed in a given I/O qpair.

Signed-off-by: Tiago Castro <[email protected]>
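The interaction between stuck IOs and the pause can be illustrated with a toy model. This is a sketch under stated assumptions, not the nexus implementation: `CoreChannel`, `Nexus`, `on_admin_poll_failure`, `can_pause` and `demo` are all hypothetical names, and in-flight IOs are modelled as plain counters per child device.

```rust
use std::collections::HashMap;

/// Hypothetical per-core channel: counts in-flight IOs per child device.
pub struct CoreChannel {
    inflight: HashMap<String, usize>,
}

impl CoreChannel {
    pub fn new() -> Self {
        CoreChannel { inflight: HashMap::new() }
    }

    /// Records one IO submitted to the given child.
    pub fn submit(&mut self, child: &str) {
        *self.inflight.entry(child.to_string()).or_insert(0) += 1;
    }

    /// Removes the child from this channel, failing all of its in-flight IOs.
    /// Returns how many IOs were failed.
    pub fn remove_child(&mut self, child: &str) -> usize {
        self.inflight.remove(child).unwrap_or(0)
    }

    /// Number of IOs still in flight on this channel.
    pub fn pending(&self) -> usize {
        self.inflight.values().sum()
    }
}

/// Hypothetical nexus holding one channel per core.
pub struct Nexus {
    pub channels: Vec<CoreChannel>,
}

impl Nexus {
    /// Called when admin queue polling fails for a child (assumed hook):
    /// the child is removed from every per-core channel.
    pub fn on_admin_poll_failure(&mut self, child: &str) -> usize {
        self.channels.iter_mut().map(|c| c.remove_child(child)).sum()
    }

    /// The pause can only complete once no IO is in flight anywhere.
    pub fn can_pause(&self) -> bool {
        self.channels.iter().all(|c| c.pending() == 0)
    }
}

/// Returns (can pause before, IOs failed, can pause after).
pub fn demo() -> (bool, usize, bool) {
    let mut nexus = Nexus { channels: vec![CoreChannel::new(), CoreChannel::new()] };
    nexus.channels[0].submit("child-a");
    nexus.channels[0].submit("child-a");
    nexus.channels[1].submit("child-a");
    let before = nexus.can_pause();
    let failed = nexus.on_admin_poll_failure("child-a");
    (before, failed, nexus.can_pause())
}
```

In this model the pause is blocked while the failed child still owns in-flight IOs, and removing the child from every per-core channel is what unblocks it. The model does not cover the follow-up mentioned above, where IOs would need a timeout of their own.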

Commits on Aug 14, 2024

Commits on Aug 16, 2024

Commits on Aug 27, 2024