fix(retry): Avoid panicking if responses come early #3216

cratelyn · 2024-09-22T22:10:05Z

linkerd-http-retry and linkerd-retry provide generic tower
middleware to allow services to retry requests that fail.

part of this middleware hinges on a ReplayBody that will lazily
buffer request bodies' data for use in subsequent attempts. if a request
fails, the retry middleware will attempt to send another request,
assuming the original body was able to completely fit into this buffer.
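
as a rough illustration of "fit into this buffer", here is a minimal sketch of a size-limited replay buffer. `ReplayBuffer`, `max_bytes`, `push`, and `replay` are hypothetical names for this example, not the actual `ReplayBody` internals; the sketch assumes that a body which exceeds the limit is marked as capped and can no longer be replayed.

```rust
/// Hypothetical buffer that records body chunks for replay, up to `max_bytes`.
/// This is a simplified model, not the real `ReplayBody` implementation.
struct ReplayBuffer {
    chunks: Vec<Vec<u8>>,
    buffered: usize,
    max_bytes: usize,
    is_capped: bool,
}

impl ReplayBuffer {
    fn new(max_bytes: usize) -> Self {
        Self {
            chunks: Vec::new(),
            buffered: 0,
            max_bytes,
            is_capped: false,
        }
    }

    /// Records a chunk as it streams past. If the chunk would push the
    /// buffer over its limit, the buffer is discarded and marked as capped.
    fn push(&mut self, chunk: &[u8]) {
        if self.is_capped {
            return;
        }
        if self.buffered + chunk.len() > self.max_bytes {
            self.is_capped = true;
            self.chunks.clear();
            self.buffered = 0;
            return;
        }
        self.buffered += chunk.len();
        self.chunks.push(chunk.to_vec());
    }

    /// Returns the complete body for another attempt, but only if all of it
    /// fit into the buffer.
    fn replay(&self) -> Option<Vec<u8>> {
        if self.is_capped {
            None
        } else {
            Some(self.chunks.concat())
        }
    }
}
```

in this model, a retry is only possible when `replay()` returns `Some`, which mirrors the condition described above.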

ReplayBody makes a subtle assumption, most succinctly stated in
this excerpt of an internal method's documentation:

// linkerd/http/retry/src/replay.rs
impl<B: Body> ReplayBody<B> {
    /// This panics if another clone has currently acquired the state, based on
    /// the assumption that a retry body will not be polled until the previous
    /// request has been dropped.
    fn acquire_state<'a>(
        state: &'a mut Option<BodyState<B>>,
        shared: &Mutex<Option<BodyState<B>>>,
    ) -> &'a mut BodyState<B> {
        // ...
    }
}

this assumption is slightly at odds with the request/response lifecycle
permitted within the HTTP/2 specification. see RFC 9113 § 8.1, "HTTP
Message Framing" (emphasis added):

An HTTP request/response exchange fully consumes a single stream. A
request starts with the HEADERS frame that puts the stream into the
"open" state. **The request ends with a frame with the END_STREAM flag
set**, which causes the stream to become "half-closed (local)" for the
client and "half-closed (remote)" for the server. A response stream
starts with zero or more interim responses in HEADERS frames, followed
by a HEADERS frame containing a final status code.

An HTTP response is complete after the server sends -- or the client
receives -- a frame with the END_STREAM flag set (including any
CONTINUATION frames needed to complete a field block). **A server can
send a complete response prior to the client sending an entire request
if the response does not depend on any portion of the request that has
not been sent and received.**

https://www.rfc-editor.org/rfc/rfc9113.html#section-8.1-11

because of this, a retry may panic when checking if the previous request
body was capped, if a server delivers a response before the request is
complete. this has been observed when retrying wire-grpc
(https://github.com/square/wire) requests, manifesting in a panic with
this message:

thread 'main' panicked at 'if our `state` was `None`, the shared state must be `Some`', /__w/linkerd2-proxy/linkerd2-proxy/linkerd/http-retry/src/replay.rs:152:22
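
to make the failure mode concrete, here is a simplified model of the state hand-off, written against the standard library only. `ReplayModel`, `clone_for_retry`, and the one-field `BodyState` are hypothetical stand-ins for this sketch; the real type is generic over the body and differs in its details. the sketch assumes that a clone starts without state and that state is handed back to the shared slot on drop.

```rust
use std::sync::{Arc, Mutex};

/// Hypothetical stand-in for the real `BodyState<B>`.
struct BodyState {
    is_capped: bool,
}

/// Simplified model of a replayable body: at most one clone holds the state.
struct ReplayModel {
    state: Option<BodyState>,
    shared: Arc<Mutex<Option<BodyState>>>,
}

impl ReplayModel {
    fn new() -> Self {
        Self {
            state: Some(BodyState { is_capped: false }),
            shared: Arc::new(Mutex::new(None)),
        }
    }

    /// A clone made for a retry starts out without any state of its own.
    fn clone_for_retry(&self) -> Self {
        Self {
            state: None,
            shared: Arc::clone(&self.shared),
        }
    }

    /// Panics if another clone still holds the state, i.e. if the previous
    /// request body has not been dropped yet.
    fn acquire_state(&mut self) -> &mut BodyState {
        let shared = &self.shared;
        self.state.get_or_insert_with(|| {
            // Release the lock before a possible panic so it is not poisoned.
            let taken = shared.lock().unwrap().take();
            taken.expect("if our `state` was `None`, the shared state must be `Some`")
        })
    }
}

impl Drop for ReplayModel {
    /// Dropping a clone hands its state back to the shared slot.
    fn drop(&mut self) {
        if let Some(state) = self.state.take() {
            if let Ok(mut shared) = self.shared.lock() {
                *shared = Some(state);
            }
        }
    }
}
```

if a response arrives while the original body is still alive, calling `acquire_state()` on the retry clone in this model hits the `expect`, which is the same shape as the panic above. once the original is dropped, the clone can acquire the state normally.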

this commit refactors ReplayBody::is_capped() so that it will no
longer panic if there is an outstanding body still being polled.
rather, it will return Some(true) or Some(false) to indicate whether
the previous body was capped, or None if it has not finished streaming.
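
a minimal sketch of that tri-state check, using a free function over the same `(state, shared)` pair that `acquire_state` receives. this is an illustration under assumptions, not the code from this change: `BodyState` is reduced to a single hypothetical field, and a poisoned lock is treated as "unknown".

```rust
use std::sync::Mutex;

/// Hypothetical stand-in for the real `BodyState<B>`.
struct BodyState {
    is_capped: bool,
}

/// Reports whether the body was capped, without panicking.
///
/// - `Some(_)` if this clone holds the state, or the state has been
///   returned to the shared slot.
/// - `None` if another clone still holds the state, i.e. the previous
///   body is still being polled.
fn is_capped(own: Option<&BodyState>, shared: &Mutex<Option<BodyState>>) -> Option<bool> {
    if let Some(state) = own {
        return Some(state.is_capped);
    }
    // Assumption for this sketch: a poisoned lock is reported as `None`.
    let guard = shared.lock().ok()?;
    guard.as_ref().map(|state| state.is_capped)
}
```

returning `Option<bool>` pushes the "not finished yet" case to the caller as a value, instead of encoding it as a panic inside the body type.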

the related logic in the linkerd-http-retry library is updated to
refrain from attempting a retry if a response is received before the
request stream was completed.
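
the caller-side decision can be sketched as follows. `Decision` and `decide` are hypothetical names for this example; the actual policy code in linkerd-http-retry is structured differently. the sketch assumes that `Some(true)` means the body exceeded the replay buffer and cannot be resent.

```rust
/// Hypothetical outcome of the retry decision.
#[derive(Debug, PartialEq, Eq)]
enum Decision {
    Retry,
    DoNotRetry,
}

/// Decides whether to retry, given whether the response is retryable and
/// what is known about the previous request body.
fn decide(response_is_retryable: bool, request_body_capped: Option<bool>) -> Decision {
    match request_body_capped {
        // The whole body was buffered, so it can be replayed.
        Some(false) if response_is_retryable => Decision::Retry,
        // Capped (`Some(true)`), still streaming (`None`), or the response
        // is not retryable: do not attempt another request.
        _ => Decision::DoNotRetry,
    }
}
```

the `None` arm is the new case: a response that arrives before the request stream completes is passed through rather than retried.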

cratelyn · 2024-09-22T22:13:25Z

linkerd/http/retry/src/replay.rs

-    pub fn is_capped(&self) -> bool {
+    pub fn is_capped(&self) -> Option<bool> {


☝️ this is the core of the proposed change.

if there's an outstanding clone still being polled, return None.

linkerd/http/retry/src/lib.rs

olix0r

w00t

#3216 (comment) Co-Authored-By: Oliver Gould <[email protected]> Signed-off-by: katelyn martin <[email protected]>

hawkw · 2024-09-23T22:18:48Z

Nice catch, this one had totally stumped me! :)

cratelyn force-pushed the kate/fix-grpc-retry-panics branch from ae4ec8b to 233f68a Compare September 22, 2024 22:17

cratelyn marked this pull request as ready for review September 22, 2024 22:46

cratelyn requested a review from a team as a code owner September 22, 2024 22:46

cratelyn force-pushed the kate/fix-grpc-retry-panics branch 2 times, most recently from 62c2cf7 to a249faf Compare September 23, 2024 15:07

olix0r reviewed Sep 23, 2024

linkerd/http/retry/src/lib.rs (outdated, resolved)

olix0r approved these changes Sep 23, 2024

cratelyn added a commit that referenced this pull request Sep 23, 2024

refactor(retry): apply review feedback

2305731

#3216 (comment) Co-Authored-By: Oliver Gould <[email protected]> Signed-off-by: katelyn martin <[email protected]>

cratelyn requested a review from olix0r September 23, 2024 21:27

cratelyn and others added 2 commits September 23, 2024 17:27

refactor(retry): apply review feedback

426017d

#3216 (comment) Co-Authored-By: Oliver Gould <[email protected]> Signed-off-by: katelyn martin <[email protected]>

cratelyn force-pushed the kate/fix-grpc-retry-panics branch from 2305731 to 426017d Compare September 23, 2024 21:27

cratelyn merged commit 67ed121 into main Sep 23, 2024
15 checks passed

cratelyn deleted the kate/fix-grpc-retry-panics branch September 23, 2024 21:51

cratelyn removed the request for review from olix0r September 23, 2024 21:51

cratelyn self-assigned this Sep 23, 2024
