Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Set span error only for 5xx response range #1196

Merged
merged 2 commits into from
Oct 23, 2024

Conversation

j15e
Copy link
Contributor

@j15e j15e commented Oct 19, 2024

👋 Hello!

I ran into a very specific problem with the Rack instrumentation.

When a Rails Action Cable requests get traced by the Rack instrument, the response code returned by Rails is -1 and the body is empty as defined in Rack Hijacking spec.

In a nutshell, Rack hijacking exists to allow an app to take over the response streaming object to do special things like switching protocol to a web socket (just like Action Cable does).

This currently makes the Rack instrumentation add an error on the span because -1 is (rightfully) outside the defined good http statues (100..499).


def add_response_attributes(span, response)
span.status = OpenTelemetry::Trace::Status.error unless GOOD_HTTP_STATUSES.include?(response.status.to_i)
attributes = extract_response_attributes(response)
span.add_attributes(attributes)
rescue StandardError => e
OpenTelemetry.handle_error(exception: e)

This change will simply ignore the response object when the env marked the request as hijacked, which I think is was a Rack user would expect in this case.

Special thanks to @arielvalentin for pointing me to the Rack detailed spec when I initially asked the question about this problem in Slack just here.

Copy link

linux-foundation-easycla bot commented Oct 19, 2024

CLA Signed

The committers listed above are authorized under a signed CLA.

  • ✅ login: j15e / name: Jean-Philippe Doyle (34982e4)
  • ✅ login: arielvalentin / name: Ariel Valentin (a94d993)

@j15e j15e force-pushed the fix/rack-hijacked-response branch 2 times, most recently from 74f912c to 6acd6ee Compare October 21, 2024 20:22
@j15e j15e changed the title Handle Rack hijacking interface feat: Handle Rack hijacking interface Oct 21, 2024
Copy link
Contributor

@kaylareopelle kaylareopelle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you so much for this PR and for your great research on rack.hijack! 🚀

@@ -109,7 +109,7 @@ def on_finish(request, response)
span = OpenTelemetry::Instrumentation::Rack.current_span
return unless span.recording?

add_response_attributes(span, response) if response
add_response_attributes(span, response) if response && !request.env.key?('rack.hijack?')
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@j15e The more I think about your comments in slack, the more I agree with this implementation:

I see, so maybe changing the status check to > 500 would be a simpler and more pragmatic fix, is that what you think?1

When we take a closer look at the spec:

Span Status MUST be left unset if HTTP status code was in the 1xx, 2xx or 3xx ranges, unless there was another error (e.g., network error receiving the response body; or 3xx codes with max redirects exceeded), in which case status MUST be set to Error.
...
For HTTP status codes in the 5xx range, as well as any other code the client failed to interpret, span status SHOULD be set to Error.

I interpret that to mean that the code should only set the status to error in cases where the response is in the 5xx ranges . There are other non-5xx cases that the code isn't currently accounting for but the spec isn't terribly explicit about it other than the case of the Max Redirects Exceeded, which is usually a client side constraint.

Given the description the spec I think it is safe to change the code to look for Bad HTTP status codes in the 5xx range and leaving the span status unset in all other cases.

This will remove our explicit checks for the Rack Hijack implementation and allow the response context propagators to execute.

Footnotes

  1. https://cloud-native.slack.com/archives/C01NWKKMKMY/p1729384339814629?thread_ts=1729277160.681719&cid=C01NWKKMKMY

This also avoids setting span in error when a Rails ActionCable hijacked response returns a -1 status code
@j15e j15e force-pushed the fix/rack-hijacked-response branch from 6acd6ee to 34982e4 Compare October 23, 2024 00:44
@@ -208,7 +207,7 @@ def detach_context(request)
end

def add_response_attributes(span, response)
span.status = OpenTelemetry::Trace::Status.error unless GOOD_HTTP_STATUSES.include?(response.status.to_i)
span.status = OpenTelemetry::Trace::Status.error if response.server_error?
Copy link
Contributor Author

@j15e j15e Oct 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found out that here we have access to the final Rack::Response object so this neat server_error? method can be re-used (not the case in the middleware instrument version though)

@j15e j15e changed the title feat: Handle Rack hijacking interface feat: Set span error only for 5xx response range Oct 23, 2024
@j15e
Copy link
Contributor Author

j15e commented Oct 23, 2024

Thanks for your kind and detailed feedback! I adjusted the PR to change the way the status code is set (only for the 5xx response code range) rather than make a special case for the rack.hijacked? flag (that felt unclean to me too!).

Copy link
Contributor

@kaylareopelle kaylareopelle left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me, thank you! I learned a lot during this review 😄

@arielvalentin arielvalentin enabled auto-merge (squash) October 23, 2024 19:33
@arielvalentin arielvalentin merged commit 5c647b5 into open-telemetry:main Oct 23, 2024
57 checks passed
@j15e j15e deleted the fix/rack-hijacked-response branch October 23, 2024 19:35
@github-actions github-actions bot mentioned this pull request Oct 23, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants