Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Centralize exception handling and fix behavior for RequestTimeoutException #3063

Merged
merged 5 commits into from
Aug 4, 2023

Conversation

engechas
Copy link
Collaborator

When a request hits an Armeria server and cannot be processed within the request_timeout, a RequestTimeoutException occurs. Currently this case returns a 503 to the client. This PR fixes that by adding a custom exception handler to the push-based sources that properly handles RequestTimeoutExceptions.

As part of this work, I stumbled upon a couple other issues/quirks of the Armeria servers

  • Requests that are queued in the backend service's BlockingTaskExecutor will still be executed even if the request has already timed out. This adds unnecessary load to the pipeline by trying to process requests that have already returned a 503 (now 408/429 depending on http vs grpc) to the client. Added a check that a request is not timed out before attempting to process it.
  • The HTTP source gives 80% of the request_timeout to write to the buffer. This makes sense as there is additional work necessary after the write finishes that needs to complete before returning a 200 back to the client. The OTEL sources did not follow this pattern and instead gave 100% of the request_timeout to write to the buffer. Adjusted this to be 80% to align with HTTP source. I also noticed that the Armeria requests don't timeout exactly at the request_timeout so there's no apparent benefit in terms of unnecessary work reduction by having the buffer write timeout and request_timeout match as requests that will timeout still get picked up.
  • The OTEL sources have nearly identical exception handling and metric emission. Pulled that out into a central class for code reuse

Check List

  • New functionality includes testing.
  • [N/A] New functionality has been documented.
    • [N/A] New functionality has javadoc added
  • Commits are signed with a real name per the DCO

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Chase Engelbrecht <[email protected]>
Signed-off-by: Chase Engelbrecht <[email protected]>
Signed-off-by: Chase Engelbrecht <[email protected]>
List<OpenTelemetryLog> logs;

if (Context.current().isCancelled()) {
throw new RequestCancelledException("Cancelled by client");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

QUES: Is there reason to favor throwing exception rather than responseObserver.onError?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the request times out at the Armeria level, the request is terminated before the responseObserver has a chance to be populated. I wanted to centralize the logic for exception handling + metrics as well as cover all cases with a single pattern. Throwing the exception and then handling it in the GrpcRequestExceptionHandler allows for that.

The other caveat is that there is an edge case where we end up emitting both a requestTimeout metric and a badRequest/second requestTimeout/internalServerException/successRequest metric if we only handle the Armeria timeouts and continue to have the original metric/error handling here. This happens if we start serving the request before it times out but it times out before the request is fully served.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. I did not realize it is captured by the exception handler.

}

LOG.error("Failed to write the request of size {} due to:", request.toString().length(), e);
throw new BufferWriteException(e.getMessage(), e);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similar question as above

chenqi0805
chenqi0805 previously approved these changes Jul 28, 2023
Objects.requireNonNull(message);
if (e instanceof IOException) {
badRequestsCounter.increment();
return HttpResponse.of(HttpStatus.BAD_REQUEST, MediaType.ANY_TYPE, message);
} else if (e instanceof TimeoutException) {
} else if (e instanceof TimeoutException || e instanceof RequestTimeoutException) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the difference between TimeoutException and RequestTimeoutException?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TimeoutException is thrown by the BlockingBuffer, RequestTimeoutException is thrown by Armeria when the requestTimeout has expired

public class RequestExceptionHandler {
public class HttpRequestExceptionHandler implements ExceptionHandlerFunction {
static final String ARMERIA_REQUEST_TIMEOUT_MESSAGE = "Timeout waiting for request to be served. This is usually due to the buffer being full.";
static final String DEFAULT_MESSAGE = "";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any particular reason this is empty? Should we have something generic like "Internal Server Error"?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure. I just pulled this out into a constant from the original implementation. This empty string could be populated for non-5XX exceptions as well so I'm not sure if there's a good generic message. Open to suggestions

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand you are saying this was the default behavior.

if (Context.current().isCancelled()) {
requestTimeoutCounter.increment();
responseObserver.onError(Status.CANCELLED.withDescription("Cancelled by client").asRuntimeException());
if (ServiceRequestContext.current().isTimedOut()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this the fix you were referring to here?

Requests that are queued in the backend service's BlockingTaskExecutor will still be executed even if the request has already timed out. This adds unnecessary load to the pipeline by trying to process requests that have already returned a 503 (now 408/429 depending on http vs grpc) to the client. Added a check that a request is not timed out before attempting to process it.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes

if (e instanceof RequestTimeoutException) {
message = ARMERIA_REQUEST_TIMEOUT_MESSAGE;
} else {
message = e.getMessage() == null ? DEFAULT_MESSAGE : e.getMessage();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This DEFAULT_MESSAGE could possibly be improved since the handleExceptions method already determines what type it is. So we can probably remove DEFAULT_MESSAGE and derive it from the Status.

Assuming that Status::toString returns a useful string, this could be:

message = e.getMessage() == null ? status.toString() : e.getMessage();

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one's a little trickier. The toString method is

return MoreObjects.toStringHelper(this)
        .add("code", code.name())
        .add("description", description)
        .add("cause", cause != null ? getStackTraceAsString(cause) : cause)
        .toString();

The cause and description are null by default. We could add the whole exception as the cause but I don't think we want to return the stacktrace in the HTTP response.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could add something generic like Unexpected exception encountered: <exception class>. Let me know what your thoughts are

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about just code.name() instead of Status.toString()?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is, status.getCode.name()

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call, will do

public class RequestExceptionHandler {
public class HttpRequestExceptionHandler implements ExceptionHandlerFunction {
static final String ARMERIA_REQUEST_TIMEOUT_MESSAGE = "Timeout waiting for request to be served. This is usually due to the buffer being full.";
static final String DEFAULT_MESSAGE = "";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand you are saying this was the default behavior.

message = cause.getMessage() == null ? DEFAULT_MESSAGE : cause.getMessage();
}

return handleException(cause, message);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the default message issue can be resolved here.

This private handleException method could be refactored to return HttpStatus. All the return values appear to generate the same HttpResponse aside from the status.

Then this line here becomes:

HttpStatus status = handleException(cause, /*message no longer needed*/);
return HttpResponse.of(status, MediaType.ANY_TYPE, cause.getMessage() == null ? status.toString() : cause.getMessage());

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. Will refactor

Signed-off-by: Chase Engelbrecht <[email protected]>
Copy link
Member

@dlvenable dlvenable left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@engechas engechas merged commit 259fea1 into opensearch-project:main Aug 4, 2023
24 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants