
Retry platform-level errors in the isolated process for .NET isolated #2922

Merged
merged 16 commits into dev on Oct 3, 2024

Conversation


@davidmrdavid davidmrdavid commented Oct 1, 2024

Supersedes: #2902
Joint work with @andystaples.

Problem: In .NET isolated today, platform-level errors in the isolated process, like OOMs, timeouts, and sudden process exits, are not automatically retried by DF. Instead, they cause the orchestrator to be marked as failed. This is an unintended behavior change for .NET isolated. This PR aims to bring back robustness against those errors.

Most of the changes related to this exist in the OutOfProcMiddleware and RemoteOrchestrationContext files.

"Extra scope": In trying to add automatic E2E tests to validate these improvements, I had to go down a rabbit hole of .NET isolated testing issues in our smoke testing infrastructure today.

Long story short, our Docker-based test runner for .NET isolated wasn't validating runtime correctness (it was passed the -NoValidation flag) because of intermittent race conditions when publishing .NET isolated apps to Docker. I didn't get far enough into debugging this to root-cause the issue (@jviau tells me that upgrading the SDK may solve it automatically), but I spent enough hours on it to conclude that, for the sake of unblocking this item, it would be practical to abandon Docker-based testing for .NET isolated and instead use the GitHub Actions CI environment directly.

Therefore, this PR changes the definition of the .NET isolated smoke testing GitHub action to instead run the tests directly on the "ubuntu-latest" image. It also introduces a new test runner powershell script that's well-suited for simulating platform-level errors. I hope the comments in the code itself clarify the intent, but please let me know if not.

Ask for reviewers: the key shippable changes here are in OutOfProcMiddleware and RemoteOrchestrationContext . Please look at those carefully. I've added comments explaining some of the quirky bits.

@andystaples andystaples left a comment


Left minor suggestion, otherwise, looks good

// From experience, this code runs in `<sourceCodePath>/bin/output/`, so we store the file two directories above.
// We do this because the /bin/output/ directory gets overridden during the build process, which happens automatically
// when `func host start` is re-invoked.
string evidenceFile = System.IO.Path.Combine(System.IO.Directory.GetCurrentDirectory(), "..", "..", "replayEvidence");
Contributor:

Would suggest giving these replayEvidence files test-specific names to avoid conflicts should multiple retry tests run simultaneously. Even if today they are run sequentially, this may not always be true

Contributor Author:

You're right, there's a pending improvement here. I chose not to do this for now because, as written today, the sequential checks protect us, and we have some pressure to release. But I'm in favor of making this more robust long-term.

@davidmrdavid davidmrdavid changed the title [WIP] Second test of transient platformnotsupported exceptions Retry platform-level errors in the isolated process for .NET isolated Oct 1, 2024
Comment on lines +14 to +15
"maxQueuePollingInterval": "00:00:01",
"controlQueueVisibilityTimeout": "00:01:00"
Contributor Author:

This just helps make sure we retry orchestrator replays quickly enough after a platform-level error.

}
}
},
"functionTimeout": "00:00:30"
Contributor Author:

We don't time out sooner because our OOM-based test needs time to run out of memory.

Comment on lines +174 to +177
|| (functionResult.Exception?.InnerException?.GetType().ToString().Contains("WorkerProcessExitException") ?? false))
{
// TODO: the `WorkerProcessExitException` type is not exposed in our dependencies, it's part of WebJobs.Host.Script.
// Should we add that dependency or should it be exposed in WebJobs.Host?
Contributor Author:

I don't love the string-based comparison, but it will stay string-based unless we add WebJobs.Host.Script to our dependencies. Open to a discussion here, but please note we have a customer waiting, so let's be practical. I have no strong feelings; I'm just trying to deploy this fix.

Contributor:

WebJobs.Host.Script is not available for public consumption. We will need to work with the host team / WebJobs SDK to expose this information to extensions.

@davidmrdavid davidmrdavid marked this pull request as ready for review October 1, 2024 22:58
@cgillum cgillum left a comment


This is a great PR! I have some questions and comments.

@@ -71,15 +71,15 @@ internal bool TryGetOrchestrationErrorDetails(out Exception? failure)
return hasError;
}

internal void SetResult(IEnumerable<OrchestratorAction> actions, string customStatus)
internal void TrySetResult(IEnumerable<OrchestratorAction> actions, string customStatus)
Collaborator:

Non-blocking nit: Any method named TryXXX is usually expected to return a bool value indicating whether it succeeded or failed.
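For contrast, the conventional Try-pattern shape (a hypothetical sketch, not code from this PR) returns a bool and reports failure through an out parameter rather than throwing:

```csharp
// Conventional Try-pattern: the bool return signals success/failure, and the
// failure details travel through an out parameter instead of an exception.
internal bool TrySetResult(
    IEnumerable<OrchestratorAction> actions,
    string customStatus,
    out Exception? failure)
{
    try
    {
        SetResult(actions, customStatus); // throwing counterpart (hypothetical)
        failure = null;
        return true;
    }
    catch (Exception ex)
    {
        failure = ex;
        return false;
    }
}
```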

Contributor Author:

Yes, you're right. I chose Try to signal that the method may throw, but I know the convention makes this confusing. I'll revert this change.

Contributor Author:

fixed here: 090f8e7

{
if (failureDetails.InnerFailure?.IsCausedBy<OutOfMemoryException>() ?? false)
{
throw new OutOfMemoryException(failureDetails.ErrorMessage);
Collaborator:

Manually throwing OutOfMemoryException is a bit of a red flag. It's really only supposed to be thrown by the runtime. It's partly problematic because there are various pieces of code in .NET and elsewhere that treat OOM and similar "fatal" exceptions differently. In some cases, I think it's supposed to crash the process.

I strongly recommend we throw something else, like SessionAbortedException, which is designed for this exact scenario.
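A minimal sketch of that suggestion (assuming DurableTask.Core's SessionAbortedException is in scope here; illustrative, not the PR's final code):

```csharp
// Surface the platform-level failure as a SessionAbortedException so the
// Durable Task framework abandons the work item and retries it later,
// instead of re-throwing OutOfMemoryException, which parts of .NET treat
// as a fatal, process-crashing error.
if (failureDetails.InnerFailure?.IsCausedBy<OutOfMemoryException>() ?? false)
{
    throw new SessionAbortedException(failureDetails.ErrorMessage);
}
```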

Contributor Author:

Thanks for the pointer. I'm now throwing SessionAbortedException, see: 090f8e7

@@ -475,6 +475,24 @@
</summary>
<param name="result">The result.</param>
</member>
<member name="M:Microsoft.Azure.WebJobs.Extensions.DurableTask.RemoteOrchestratorContext.ThrowIfPlatformLevelException(DurableTask.Core.FailureDetails)">
Collaborator:

Any idea why this is showing up in our public documentation? I thought these APIs were internal?

Contributor Author:

My guess is that since I added method-level comments to the private method, they made it into the XML. I think this is probably just to power IntelliSense? Not sure, just a guess.

Collaborator:

It's not just IntelliSense, but also the public API docs that we publish, like this: https://learn.microsoft.com/en-us/dotnet/api/microsoft.azure.webjobs.extensions.durabletask?view=azure-dotnet. If it's not meant to be published, then we should remove it from this XML file. Might be worth doing a bit of research into when/why non-public APIs are getting included in this file.

Contributor Author:

Seems that adding the method-level comments was the culprit.
I removed them (moved them to the method body), and the xml change was undone: c927268

}
}

// assuming the orchestrator survived the OOM, delete the evidence file and return
Collaborator:

I'm not sure I understand this comment, and it's making me question whether I understand what this test is supposed to be doing. I'll ask simply: why would we expect an orchestrator to survive an OOM? It sounds like you're saying there was an OOM but it was handled and we kept running in spite of the OOM. Are you actually saying it recovered from a worker process crash (caused by the OOM)? Maybe OOM implies crash, but it's not 100% clear that's what you meant.

Contributor Author:

You're right, the wording here was not great.

I do not expect an orchestration invocation to continue running beyond an OOM.

What I meant by "survive" was:

  • orchestrator invocation A runs
  • invocation A OOMs
  • the DF extension abandons the invocation
  • the DF extension retries the orchestrator invocation (now invocation B; every invocation is unique)
  • invocation B does not OOM this time, and completes

So it "survived" because it was retried.

I'm hoping the new comments here make this clearer: a70d014
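Condensed, the evidence-file pattern the test uses to observe this retry looks roughly like this (a sketch assembled from the snippets in this PR, with the same `replayEvidence` path as the test code):

```csharp
// First attempt vs. retried attempt are distinguished by an on-disk marker
// stored outside /bin/output/, which rebuilds would otherwise wipe.
string evidenceFile = Path.Combine(Directory.GetCurrentDirectory(), "..", "..", "replayEvidence");
if (!File.Exists(evidenceFile))
{
    // First invocation (A): leave evidence that we ran, then simulate a
    // platform-level failure. The worker process dies here.
    File.Create(evidenceFile).Close();
    Environment.FailFast("Simulating crash!");
}

// Retried invocation (B): the evidence file proves attempt A crashed.
// Clean up and complete normally; the orchestration "survived" via retry.
File.Delete(evidenceFile);
```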

Environment.FailFast("Simulating crash!");
}

// assuming the orchestrator survived the OOM, delete the evidence file and return
Collaborator:

Does this comment need to be updated?

Contributor Author:

Good catch, this is the result of copy-pasting.
Updated: a70d014


// Function input comes from the request content.
string instanceId = await client.ScheduleNewOrchestrationInstanceAsync(
nameof(TimeoutOrchestrator));
Collaborator:

Rather than replicating the same HTTP function once for each scenario, it might make more sense to have a single HTTP trigger that takes an orchestrator function name as a parameter.
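One possible shape for that shared trigger, sketched against the .NET isolated programming model (names here are illustrative, not from this PR):

```csharp
[Function("StartOrchestrator")]
public static async Task<HttpResponseData> StartOrchestrator(
    [HttpTrigger(AuthorizationLevel.Anonymous, "get", Route = "start/{orchestratorName}")] HttpRequestData req,
    [DurableClient] DurableTaskClient client,
    string orchestratorName)
{
    // The orchestrator name comes from the route, so one endpoint covers every scenario.
    string instanceId = await client.ScheduleNewOrchestrationInstanceAsync(orchestratorName);
    return await client.CreateCheckStatusResponseAsync(req, instanceId);
}
```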

Contributor Author (@davidmrdavid, Oct 2, 2024):

I don't feel strongly here, but my personal preference is to have specific HTTP endpoints for each orchestrator. When testing locally, it saves me time to just copy-paste the entire URL to trigger the orchestrator.

But it's no big deal. If you feel strongly, I'll generalize it. Just noting this was a deliberate choice.

Collaborator:

I don't feel strongly so I'll leave it up to you, but it would enable all the usual benefits of code sharing (easier to add new tests, better consistency b/c only one place to make updates, etc.)


jviau commented Oct 1, 2024

Summarizing some offline discussion:

  1. Handling timeouts, worker crashes, and host out of memories is good
  2. We should not handle worker out-of-memory. This is language-dependent, and each language should handle OOM in its own way
    • This should be left for future work. We will need to see if there is an existing mechanism for a language worker to signal an orchestration should be retried. If not, that mechanism will need to be added.
  3. I will follow up on the host side to see if there is an existing way to detect worker crashes or other platform-level failures (instead of checking exception type by string). If not, we can file an issue for that ask (and proceed with the string type check for now)

@davidmrdavid (Contributor Author):

Thanks @jviau. I'll delegate this bit to you then:

I will follow up on the host side to see if there is an existing way to detect worker crashes or other platform-level failures (instead of checking exception type by string). If not, we can file an issue for that ask (and proceed with the string type check for now)

As for the rest, I'll see what we can do. OOM is the key case we're trying to unblock via this PR (the rest are related "freebies"). @andystaples will start an internal thread about this.


jviau commented Oct 3, 2024

I think all of the points I brought up can also be addressed after this PR, but it is something we should commit to doing ASAP.

  1. Adopting a more 'official' way to identify platform issues / worker crashes
  2. Implementing/consuming a way for workers to elect to abandon a session (which they can then use for OOM scenarios)

// - a worker process exit
if (functionResult.Exception is Host.FunctionTimeoutException
|| functionResult.Exception?.InnerException is SessionAbortedException // see RemoteOrchestrationContext.TrySetResultInternal for details on OOM-handling
|| (functionResult.Exception?.InnerException?.GetType().ToString().Contains("WorkerProcessExitException") ?? false))
Contributor:

Did you verify that WorkerProcessExitException is actually available here? Looking at the function host code, I don't see the exception actually thrown: https://github.com/Azure/azure-functions-host/blob/1088f24c3ae3a6275f18cbf091fa525c2477be91/src/WebJobs.Script/Workers/ProcessManagement/WorkerProcess.cs#L179-L184

Contributor Author:

Yes, that's how we stumbled upon this error type in the first place. We found this error by forcing a process exit (Environment.FailFast) and placing a breakpoint at this position.

But yeah I admit that I also don't see the explicit throw here. We can investigate further, or I can show you the runtime behavior through the debugger.


@davidmrdavid davidmrdavid merged commit 8a5db71 into dev Oct 3, 2024
14 checks passed
@davidmrdavid davidmrdavid deleted the dajusto/add-failed-orch-tests branch October 3, 2024 21:32