
Retry platform-level errors in the isolated process for .NET isolated #2922

Merged
merged 16 commits into dev on Oct 3, 2024

Conversation


@davidmrdavid davidmrdavid commented Oct 1, 2024

Supersedes: #2902
Joint work with @andystaples.

Problem: In .NET isolated today, platform-level errors in the isolated process, like OOMs, timeouts, and sudden process exits, are not automatically retried by DF. Instead, they cause the orchestrator to be marked as failed. This is an unintended behavior change for .NET isolated. This PR aims to bring back robustness against those errors.

Most of the changes related to this exist in the OutOfProcMiddleware and RemoteOrchestrationContext files.

"Extra scope": In trying to add automatic E2E tests to validate these improvements, I had to go down a rabbit hole of .NET isolated testing issues in our smoke testing infrastructure today.

Long story short, our Docker-based test runner for .NET isolated wasn't validating runtime correctness (it was passed the -NoValidation flag) because of intermittent race conditions when publishing .NET isolated apps to Docker. I didn't get far enough into debugging this to root-cause the issue (@jviau tells me that upgrading the SDK may solve it automatically), but I spent enough hours on it to conclude that, for the sake of unblocking this item, it would be practical to abandon Docker-based testing for .NET isolated and instead use the GitHub Actions CI environment directly.

Therefore, this PR changes the definition of the .NET isolated smoke testing GitHub action to instead run the tests directly on the "ubuntu-latest" image. It also introduces a new test runner powershell script that's well-suited for simulating platform-level errors. I hope the comments in the code itself clarify the intent, but please let me know if not.

Ask for reviewers: the key shippable changes here are in OutOfProcMiddleware and RemoteOrchestrationContext . Please look at those carefully. I've added comments explaining some of the quirky bits.

@andystaples andystaples left a comment


Left minor suggestion, otherwise, looks good

// From experience, this code runs in `<sourceCodePath>/bin/output/`, so we store the file two directories above.
// We do this because the /bin/output/ directory gets overridden during the build process, which happens automatically
// when `func host start` is re-invoked.
string evidenceFile = System.IO.Path.Combine(System.IO.Directory.GetCurrentDirectory(), "..", "..", "replayEvidence");
Contributor:

Would suggest giving these replayEvidence files test-specific names to avoid conflicts should multiple retry tests run simultaneously. Even if today they are run sequentially, this may not always be true

Contributor Author:

You're right, there's a pending improvement here. I chose not to do this for now because, as written today, the sequential checks protect us, and we have some pressure to release. But I'm in favor of making this more robust long-term.

@davidmrdavid davidmrdavid changed the title [WIP] Second test of transient platformnotsupported exceptions Retry platform-level errors in the isolated process for .NET isolated Oct 1, 2024
Comment on lines +14 to +15
"maxQueuePollingInterval": "00:00:01",
"controlQueueVisibilityTimeout": "00:01:00"
Contributor Author:

This just helps make sure we retry orchestrator replays quickly enough after a platform-level error.

}
}
},
"functionTimeout": "00:00:30"
Contributor Author:

We don't time out sooner because our OOM-based test needs time to run out of memory.

Comment on lines +174 to +177
|| (functionResult.Exception?.InnerException?.GetType().ToString().Contains("WorkerProcessExitException") ?? false))
{
// TODO: the `WorkerProcessExitException` type is not exposed in our dependencies, it's part of WebJobs.Host.Script.
// Should we add that dependency or should it be exposed in WebJobs.Host?
Contributor Author:

I don't love the string-based comparison, but it will stay string-based unless we add WebJobs.Host.Script to our dependencies. Open to a discussion here, but please note we have a customer waiting, so let's be practical. I have no strong feelings; I'm just trying to deploy this fix.

Contributor:

WebJobs.Host.Script is not available for public consumption. We will need to work with the host team / WebJobs SDK to expose this information to extensions.

@davidmrdavid davidmrdavid marked this pull request as ready for review October 1, 2024 22:58
@cgillum cgillum left a comment


This is a great PR! I have some questions and comments.

@@ -71,15 +71,15 @@ internal bool TryGetOrchestrationErrorDetails(out Exception? failure)
return hasError;
}

internal void SetResult(IEnumerable<OrchestratorAction> actions, string customStatus)
internal void TrySetResult(IEnumerable<OrchestratorAction> actions, string customStatus)
Collaborator:

Non-blocking nit: Any method named TryXXX is usually expected to return a bool value indicating whether it succeeded or failed.
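For contrast, the conventional Try-pattern shape (a hypothetical sketch, not code from this PR) returns a bool and reports failure through an out parameter rather than throwing:

```csharp
// Conventional Try-pattern: the bool return signals success/failure, and the
// failure details travel through an out parameter instead of an exception.
internal bool TrySetResult(
    IEnumerable<OrchestratorAction> actions,
    string customStatus,
    out Exception? failure)
{
    try
    {
        SetResult(actions, customStatus); // throwing counterpart (hypothetical)
        failure = null;
        return true;
    }
    catch (Exception ex)
    {
        failure = ex;
        return false;
    }
}
```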

Contributor Author:

Yes, you're right. I chose Try to signal that the method may throw, but I know the convention makes this confusing. I'll revert this change.

Contributor Author:

fixed here: 090f8e7

{
if (failureDetails.InnerFailure?.IsCausedBy<OutOfMemoryException>() ?? false)
{
throw new OutOfMemoryException(failureDetails.ErrorMessage);
Collaborator:

Manually throwing OutOfMemoryException is a bit of a red flag. It's really only supposed to be thrown by the runtime. It's partly problematic because there are various pieces of code in .NET and elsewhere that treat OOM and similar "fatal" exceptions differently. In some cases, I think it's supposed to crash the process.

I strongly recommend we throw something else, like SessionAbortedException, which is designed for this exact scenario.
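A minimal sketch of that suggestion (assuming DurableTask.Core's SessionAbortedException is in scope here; illustrative, not the PR's final code):

```csharp
// Surface the platform-level failure as a SessionAbortedException so the
// Durable Task framework abandons the work item and retries it later,
// instead of re-throwing OutOfMemoryException, which parts of .NET treat
// as a fatal, process-crashing error.
if (failureDetails.InnerFailure?.IsCausedBy<OutOfMemoryException>() ?? false)
{
    throw new SessionAbortedException(failureDetails.ErrorMessage);
}
```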

Contributor Author:

Thanks for the pointer. I'm now throwing SessionAbortedException, see: 090f8e7

@@ -475,6 +475,24 @@
</summary>
<param name="result">The result.</param>
</member>
<member name="M:Microsoft.Azure.WebJobs.Extensions.DurableTask.RemoteOrchestratorContext.ThrowIfPlatformLevelException(DurableTask.Core.FailureDetails)">
Collaborator:

Any idea why this is showing up in our public documentation? I thought these APIs were internal?

Contributor Author:

My guess is that since I added method-level comments to the private method, they made it into the XML. I think this is probably just to power IntelliSense? Not sure, just a guess.

Collaborator:

It's not just IntelliSense, but also the public API docs that we publish, like this: https://learn.microsoft.com/en-us/dotnet/api/microsoft.azure.webjobs.extensions.durabletask?view=azure-dotnet. If it's not meant to be published, then we should remove it from this XML file. Might be worth doing a bit of research into when/why non-public APIs are getting included in this file.

Contributor Author:

Seems that adding the method-level comments was the culprit.
I removed them (moved them to the method body), and the xml change was undone: c927268

}
}

// assuming the orchestrator survived the OOM, delete the evidence file and return
Collaborator:

I'm not sure I understand this comment, and it's making me question whether I understand what this test is supposed to be doing. I'll ask simply: why would we expect an orchestrator to survive an OOM? It sounds like you're saying there was an OOM but it was handled and we kept running in spite of the OOM. Are you actually saying it recovered from a worker process crash (caused by the OOM)? Maybe OOM implies crash, but it's not 100% clear that's what you meant.

Contributor Author:

You're right, the wording here was not great.

I do not expect an orchestration invocation to continue running beyond an OOM.

What I meant by "survive" was:

  • orchestrator invocation A runs
  • invocation A OOMs
  • the DF extension abandons the invocation
  • the DF extension retries the orchestrator invocation (now invocation B; every invocation is unique)
  • invocation B does not OOM this time, and completes

So it "survived" because it was retried.

I'm hoping the new comments here make this clearer: a70d014
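Condensed, the evidence-file pattern the test uses to observe this retry looks roughly like this (a sketch assembled from the snippets in this PR, with the same `replayEvidence` path as the test code):

```csharp
// First attempt vs. retried attempt are distinguished by an on-disk marker
// stored outside /bin/output/, which rebuilds would otherwise wipe.
string evidenceFile = Path.Combine(Directory.GetCurrentDirectory(), "..", "..", "replayEvidence");
if (!File.Exists(evidenceFile))
{
    // First invocation (A): leave evidence that we ran, then simulate a
    // platform-level failure. The worker process dies here.
    File.Create(evidenceFile).Close();
    Environment.FailFast("Simulating crash!");
}

// Retried invocation (B): the evidence file proves attempt A crashed.
// Clean up and complete normally; the orchestration "survived" via retry.
File.Delete(evidenceFile);
```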

Environment.FailFast("Simulating crash!");
}

// assuming the orchestrator survived the OOM, delete the evidence file and return
Collaborator:

Does this comment need to be updated?

Contributor Author:

Good catch, this is the result of copy-pasting.
Updated: a70d014


// Function input comes from the request content.
string instanceId = await client.ScheduleNewOrchestrationInstanceAsync(
nameof(TimeoutOrchestrator));
Collaborator:

Rather than replicating the same HTTP function once for each scenario, it might make more sense to have a single HTTP trigger that takes an orchestrator function name as a parameter.
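One possible shape for that shared trigger, sketched against the .NET isolated programming model (names here are illustrative, not from this PR):

```csharp
[Function("StartOrchestrator")]
public static async Task<HttpResponseData> StartOrchestrator(
    [HttpTrigger(AuthorizationLevel.Anonymous, "get", Route = "start/{orchestratorName}")] HttpRequestData req,
    [DurableClient] DurableTaskClient client,
    string orchestratorName)
{
    // The orchestrator name comes from the route, so one endpoint covers every scenario.
    string instanceId = await client.ScheduleNewOrchestrationInstanceAsync(orchestratorName);
    return await client.CreateCheckStatusResponseAsync(req, instanceId);
}
```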

Contributor Author (@davidmrdavid, Oct 2, 2024):

I don't feel strongly here, but my personal preference is to have specific HTTP endpoints for each orchestrator. When testing locally, it saves me time to just copy-paste the entire URL to trigger the orchestrator.

But it's no big deal. If you feel strongly, I'll generalize it. Just noting this was a deliberate choice.

Collaborator:

I don't feel strongly so I'll leave it up to you, but it would enable all the usual benefits of code sharing (easier to add new tests, better consistency b/c only one place to make updates, etc.)


jviau commented Oct 1, 2024

Summarizing some offline discussion:

  1. Handling timeouts, worker crashes, and host out of memories is good
  2. We should not handle worker out-of-memory. This is language-dependent, and each language should handle OOM in its own way
    • This should be left for future work. We will need to see if there is an existing mechanism for a language worker to signal an orchestration should be retried. If not, that mechanism will need to be added.
  3. I will follow up on the host side to see if there is an existing way to detect worker crashes or other platform-level failures (instead of checking exception type by string). If not, we can file an issue for that ask (and proceed with the string type check for now)

@davidmrdavid (Contributor Author):

Thanks @jviau. I'll delegate this bit to you then:

I will follow up on the host side to see if there is an existing way to detect worker crashes or other platform-level failures (instead of checking exception type by string). If not, we can file an issue for that ask (and proceed with the string type check for now)

As for the rest, I'll see what we can do. OOM is the key case we're trying to unblock via this PR (the rest are related "freebies"). @andystaples will start an internal thread about this.


jviau commented Oct 3, 2024

I think all of the points I brought up can also be addressed after this PR, but it is something we should commit to doing ASAP.

  1. Adopting a more 'official' way to identify platform issues / worker crashes
  2. Implementing/consuming a way for workers to elect to abandon a session (which they can then use for OOM scenarios)

// - a worker process exit
if (functionResult.Exception is Host.FunctionTimeoutException
|| functionResult.Exception?.InnerException is SessionAbortedException // see RemoteOrchestrationContext.TrySetResultInternal for details on OOM-handling
|| (functionResult.Exception?.InnerException?.GetType().ToString().Contains("WorkerProcessExitException") ?? false))
Contributor:

Did you verify that WorkerProcessExitException is actually available here? Looking at the function host code, I don't see the exception actually thrown: https://github.com/Azure/azure-functions-host/blob/1088f24c3ae3a6275f18cbf091fa525c2477be91/src/WebJobs.Script/Workers/ProcessManagement/WorkerProcess.cs#L179-L184

Contributor Author:

Yes, that's how we stumbled upon this error type in the first place. We found this error by forcing a process exit (Environment.FailFast) and placing a breakpoint at this position.

But yeah I admit that I also don't see the explicit throw here. We can investigate further, or I can show you the runtime behavior through the debugger.


@davidmrdavid davidmrdavid merged commit 8a5db71 into dev Oct 3, 2024
14 checks passed
@davidmrdavid davidmrdavid deleted the dajusto/add-failed-orch-tests branch October 3, 2024 21:32