Fix: Host shutdown blocked by extended session idle timeout#1311
Fix: Host shutdown blocked by extended session idle timeout#1311andystaples merged 5 commits intomainfrom
Conversation
There was a problem hiding this comment.
Pull request overview
Fixes Azure Storage backend shutdown delays when extended sessions are enabled by ensuring session waits are cancellable and by honoring forced shutdown semantics to avoid waiting for sessions to drain.
Changes:
- Propagates a shutdown
CancellationTokenintoOrchestrationSessionand uses it to cancel idle waits inFetchNewOrchestrationMessagesAsync. - Adds
OrchestrationSessionManager.AbortAllSessions()and calls it fromAzureStorageOrchestrationService.StopAsync(isForced: true)to unblock draining during forced shutdown. - Adds new tests intended to validate cancellation behavior and forced shutdown session abort behavior.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| src/DurableTask.AzureStorage/OrchestrationSessionManager.cs | Stores shutdown token for sessions; adds AbortAllSessions() to unblock draining during forced shutdown. |
| src/DurableTask.AzureStorage/Messaging/OrchestrationSession.cs | Cancels extended-session idle waits immediately on shutdown token cancellation. |
| src/DurableTask.AzureStorage/AzureStorageOrchestrationService.cs | Honors isForced by aborting sessions before stopping lease manager. |
| Test/DurableTask.AzureStorage.Tests/OrchestrationSessionTests.cs | Adds tests for cancellation/forced shutdown behavior (but currently not wired into the repo’s test project). |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
You can also share your feedback on Copilot code review. Take the survey.
…ect' Co-authored-by: Copilot Autofix powered by AI <223894421+github-code-quality[bot]@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
You can also share your feedback on Copilot code review. Take the survey.
…://github.com/Azure/durabletask into andystaples/fix-extended-sessions-polling-bug
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
You can also share your feedback on Copilot code review. Take the survey.
| { | ||
| this.activeOrchestrationSessions.Clear(); |
There was a problem hiding this comment.
AbortAllSessions() clears only activeOrchestrationSessions, but leaves pendingOrchestrationMessageBatches and the ready-for-processing AsyncQueue buffers intact. Because OrchestrationSessionManager is reused across StartAsync/StopAsync, a forced stop could leave stale in-memory batches that may get processed after a subsequent StartAsync (or just retain memory). Consider also clearing the pending/ready buffers (or recreating the OrchestrationSessionManager on restart) when aborting sessions.
| { | |
| this.activeOrchestrationSessions.Clear(); | |
| { | |
| // Clear any active sessions as well as all pending in-memory message batches | |
| // so that no stale work is processed after a subsequent StartAsync. | |
| this.activeOrchestrationSessions.Clear(); | |
| this.pendingOrchestrationMessageBatches.Clear(); |
| // Use reflection to access the internal sessions dictionary. | ||
| var sessionsField = typeof(OrchestrationSessionManager) | ||
| .GetField("activeOrchestrationSessions", BindingFlags.NonPublic | BindingFlags.Instance); | ||
| var sessions = (Dictionary<string, OrchestrationSession>)sessionsField.GetValue(manager); |
There was a problem hiding this comment.
This test uses reflection to reach into the private activeOrchestrationSessions field and assumes both the exact field name and concrete type. That makes the test brittle (a simple rename/type change becomes a runtime NullReferenceException/InvalidCastException). Consider asserting sessionsField != null before GetValue, and prefer exercising the manager via exposed/internal APIs (or an internal test hook) to avoid hard-coding private implementation details.
| var sessions = (Dictionary<string, OrchestrationSession>)sessionsField.GetValue(manager); | |
| Assert.IsNotNull( | |
| sessionsField, | |
| "Expected OrchestrationSessionManager to have a non-public instance field named 'activeOrchestrationSessions'."); | |
| var sessions = sessionsField.GetValue(manager) as IDictionary<string, OrchestrationSession>; | |
| Assert.IsNotNull( | |
| sessions, | |
| "Expected 'activeOrchestrationSessions' field to be assignable to IDictionary<string, OrchestrationSession>."); |
Problem
When extended sessions are enabled, every host stop is delayed by exactly ExtendedSessionIdleTimeoutInSeconds (default 30s). This affects deployments, scale-down events, and restarts.
Root Cause
Two bugs in the shutdown path:
Bug 1 — No cancellation token on session wait: OrchestrationSession.FetchNewOrchestrationMessagesAsync calls messagesAvailableEvent.WaitAsync(this.idleTimeout) without a cancellation token. During shutdown, shutdownSource is cancelled, but this cancellation never reaches active sessions — each session blocks for the full idle timeout before it can be drained. The AsyncAutoResetEvent already supports a CancellationToken overload; it just wasn't being used.
Bug 2 — isForced parameter ignored: AzureStorageOrchestrationService.StopAsync(bool isForced) completely ignores the isForced parameter. Whether forced or graceful, shutdown always waits for sessions to drain naturally.
Fix
Bug 1: Pass the shutdownSource.Token through OrchestrationSessionManager into each OrchestrationSession, and use the two-argument WaitAsync(timeout, cancellationToken) overload in FetchNewOrchestrationMessagesAsync. When the shutdown token is cancelled, the wait exits immediately (returning null), which causes the DTFx dispatcher to close the session.
Bug 2: Add OrchestrationSessionManager.AbortAllSessions() which clears all active sessions, unblocking DrainAsync immediately. AzureStorageOrchestrationService.StopAsync now calls this method when isForced is true, before awaiting appLeaseManager.StopAsync().