Skip to content

Fix: Host shutdown blocked by extended session idle timeout#1311

Merged
andystaples merged 5 commits intomainfrom
andystaples/fix-extended-sessions-polling-bug
Mar 10, 2026
Merged

Fix: Host shutdown blocked by extended session idle timeout#1311
andystaples merged 5 commits intomainfrom
andystaples/fix-extended-sessions-polling-bug

Conversation

@andystaples
Copy link
Contributor

@andystaples andystaples commented Mar 6, 2026

Problem
When extended sessions are enabled, every host stop is delayed by exactly ExtendedSessionIdleTimeoutInSeconds (default 30s). This affects deployments, scale-down events, and restarts.

Root Cause
Two bugs in the shutdown path:

Bug 1 — No cancellation token on session wait: OrchestrationSession.FetchNewOrchestrationMessagesAsync calls messagesAvailableEvent.WaitAsync(this.idleTimeout) without a cancellation token. During shutdown, shutdownSource is cancelled, but this cancellation never reaches active sessions — each session blocks for the full idle timeout before it can be drained. The AsyncAutoResetEvent already supports a CancellationToken overload; it just wasn't being used.

Bug 2 — isForced parameter ignored: AzureStorageOrchestrationService.StopAsync(bool isForced) completely ignores the isForced parameter. Whether forced or graceful, shutdown always waits for sessions to drain naturally.

Fix
Bug 1: Pass the shutdownSource.Token through OrchestrationSessionManager into each OrchestrationSession, and use the two-argument WaitAsync(timeout, cancellationToken) overload in FetchNewOrchestrationMessagesAsync. When the shutdown token is cancelled, the wait exits immediately (returning null), which causes the DTFx dispatcher to close the session.

Bug 2: Add OrchestrationSessionManager.AbortAllSessions() which clears all active sessions, unblocking DrainAsync immediately. AzureStorageOrchestrationService.StopAsync now calls this method when isForced is true, before awaiting appLeaseManager.StopAsync().

Copilot AI review requested due to automatic review settings March 6, 2026 23:07
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes Azure Storage backend shutdown delays when extended sessions are enabled by ensuring session waits are cancellable and by honoring forced shutdown semantics to avoid waiting for sessions to drain.

Changes:

  • Propagates a shutdown CancellationToken into OrchestrationSession and uses it to cancel idle waits in FetchNewOrchestrationMessagesAsync.
  • Adds OrchestrationSessionManager.AbortAllSessions() and calls it from AzureStorageOrchestrationService.StopAsync(isForced: true) to unblock draining during forced shutdown.
  • Adds new tests intended to validate cancellation behavior and forced shutdown session abort behavior.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

File Description
src/DurableTask.AzureStorage/OrchestrationSessionManager.cs Stores shutdown token for sessions; adds AbortAllSessions() to unblock draining during forced shutdown.
src/DurableTask.AzureStorage/Messaging/OrchestrationSession.cs Cancels extended-session idle waits immediately on shutdown token cancellation.
src/DurableTask.AzureStorage/AzureStorageOrchestrationService.cs Honors isForced by aborting sessions before stopping lease manager.
Test/DurableTask.AzureStorage.Tests/OrchestrationSessionTests.cs Adds tests for cancellation/forced shutdown behavior (but currently not wired into the repo’s test project).

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

You can also share your feedback on Copilot code review. Take the survey.

…ect'

Co-authored-by: Copilot Autofix powered by AI <223894421+github-code-quality[bot]@users.noreply.github.com>
Copilot AI review requested due to automatic review settings March 6, 2026 23:34
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

You can also share your feedback on Copilot code review. Take the survey.

Copilot AI review requested due to automatic review settings March 10, 2026 18:40
@andystaples andystaples merged commit 06d542d into main Mar 10, 2026
4 of 5 checks passed
@andystaples andystaples deleted the andystaples/fix-extended-sessions-polling-bug branch March 10, 2026 18:40
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

You can also share your feedback on Copilot code review. Take the survey.

Comment on lines +672 to +673
{
this.activeOrchestrationSessions.Clear();
Copy link

Copilot AI Mar 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AbortAllSessions() clears only activeOrchestrationSessions, but leaves pendingOrchestrationMessageBatches and the ready-for-processing AsyncQueue buffers intact. Because OrchestrationSessionManager is reused across StartAsync/StopAsync, a forced stop could leave stale in-memory batches that may get processed after a subsequent StartAsync (or just retain memory). Consider also clearing the pending/ready buffers (or recreating the OrchestrationSessionManager on restart) when aborting sessions.

Suggested change
{
this.activeOrchestrationSessions.Clear();
{
// Clear any active sessions as well as all pending in-memory message batches
// so that no stale work is processed after a subsequent StartAsync.
this.activeOrchestrationSessions.Clear();
this.pendingOrchestrationMessageBatches.Clear();

Copilot uses AI. Check for mistakes.
// Use reflection to access the internal sessions dictionary.
var sessionsField = typeof(OrchestrationSessionManager)
.GetField("activeOrchestrationSessions", BindingFlags.NonPublic | BindingFlags.Instance);
var sessions = (Dictionary<string, OrchestrationSession>)sessionsField.GetValue(manager);
Copy link

Copilot AI Mar 10, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test uses reflection to reach into the private activeOrchestrationSessions field and assumes both the exact field name and concrete type. That makes the test brittle (a simple rename/type change becomes a runtime NullReferenceException/InvalidCastException). Consider asserting sessionsField != null before GetValue, and prefer exercising the manager via exposed/internal APIs (or an internal test hook) to avoid hard-coding private implementation details.

Suggested change
var sessions = (Dictionary<string, OrchestrationSession>)sessionsField.GetValue(manager);
Assert.IsNotNull(
sessionsField,
"Expected OrchestrationSessionManager to have a non-public instance field named 'activeOrchestrationSessions'.");
var sessions = sessionsField.GetValue(manager) as IDictionary<string, OrchestrationSession>;
Assert.IsNotNull(
sessions,
"Expected 'activeOrchestrationSessions' field to be assignable to IDictionary<string, OrchestrationSession>.");

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants