
A few fixes in the threadpool semaphore. Unify Windows/Unix implementation of LIFO policy. #123921

Merged
VSadov merged 25 commits into dotnet:main from VSadov:lifo on Feb 15, 2026

Conversation

@VSadov
Member

@VSadov VSadov commented Feb 2, 2026

Re: #123159

Changes:

  • Correctly handle Backoff.Exponential(0).

Embarrassing bug.
To get an exponentially growing random spin count for an iteration we generate a pseudorandom uint and shift it right by (32 - attempt). Since C# masks the shift count with 31, when attempt == 0 we end up not shifting at all, and the first iteration gets a large random spin count.
That caused many noisy results and, interestingly, some improvements (in scenarios that benefit from very long spins). A minimal sketch of the issue is shown below.
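A minimal sketch of the masked-shift behavior (illustrative only, not the actual Backoff code):

```csharp
// C# masks the shift count of a 32-bit operand with 0x1F, so shifting by 32
// is the same as shifting by 0.
uint random = 0xDEADBEEF;               // stand-in for the pseudorandom value

int attempt = 0;
uint spins = random >> (32 - attempt);  // 32 & 0x1F == 0 => no shift at all
// spins == 0xDEADBEEF: a huge spin count on the very first attempt

attempt = 1;
spins = random >> (32 - attempt);       // shift by 31 => spins is 0 or 1, as intended
```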

  • Unified the Windows and Unix LIFO policy implementations into a single lightweight, minimal implementation of LIFO waiting.

Once we are done spinning, we block threads, and when workers are needed again we wake them in LIFO order.

The Unix WaitSubsystem is pretty heavy for these needs. It supports interruptible waits, waiting on multiple objects, etc. None of that is interesting here. Most calls into the subsystem take a global process-wide lock, which can contend under load with other uses, and a thread waking workers may contend with workers going to sleep, and so on.

Windows used an opaque GetQueuedCompletionStatus for its side effect of releasing threads in LIFO order when a completion is posted, with unknown overheads and interactions, although it is typically more efficient than the Unix WaitSubsystem.

The portable implementation seems to be faster than either of the platform-specific ones.
(measured by disabling spinning and running a few latency-sensitive benchmarks).

The portable implementation is also easier to reason about, and anomalies are easier to debug. A minimal, illustrative sketch of LIFO blocking is shown below.
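A minimal sketch of the LIFO wake order, assuming a lock and a per-waiter event; this is not the runtime's LowLevelLifoSemaphore, which blocks on an address (futex/WaitOnAddress) and avoids per-wait allocations:

```csharp
using System.Collections.Generic;
using System.Threading;

// Illustrative only: waiters park on their own event and are woken newest-first.
internal sealed class LifoSemaphoreSketch
{
    private readonly object _lock = new object();
    private readonly Stack<ManualResetEventSlim> _waiters = new Stack<ManualResetEventSlim>();
    private int _pendingSignals;

    public void Wait()
    {
        ManualResetEventSlim waiter;
        lock (_lock)
        {
            if (_pendingSignals > 0) { _pendingSignals--; return; }
            waiter = new ManualResetEventSlim(false);
            _waiters.Push(waiter);          // the most recent waiter is woken first
        }
        waiter.Wait();
        waiter.Dispose();
    }

    public void Release()
    {
        ManualResetEventSlim waiter = null;
        lock (_lock)
        {
            if (_waiters.Count > 0) waiter = _waiters.Pop();   // LIFO order
            else _pendingSignals++;                            // remember the signal
        }
        waiter?.Set();
    }
}
```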

  • Adaptive spinning in the threadpool based on estimates of CPU core availability.

Spinning in the threadpool is very tricky, and the benefits of spinning differ greatly between scenarios. For some scenarios, the longer the spin the better. But there are scenarios that benefit when the threadpool releases cores quickly once it sees no work. No preset, fixed spin count is going to be good for everything.

An adaptive approach appears to be necessary to improve some scenarios without regressing many others.
We can further improve the heuristic if there are more ideas. A hypothetical sketch of this kind of heuristic is shown below.
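A hypothetical sketch of an availability-based spin policy; the actual heuristic in this PR may differ, and the parameter names here are made up for illustration:

```csharp
// Scale the spin budget by an estimate of how many cores are currently idle:
// spin longer when cores appear free, and block immediately when the machine
// looks saturated so the core is released to other work.
static int GetSpinLimit(int processorCount, int activeThreadCount, int baseSpinLimit)
{
    int estimatedFreeCores = processorCount - activeThreadCount;
    if (estimatedFreeCores <= 0)
        return 0;                                   // no spare cores: do not spin

    return baseSpinLimit * estimatedFreeCores / processorCount;
}
```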

@dotnet-policy-service
Contributor

Tagging subscribers to this area: @agocke, @VSadov
See info in area-owners.md if you want to be subscribed.

Contributor

Copilot AI left a comment

Pull request overview

This PR addresses performance regressions in the threadpool semaphore (issue #123159) and unifies the Windows/Unix implementation of the LIFO (Last-In-First-Out) policy for threadpool worker thread management.

Changes:

  • Introduces a unified LowLevelThreadBlocker class that uses OS-provided compare-and-wait APIs (futex on Linux, WaitOnAddress on Windows) for efficient thread blocking, with a fallback to a monitor-based implementation on other platforms
  • Refactors LowLevelLifoSemaphore to use the new blocker infrastructure, removes the platform-specific Windows/Unix implementations, and improves the spinning heuristics based on CPU availability
  • Adds native futex support for Linux through syscalls and Windows WaitOnAddress API interop
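As a rough illustration of the compare-and-wait pattern mentioned above, here is a minimal Windows-side sketch; it is not the PR's Interop.WaitOnAddress.cs, and the wrapper names are made up:

```csharp
using System.Runtime.InteropServices;

// Hypothetical wrapper around the documented Win32 wait-on-address APIs.
internal static unsafe class WaitOnAddressSketch
{
    [DllImport("api-ms-win-core-synch-l1-2-0.dll", SetLastError = true)]
    private static extern int WaitOnAddress(void* address, void* compareAddress, nuint addressSize, uint dwMilliseconds);

    [DllImport("api-ms-win-core-synch-l1-2-0.dll")]
    private static extern void WakeByAddressSingle(void* address);

    // Blocks while *state == expected, or until the timeout elapses.
    public static void Wait(ref int state, int expected, uint timeoutMs)
    {
        fixed (int* addr = &state)
        {
            int comparand = expected;
            WaitOnAddress(addr, &comparand, (nuint)sizeof(int), timeoutMs);
        }
    }

    // Wakes one thread blocked in WaitOnAddress on the same location.
    public static void Wake(ref int state)
    {
        fixed (int* addr = &state)
        {
            WakeByAddressSingle(addr);
        }
    }
}
```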

Reviewed changes

Copilot reviewed 17 out of 17 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| src/native/libs/System.Native/pal_threading.h | Adds declarations for Linux futex operations |
| src/native/libs/System.Native/pal_threading.c | Implements futex wait/wake operations for Linux using syscalls |
| src/native/libs/System.Native/entrypoints.c | Registers new futex entrypoints for Linux |
| src/libraries/System.Private.CoreLib/src/System/Threading/PortableThreadPool.WorkerThread.cs | Fixes spelling, removes spin count configuration, passes active thread count to semaphore |
| src/libraries/System.Private.CoreLib/src/System/Threading/LowLevelThreadBlocker.cs | New class providing portable thread blocking using futex/WaitOnAddress or a monitor fallback |
| src/libraries/System.Private.CoreLib/src/System/Threading/LowLevelLifoSemaphore.cs | Major refactoring to use LowLevelThreadBlocker, implements a LIFO queue with pending signals, improves spin heuristics |
| src/libraries/System.Private.CoreLib/src/System/Threading/LowLevelLifoSemaphore.Windows.cs | Deleted; functionality moved to the unified implementation |
| src/libraries/System.Private.CoreLib/src/System/Threading/LowLevelLifoSemaphore.Unix.cs | Deleted; functionality moved to the unified implementation |
| src/libraries/System.Private.CoreLib/src/System/Threading/LowLevelFutex.Windows.cs | New file providing a Windows WaitOnAddress API wrapper |
| src/libraries/System.Private.CoreLib/src/System/Threading/LowLevelFutex.Unix.cs | New file providing a Linux futex wrapper |
| src/libraries/System.Private.CoreLib/src/System/Threading/Backoff.cs | Modified to return the spin count and skip spinning on the first attempt |
| src/libraries/System.Private.CoreLib/src/System.Private.CoreLib.Shared.projitems | Updates the project to include new files and remove the deleted platform-specific files |
| src/libraries/Common/src/Interop/Windows/Kernel32/Interop.WaitOnAddress.cs | New interop declarations for the Windows WaitOnAddress and WakeByAddressSingle APIs |
| src/libraries/Common/src/Interop/Windows/Kernel32/Interop.CriticalSection.cs | Adds SuppressGCTransition attribute to LeaveCriticalSection |
| src/libraries/Common/src/Interop/Windows/Kernel32/Interop.ConditionVariable.cs | Adds SuppressGCTransition attribute to WakeConditionVariable |
| src/libraries/Common/src/Interop/Unix/System.Native/Interop.LowLevelMonitor.cs | Adds SuppressGCTransition attributes to Release and Signal_Release |
| src/libraries/Common/src/Interop/Unix/System.Native/Interop.Futex.cs | New interop declarations for Linux futex operations |

Copilot AI review requested due to automatic review settings February 3, 2026 00:16
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 17 out of 17 changed files in this pull request and generated 4 comments.

Copilot AI review requested due to automatic review settings February 3, 2026 01:35
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 17 out of 17 changed files in this pull request and generated 6 comments.

@VSadov
Member Author

VSadov commented Feb 3, 2026

One test that was affected by #123159 is
System.Buffers.Tests.RentReturnArrayPoolTests<Byte>.ProducerConsumer

The test involves one thread renting an array, mutating it, and passing it to another thread via a one-element buffer; the other thread inspects the buffer and releases the array, and so on.
In particular, there is a scenario where both sides wait synchronously on the async result of a buffer operation. There is an occasional race where IsCompleted on the buffer operation returns false, but the subsequent OnCompleted sees a completed async operation. Since it can't attach a continuation to an already completed result, it posts a work item to the threadpool and attaches the continuation to that. As a result, once in a while one of the threads playing buffer ping-pong effectively waits on the completion of such a work item.

Since the scenario needs to wait on such a task only occasionally, how frequently the need arises varies with the environment (CPU speed, memory speed, ...), but generally the test is sensitive to whether the threadpool spins long enough to execute the task without waking a thread.

The results after this PR, vs baseline:

=== Linux x64
(Azure VM, so it is what it is, but the test has little guest/host interaction; it is not IO-heavy at all)

  • baseline:
| Method | RentalSize | ManipulateArray | Async | UseSharedPool | Mean | Error | StdDev | Median | Min | Max | Gen0 | Allocated |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ProducerConsumer | 4096 | False | False | False | 1.787 us | 0.0995 us | 0.1146 us | 1.800 us | 1.5178 us | 1.997 us | 0.0050 | 84 B |
| ProducerConsumer | 4096 | False | False | True | 1.836 us | 0.1398 us | 0.1610 us | 1.859 us | 1.3978 us | 2.057 us | - | 82 B |
| ProducerConsumer | 4096 | False | True | False | 1.035 us | 0.0642 us | 0.0739 us | 1.052 us | 0.7293 us | 1.074 us | - | - |
| ProducerConsumer | 4096 | False | True | True | 1.095 us | 0.0601 us | 0.0692 us | 1.122 us | 0.8213 us | 1.135 us | - | - |
| ProducerConsumer | 4096 | True | False | False | 1.927 us | 0.0647 us | 0.0692 us | 1.939 us | 1.7744 us | 2.025 us | - | 6 B |
| ProducerConsumer | 4096 | True | False | True | 1.952 us | 0.0584 us | 0.0672 us | 1.952 us | 1.7911 us | 2.045 us | - | 2 B |
| ProducerConsumer | 4096 | True | True | False | 1.494 us | 0.0366 us | 0.0422 us | 1.491 us | 1.4217 us | 1.575 us | - | - |
| ProducerConsumer | 4096 | True | True | True | 1.875 us | 0.0916 us | 0.1055 us | 1.879 us | 1.6830 us | 2.075 us | - | - |
  • after the PR:
    (mostly improvements, though some scenarios really like long spins...)
| Method | RentalSize | ManipulateArray | Async | UseSharedPool | Mean | Error | StdDev | Median | Min | Max | Gen0 | Allocated |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ProducerConsumer | 4096 | False | False | False | 1,706.1 ns | 122.59 ns | 141.18 ns | 1,709.4 ns | 1,486.7 ns | 1,921.7 ns | 0.0050 | 83 B |
| ProducerConsumer | 4096 | False | False | True | 1,737.3 ns | 134.77 ns | 155.20 ns | 1,723.8 ns | 1,558.4 ns | 2,107.2 ns | - | 83 B |
| ProducerConsumer | 4096 | False | True | False | 855.6 ns | 25.55 ns | 28.40 ns | 860.4 ns | 740.1 ns | 869.3 ns | - | - |
| ProducerConsumer | 4096 | False | True | True | 1,035.9 ns | 21.40 ns | 23.79 ns | 1,038.6 ns | 965.3 ns | 1,065.8 ns | - | - |
| ProducerConsumer | 4096 | True | False | False | 1,450.8 ns | 53.79 ns | 57.55 ns | 1,446.4 ns | 1,326.9 ns | 1,572.2 ns | - | 1 B |
| ProducerConsumer | 4096 | True | False | True | 2,383.5 ns | 261.28 ns | 300.89 ns | 2,273.3 ns | 2,037.4 ns | 3,040.7 ns | - | 9 B |
| ProducerConsumer | 4096 | True | True | False | 1,503.2 ns | 57.55 ns | 66.28 ns | 1,512.2 ns | 1,375.6 ns | 1,584.2 ns | - | - |
| ProducerConsumer | 4096 | True | True | True | 1,842.5 ns | 68.79 ns | 79.22 ns | 1,837.4 ns | 1,712.4 ns | 2,035.9 ns | - | - |

@VSadov
Member Author

VSadov commented Feb 3, 2026

Same tests on Windows:
(clearly an improvement)

BenchmarkDotNet v0.14.1-nightly.20250107.205, Windows 11 (10.0.26200.7623)
AMD Ryzen 9 7950X 4.50GHz, 1 CPU, 32 logical and 16 physical cores

=== baseline:

| Method | RentalSize | ManipulateArray | Async | UseSharedPool | Mean | Error | StdDev | Median | Min | Max | Gen0 | Allocated |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ProducerConsumer | 4096 | False | False | False | 552.4 ns | 23.25 ns | 25.84 ns | 549.7 ns | 514.5 ns | 601.0 ns | 0.0050 | 84 B |
| ProducerConsumer | 4096 | False | False | True | 734.0 ns | 26.71 ns | 29.69 ns | 730.6 ns | 671.1 ns | 795.4 ns | 0.0025 | 83 B |
| ProducerConsumer | 4096 | False | True | False | 325.9 ns | 13.82 ns | 14.19 ns | 325.1 ns | 304.5 ns | 358.8 ns | - | - |
| ProducerConsumer | 4096 | False | True | True | 367.9 ns | 7.89 ns | 9.09 ns | 369.1 ns | 332.3 ns | 376.9 ns | - | - |
| ProducerConsumer | 4096 | True | False | False | 1,390.5 ns | 447.74 ns | 515.62 ns | 1,050.2 ns | 959.7 ns | 2,085.7 ns | - | 58 B |
| ProducerConsumer | 4096 | True | False | True | 1,380.9 ns | 32.70 ns | 37.65 ns | 1,386.5 ns | 1,286.9 ns | 1,437.0 ns | - | 32 B |
| ProducerConsumer | 4096 | True | True | False | 886.0 ns | 17.08 ns | 18.28 ns | 889.2 ns | 852.0 ns | 922.1 ns | - | - |
| ProducerConsumer | 4096 | True | True | True | 1,012.1 ns | 19.45 ns | 18.20 ns | 1,007.6 ns | 979.7 ns | 1,043.3 ns | - | - |

=== this PR:

| Method | RentalSize | ManipulateArray | Async | UseSharedPool | Mean | Error | StdDev | Median | Min | Max | Gen0 | Allocated |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ProducerConsumer | 4096 | False | False | False | 395.7 ns | 26.11 ns | 25.64 ns | 384.3 ns | 373.3 ns | 462.0 ns | 0.0050 | 84 B |
| ProducerConsumer | 4096 | False | False | True | 498.8 ns | 73.72 ns | 81.94 ns | 480.5 ns | 399.9 ns | 667.2 ns | 0.0050 | 83 B |
| ProducerConsumer | 4096 | False | True | False | 249.4 ns | 4.31 ns | 3.37 ns | 249.8 ns | 243.8 ns | 257.0 ns | - | - |
| ProducerConsumer | 4096 | False | True | True | 321.5 ns | 5.30 ns | 4.42 ns | 319.3 ns | 317.4 ns | 330.1 ns | - | - |
| ProducerConsumer | 4096 | True | False | False | 960.5 ns | 30.51 ns | 35.14 ns | 955.2 ns | 914.2 ns | 1,035.8 ns | - | 5 B |
| ProducerConsumer | 4096 | True | False | True | 1,255.1 ns | 25.08 ns | 24.63 ns | 1,256.5 ns | 1,209.8 ns | 1,304.5 ns | - | 30 B |
| ProducerConsumer | 4096 | True | True | False | 883.0 ns | 15.86 ns | 14.83 ns | 882.2 ns | 863.3 ns | 917.1 ns | - | - |
| ProducerConsumer | 4096 | True | True | True | 1,056.4 ns | 19.67 ns | 18.40 ns | 1,059.9 ns | 1,014.1 ns | 1,087.3 ns | - | - |

@VSadov
Member Author

VSadov commented Feb 3, 2026

TE benchmarks seem to favor the change as well.

Unlike the ProducerConsumer microbenchmark, TE does not like long threadpool spins, likely because there are non-threadpool threads such as epoll threads.
This shows that the threadpool heuristics can make the right adjustments.

Using command:

crank --config https://raw.githubusercontent.com/aspnet/Benchmarks/main/scenarios/json.benchmarks.yml --scenario json    --profile aspnet-gold-lin  --application.framework net11.0 --application.options.outputFiles <. . .>

=== Baseline:

| First Request (ms)        | 172                 |
| Requests/sec              | 1,828,617           |
| Requests                  | 27,611,979          |
| Mean latency (ms)         | 0.14                |
| Max latency (ms)          | 12.27               |
| Bad responses             | 0                   |
| Socket errors             | 0                   |
| Read throughput (MB/s)    | 291.23              |
| Latency 50th (ms)         | 0.12                |
| Latency 75th (ms)         | 0.16                |
| Latency 90th (ms)         | 0.22                |
| Latency 99th (ms)         | 0.37                |

=== This PR:

| First Request (ms)        | 171                 |
| Requests/sec              | 1,846,521           |
| Requests                  | 27,882,744          |
| Mean latency (ms)         | 0.14                |
| Max latency (ms)          | 7.00                |
| Bad responses             | 0                   |
| Socket errors             | 0                   |
| Read throughput (MB/s)    | 294.08              |
| Latency 50th (ms)         | 0.12                |
| Latency 75th (ms)         | 0.16                |
| Latency 90th (ms)         | 0.22                |
| Latency 99th (ms)         | 0.37                |

Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 17 out of 17 changed files in this pull request and generated no new comments.

Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 17 out of 17 changed files in this pull request and generated 3 comments.

Copilot AI review requested due to automatic review settings February 14, 2026 22:22
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 21 out of 21 changed files in this pull request and generated no new comments.

@VSadov VSadov merged commit fbf6be5 into dotnet:main Feb 15, 2026
176 checks passed
@VSadov VSadov deleted the lifo branch February 15, 2026 02:14
@VSadov
Member Author

VSadov commented Feb 15, 2026

Thanks!!!
