fixing a bug in card mark stealing #117968
Merged
Maoni0 merged 1 commit into dotnet:main on Jul 24, 2025
Pull Request Overview
This PR fixes a race condition bug in the garbage collector's card mark stealing mechanism where multiple threads could incorrectly manage card table state across 2MB boundaries. The fix prevents one thread from clearing cards beyond its assigned 2MB card stealing unit, which could lead to inconsistent state between card bits and card bundle bits.
- Introduces proper boundary checking when clearing cards in `card_transition`
- Ensures card clearing operations are clamped to the card stealing unit limit
- Prevents race conditions that could leave cards set without corresponding card bundle bits
Contributor:
Tagging subscribers to this area: @dotnet/gc
mangod9 approved these changes Jul 23, 2025
Member, Author:
I also did some stress runs and didn't find any problems.
Member, Author:
/ba-g Known issue dotnet/dnceng#6004
Member, Author:
/backport to release/8.0-staging
Contributor:
Started backporting to release/8.0-staging: https://github.com/dotnet/runtime/actions/runs/16489375887
Maoni0 pushed a commit that referenced this pull request Jul 30, 2025: Backport of #117968 to release/8.0-staging
this fixes a problem with card mark stealing where we missed clamping the card clearing by the card stealing unit in `card_transition`. for this bug to appear the following conditions need to be met: an object A straddles the 2mb card stealing unit boundary, and originally for that object a card below the 2mb boundary and a card that corresponds to at least 256 bytes above the 2mb boundary are set, and there are no reference fields in between.
one thread T0 is working on the 1st 2mb and discovers A and the first set card bit. this card doesn't need to be set, so `poo` is set to the address that's described by the 2nd card since there are no reference fields in between. so `card_transition` is called, which will call `clear_cards` on [1st card, 2nd card), and it stops at this line: `card_table [end_word] &= highbits (~0, bits);` where it has read the word with the 2nd card (`end_card`) still set, but has not yet written it back to `card_table[end_word]`.
meanwhile, another thread T1 needs to be working on the memory starting from this 2mb boundary. it discovers the 2nd card doesn't need to be set, and none of the cards that correspond to the card bundle bit need to be set, so it clears the cards and the card bundle bit.
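
The step T0 is in the middle of can be sketched as follows. This is a simplified illustration, not the runtime's code: `clear_cards_sketch`, the `std::vector` card table and the 32-cards-per-word width are assumptions made for the example. It shows that clearing the half-open range [start card, end card) keeps the end card's bit, but still loads and stores the whole word that holds it.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumption for this sketch: 32 card bits per card-table word.
constexpr unsigned kCardWordWidth = 32;

inline size_t card_word(size_t card) { return card / kCardWordWidth; }
inline unsigned card_bit(size_t card) { return static_cast<unsigned>(card % kCardWordWidth); }

// Keep only the bits below `bits` (lowbits) or at and above `bits` (highbits).
inline uint32_t lowbits(uint32_t wrd, unsigned bits)  { return wrd & ((1u << bits) - 1u); }
inline uint32_t highbits(uint32_t wrd, unsigned bits) { return wrd & ~((1u << bits) - 1u); }

// Clears the cards in the half-open range [start_card, end_card).
void clear_cards_sketch(std::vector<uint32_t>& card_table, size_t start_card, size_t end_card)
{
    if (start_card >= end_card)
        return;

    size_t start_word = card_word(start_card);
    size_t end_word = card_word(end_card);
    assert(end_word < card_table.size());

    if (start_word < end_word)
    {
        card_table[start_word] &= lowbits(~0u, card_bit(start_card));
        for (size_t i = start_word + 1; i < end_word; i++)
            card_table[i] = 0;

        unsigned bits = card_bit(end_card);
        if (bits != 0)
        {
            // Non-atomic read-modify-write: end_card's own bit is preserved,
            // but the whole word is read and then written back.
            card_table[end_word] &= highbits(~0u, bits);
        }
    }
    else
    {
        // Start and end cards share a word: clear only the bits in between.
        card_table[start_word] &= (lowbits(~0u, card_bit(start_card)) |
                                   highbits(~0u, card_bit(end_card)));
    }
}
```

The last write in the multi-word branch is the one that matters here: if another thread changes that word between the read and the write, its change is overwritten.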
now T0 writes back to `card_table[end_word]` with the 2nd card bit set.
it's not a problem when a card that shouldn't be set is set, given that its corresponding card bundle bit is also set. but it's definitely a problem if a card is set but its card bundle bit isn't. the next time we have a cross gen reference, what's supposed to happen in the write barrier is either the card isn't already set and the WB will set the card and its corresponding card bundle bit, or the card is set and the WB wouldn't do anything. but now we have a situation where the card is set but the card bundle bit isn't, which means the next GC that should be looking at this card wouldn't, unless some other card covered by that card bundle bit got newly set by the WB.
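
The interleaving above can be replayed deterministically in a single thread by splitting T0's update into its load and its store. This is only a model of the description: `CardState`, `replay_race`, `write_barrier_sketch`, `gc_would_visit` and the bit position chosen for the 2nd card are hypothetical names and values, and the write barrier is reduced to the two cases described above.

```cpp
#include <cassert>
#include <cstdint>

struct CardState
{
    uint32_t end_word = 0;    // card-table word just above the 2mb boundary
    bool bundle_bit = false;  // card bundle bit covering that word
};

// Hypothetical position of the 2nd card: one card above the boundary,
// so it is bit 1 of the first word past the boundary.
constexpr uint32_t kSecondCard = 1u << 1;

// Write barrier as described above: set the card and its bundle bit only
// when the card is not already set; otherwise do nothing.
void write_barrier_sketch(CardState& s, uint32_t card_mask)
{
    if ((s.end_word & card_mask) == 0)
    {
        s.end_word |= card_mask;
        s.bundle_bit = true;
    }
}

// Simplification: the GC only looks at cards whose bundle bit is set.
bool gc_would_visit(const CardState& s) { return s.bundle_bit; }

CardState replay_race()
{
    CardState s;
    s.end_word = kSecondCard;  // 2nd card initially set
    s.bundle_bit = true;       // and so is its bundle bit

    // T0: load half of `card_table[end_word] &= highbits(~0, bits)`.
    uint32_t t0_loaded = s.end_word;

    // T1: decides nothing under this bundle bit needs to stay set.
    s.end_word = 0;
    s.bundle_bit = false;

    // T0: store half, keeping the bits at and above the 2nd card.
    s.end_word = t0_loaded & ~(kSecondCard - 1u);
    return s;
}
```

After `replay_race()` the 2nd card is set while its bundle bit is clear. Calling `write_barrier_sketch` for that card then changes nothing, and `gc_would_visit` stays false, which is the missed-card situation described above.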
the cleanest fix is to make sure we don't step outside of the 2mb boundary when we call `clear_cards` in `card_transition`.
this issue was very hard to observe and debug. full credit goes to @ChrisAhna who also verified the fix.
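
A minimal sketch of the clamping idea follows. It is not the actual change in this PR: `clamped_end_card`, `touches_end_word`, the 256-byte card size, the 32-cards-per-word width and the zero-based addresses are all assumptions made for illustration. Under those assumptions a 2mb boundary falls on a card-word boundary, so a range clamped to it ends at bit 0 of a word and the clearing never has to read-modify-write the first word of the next stealing unit.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr size_t kCardSize = 256;                  // bytes per card (assumed)
constexpr size_t kCardWordWidth = 32;              // cards per card-table word (assumed)
constexpr size_t kStealingUnit = 2 * 1024 * 1024;  // 2mb card stealing unit

inline size_t card_of(uintptr_t addr) { return addr / kCardSize; }

// End of the card range to clear, clamped so that it never goes past the
// limit of the stealing unit the current thread owns.
size_t clamped_end_card(uintptr_t po, uintptr_t unit_limit)
{
    return card_of(std::min(po, unit_limit));
}

// True when clearing [start, end_card) would have to read-modify-write the
// word that contains end_card, i.e. when end_card is not word-aligned.
bool touches_end_word(size_t end_card)
{
    return (end_card % kCardWordWidth) != 0;
}
```

For example, with `unit_limit = kStealingUnit` and `po` one card above it, `touches_end_word(card_of(po))` is true, while `touches_end_word(clamped_end_card(po, unit_limit))` is false.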