Fix bug failing CS-RAA resend order on pending commitment signatures#3149
Conversation
|
Codecov ReportAttention: Patch coverage is
❗ Your organization needs to install the Codecov GitHub app to enable full functionality. Additional details and impacted files@@ Coverage Diff @@
## main #3149 +/- ##
==========================================
+ Coverage 89.81% 90.60% +0.79%
==========================================
Files 121 121
Lines 99314 105075 +5761
Branches 99314 105075 +5761
==========================================
+ Hits 89195 95205 +6010
+ Misses 7514 7313 -201
+ Partials 2605 2557 -48 ☔ View full report in Codecov by Sentry. |
fe7afcf to
536b6b4
Compare
TheBlueMatt
left a comment
There was a problem hiding this comment.
Thanks! Good test, too, though I need to go look at the test coverage changes. I assume this was hit in prod? We should get the rest of the async signing stuff upstream so we can spend some cycles fuzzing.
| let commitment_signed = if let Ok(update) = self.send_commitment_no_state_update(logger).map(|(cu, _)| cu) { | ||
| if self.context.signer_pending_commitment_update { | ||
| log_trace!(logger, "Commitment update generated: clearing signer_pending_commitment_update"); | ||
| self.context.signer_pending_commitment_update = false; |
There was a problem hiding this comment.
Hmmmm, I'm a bit skeptical of this. get_last_commitment_update_for_send is called in:
monitor_updating_restored, which implies to me I think maybe if the user is doing both async monitor updating and async signing they may end up double-sending CS if (a) an update to the channel generates a monitor and signing update, (b) the signer completes but before it callssigner_maybe_unblocked(c) the monitor persistence completes, causingget_last_commitment_update_for_sendto return a valid commitment which we send to our counterparty, then (d) the user callssigner_maybe_unblockedand we send it again, possibly causing the peer to FC on us.channel_reestablish, where I think we can hit a similar double-send when reconnecting to our counterparty where we (a) get a new update to sign, (b) the peer disconnects + reconnects, sending us theirchannel_reestablish, then (c) the user callssigner_maybe_unblockedwhich causes us to re-send the CS. This one may actually already be broken today, and implies we should be unsettingsigner_pending_commitment_updatewhen a peer disconnects.
In any case, we should almost certainly add a test for the two above cases.
There was a problem hiding this comment.
monitor_updating_restored, which implies to me I think maybe if the user is doing both async monitor updating and async signing they may end up double-sending CS if (a) an update to the channel generates a monitor and signing update, (b) the signer completes but before it calls signer_maybe_unblocked (c) the monitor persistence completes, causing get_last_commitment_update_for_send to return a valid commitment which we send to our counterparty, then (d) the user calls signer_maybe_unblocked and we send it again, possibly causing the peer to FC on us.
I added a test for this, and I was expecting it to fail and need to change things, but it worked. I realized, i think it won't double send because before the monitor update completes it never calls the signer method, so the pending flag is never set. Even though it works, I think I'll change it back, I can see how setting the flags whenever the signer method is called is less prone to accidental double sends.
channel_reestablish, where I think we can hit a similar double-send when reconnecting to our counterparty where we (a) get a new update to sign, (b) the peer disconnects + reconnects, sending us their channel_reestablish, then (c) the user calls signer_maybe_unblocked which causes us to re-send the CS. This one may actually already be broken today, and implies we should be unsetting signer_pending_commitment_update when a peer disconnects.
great catch! yea this fails with my change. changing it back!
alecchendev
left a comment
There was a problem hiding this comment.
I assume this was hit in prod?
Yep, have had a couple related issues here in the past
We should get the rest of the async signing stuff upstream so we can spend some cycles fuzzing.
SGTM!
| let commitment_signed = if let Ok(update) = self.send_commitment_no_state_update(logger).map(|(cu, _)| cu) { | ||
| if self.context.signer_pending_commitment_update { | ||
| log_trace!(logger, "Commitment update generated: clearing signer_pending_commitment_update"); | ||
| self.context.signer_pending_commitment_update = false; |
There was a problem hiding this comment.
monitor_updating_restored, which implies to me I think maybe if the user is doing both async monitor updating and async signing they may end up double-sending CS if (a) an update to the channel generates a monitor and signing update, (b) the signer completes but before it calls signer_maybe_unblocked (c) the monitor persistence completes, causing get_last_commitment_update_for_send to return a valid commitment which we send to our counterparty, then (d) the user calls signer_maybe_unblocked and we send it again, possibly causing the peer to FC on us.
I added a test for this, and I was expecting it to fail and need to change things, but it worked. I realized, i think it won't double send because before the monitor update completes it never calls the signer method, so the pending flag is never set. Even though it works, I think I'll change it back, I can see how setting the flags whenever the signer method is called is less prone to accidental double sends.
channel_reestablish, where I think we can hit a similar double-send when reconnecting to our counterparty where we (a) get a new update to sign, (b) the peer disconnects + reconnects, sending us their channel_reestablish, then (c) the user calls signer_maybe_unblocked which causes us to re-send the CS. This one may actually already be broken today, and implies we should be unsetting signer_pending_commitment_update when a peer disconnects.
great catch! yea this fails with my change. changing it back!
536b6b4 to
119b703
Compare
| fn do_test_async_commitment_signature_ordering(monitor_update_failure: bool, | ||
| enable_signer_before_monitor_completion: bool, enable_signer_before_reestablish: bool) { |
There was a problem hiding this comment.
after the latest additional tests this helper is a bit unwieldy, not sure the best way to address it. I could copy this test 4 times and get rid of the conditionals, but it adds a lot of filler...
There was a problem hiding this comment.
yaknow actually i realized for what these are testing it isn't necessary to include them in this test (they don't care about the CS-RAA ordering, just that the CS signer pending flag is cleared at the right time), i'll just move them to their own separate tests tomorrow
There was a problem hiding this comment.
okay, fixed this, moved them into the peer disconnection test
ca37547 to
221ef18
Compare
| revoke_and_ack, | ||
| funding_signed, | ||
| channel_ready, | ||
| order: self.context.resend_order.clone(), |
There was a problem hiding this comment.
Nit: clone() is not needed on order
| order: self.context.resend_order.clone(), | |
| order: self.context.resend_order, |
There was a problem hiding this comment.
hmm, i think it actually does require clone since RAACommitmentOrder doesn't implement Copy?
There was a problem hiding this comment.
OK. RAACommitmentOrder is a primitive enum, but it indeed does not have Copy trait. I tried it without and it worked, but to be safe for future changes, either the clone is needed or the enum should be marked explicitly Copy. It's fine with the clone.
TheBlueMatt
left a comment
There was a problem hiding this comment.
LGTM, feel free to squash the fixups.
| let bs_second_commitment_signed = get_htlc_update_msgs!(nodes[1], nodes[0].node.get_our_node_id()); | ||
| check_added_monitors!(nodes[1], 1); | ||
|
|
||
| // The rest of this is boilerplate for resolving the previous state. |
There was a problem hiding this comment.
At least some of this may be replaceable with one of the inscruitable commitment_signed_dance variants.
There was a problem hiding this comment.
i feared this day would come...
I tried my best, but didn't get very far, it seems this is a little different from the normal exchange so i'm not sure how much more time i should spend trying:
nodes0 nodes1
uah ->
cs ->
raa ->
<- raa
<- cs
cs ->
raa ->
<- raa
There was a problem hiding this comment.
I think the first few messages (after the uah) could be, but if you fought with it a bit I'm not gonna nitpick.
221ef18 to
315193d
Compare
|
Merging as the only changes since @optout21's "LGTM" are addressing his feedback. |
Across disconnects we may end up in a situation where we need to send a
commitment_signedand thenrevoke_and_ack. We need to make sure that if the signer is pending for CS but not RAA, we don't screw up the order by sending the RAA first. We defer sending the RAA by setting the flagsigner_pending_revoke_and_ack, which will lead to us generating an RAA uponsigner_unblocked.We test this for both the case where we send messages after a channel reestablish, as well as restoring a channel after persisting a monitor update asynchronously.