OCPEDGE-2436: Add is_standalone learner check test for two-node etcd disruption suite#30950
lucaconsalvi wants to merge 9 commits into openshift:main
Introduces the openshift/two-node-regression suite with 5 regression tests that validate podman-etcd resource agent behavior under disruptive conditions:

- OCP-88178: learner_node CRM attribute cleanup during stop/start
- OCP-88179: active resource count excludes stopping resources
- OCP-88180: simultaneous stop delay prevents WAL corruption
- OCP-88181: coordinated recovery after etcd container kill
- OCP-88213: attribute retry during force-new-cluster recovery

Also adds shared pacemaker/CRM utilities to utils/common.go and updates the openshift/two-node suite qualifier to exclude regression tests.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…tomatically skip these tests on SNO environments.
- Handle GetNodes error in AfterEach to prevent nil panic
- Scope verifyEtcdCloneStartedOnAllNodes to etcd-clone block only
- Add per-node pacemaker log baselines to prevent stale log matches
- Fail test on log retrieval errors instead of silently skipping
- Poll for learner_node attribute with Eventually for async monitor

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The podman kill command can fail if etcd is already stopped or restarting. Append '; true' to tolerate non-zero exit codes since the real assertion is whether the cluster recovers, not whether the kill command succeeded. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
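The effect of appending '; true' can be shown with a small sketch. It assumes a POSIX `sh` is available and uses the `false` utility as a stand-in for a failing kill command; `tolerant` and `runShell` are hypothetical helper names, not functions from the PR.

```go
package main

import (
	"errors"
	"fmt"
	"os/exec"
)

// tolerant appends "; true" so the shell exits 0 even when cmd fails.
func tolerant(cmd string) string {
	return cmd + "; true"
}

// runShell runs cmdLine through sh and returns the shell's exit code.
func runShell(cmdLine string) (int, error) {
	err := exec.Command("sh", "-c", cmdLine).Run()
	var exitErr *exec.ExitError
	if errors.As(err, &exitErr) {
		return exitErr.ExitCode(), nil
	}
	if err != nil {
		return -1, err // sh could not be started at all
	}
	return 0, nil
}

func main() {
	// "false" stands in for a kill command that fails because the
	// container is already stopped or restarting.
	strict, _ := runShell("false")
	lenient, _ := runShell(tolerant("false"))
	fmt.Println("strict exit:", strict, "lenient exit:", lenient)
}
```

The trade-off is that genuine failures of the command are hidden too, which is acceptable here only because cluster recovery is asserted separately.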
…de suite Move all 5 resilience tests from the separate tnf_resilience.go file into the existing Describe block in tnf_recovery.go, eliminating the need for a dedicated openshift/tnf-resilience suite. The tests now run as part of the openshift/two-node suite alongside the existing recovery tests. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Move the 5 etcd disruption tests from tnf_recovery.go into tnf_etcd_disruption.go with their own Describe, BeforeEach, and AfterEach blocks. This ensures the disruption cleanup only runs for disruption tests, not for recovery tests. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Adds a test that verifies the podman-etcd resource agent correctly identifies when the active peer is a learner (non-voter) and starts normally instead of joining as a learner itself, preventing a two-learner deadlock. The test uses standby/unstandby to trigger force_new_cluster recovery, then spoofs CRM attributes (standalone_node, learner_node, revision) to simulate a learner with a higher cached revision than the voter. The AfterEach cleanup is extended to handle the new attributes. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@lucaconsalvi: This pull request references OCPEDGE-2434, which is a valid Jira issue. Warning: the referenced Jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set.

@lucaconsalvi: This pull request references OCPEDGE-2436, which is a valid Jira issue. Warning: the referenced Jira issue has an invalid target version for the target branch this PR targets: expected the sub-task to target the "4.22.0" version, but no target version was set.
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull request has been approved by: lucaconsalvi. It still needs approval from an approver in each of the affected files.
Summary

- Adds a test for the `is_standalone()` learner detection logic in the podman-etcd resource agent (OCPEDGE-2434).
- Tests that `is_standalone()` correctly identifies that a learner peer with a higher revision should not cause the voter to join as a learner, preventing a two-learner deadlock.

Details
When a node restarts after `force_new_cluster` recovery and finds one active peer, the startup logic checks whether the peer is a learner (non-voter) via the `learner_node` CRM attribute. Without the fix, the code does not distinguish learners from voters when comparing revisions, so a learner with a cached higher revision could trick the voter into joining as a learner itself, creating a two-learner deadlock.
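The decision described above can be modeled in a few lines. This is only an illustrative sketch, not the resource agent's code (the agent is not written in Go): the `peer` type, the `joinsAsLearner` function, and the `checkLearner` switch are hypothetical, and the rule that a voter peer with a higher revision makes the node join as a learner is an assumption inferred from the description.

```go
package main

import "fmt"

// peer is a hypothetical model of the single active peer a restarting node sees.
type peer struct {
	revision  int64 // revision advertised via the CRM attribute
	isLearner bool  // true when the learner_node attribute names this peer
}

// joinsAsLearner reports whether the restarting node defers to the peer and
// joins as a learner. checkLearner toggles the fixed behavior.
func joinsAsLearner(localRevision int64, p peer, checkLearner bool) bool {
	if checkLearner && p.isLearner {
		// Fixed behavior: a learner peer is not a voter, so its cached
		// revision is ignored and the node starts normally.
		return false
	}
	// Assumed rule: defer to a peer that advertises a higher revision.
	return p.revision > localRevision
}

func main() {
	// The spoofed bug condition: a learner peer with a higher cached revision.
	spoofed := peer{revision: 20, isLearner: true}
	fmt.Println("without fix:", joinsAsLearner(10, spoofed, false))
	fmt.Println("with fix:   ", joinsAsLearner(10, spoofed, true))
}
```

Without the learner check, both nodes would end up as learners in this scenario, which is the deadlock the test guards against.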
The test uses standby/unstandby to trigger `force_new_cluster` recovery, then spoofs CRM attributes (`standalone_node`, `learner_node`, `revision`) to simulate the bug condition. It verifies the fix by checking for the "peer active but not a voter" log message in the pacemaker log and confirming both members recover as voting members.
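The pacemaker log check is only meaningful if earlier occurrences of the same message are ignored. A minimal sketch of a per-node baseline follows, assuming the log is available as a single string and the baseline is a line count. `countLines` and `containsAfterBaseline` are hypothetical helpers, not the PR's utilities, and the sample log lines are invented for the demo.

```go
package main

import (
	"fmt"
	"strings"
)

// countLines returns the number of lines in log and is used to record a baseline.
func countLines(log string) int {
	trimmed := strings.TrimRight(log, "\n")
	if trimmed == "" {
		return 0
	}
	return len(strings.Split(trimmed, "\n"))
}

// containsAfterBaseline reports whether msg appears in a line written after
// the first baseline lines of log.
func containsAfterBaseline(log string, baseline int, msg string) bool {
	lines := strings.Split(strings.TrimRight(log, "\n"), "\n")
	if baseline >= len(lines) {
		return false
	}
	for _, line := range lines[baseline:] {
		if strings.Contains(line, msg) {
			return true
		}
	}
	return false
}

func main() {
	const msg = "peer active but not a voter"
	// Invented sample content: a stale match left over from an earlier run.
	before := "startup\nold run: " + msg + "\n"
	baseline := countLines(before)

	after := before + "unrelated line\n"
	fmt.Println("stale match ignored:", !containsAfterBaseline(after, baseline, msg))

	after += "new run: " + msg + "\n"
	fmt.Println("fresh match found:", containsAfterBaseline(after, baseline, msg))
}
```

A line-count baseline breaks if the log is rotated between the two reads; a timestamp-based baseline would be one alternative.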
Test plan

- "peer active but not a voter" log message appears in the pacemaker log