Skip to content

OCPEDGE-2436: Add is_standalone learner check test for two-node etcd disruption suite#30950

Draft
lucaconsalvi wants to merge 9 commits intoopenshift:mainfrom
lucaconsalvi:tnf-is-standalone-test
Draft

OCPEDGE-2436: Add is_standalone learner check test for two-node etcd disruption suite#30950
lucaconsalvi wants to merge 9 commits intoopenshift:mainfrom
lucaconsalvi:tnf-is-standalone-test

Conversation

@lucaconsalvi
Copy link
Copy Markdown

Summary

Test

Test What it validates
should identify standalone voter correctly when peer is a learner with higher revision is_standalone() correctly identifies that a learner peer with a higher revision should not cause the voter to join as learner, preventing a two-learner deadlock

Details

When a node restarts after force_new_cluster recovery and finds one active peer, the
startup logic checks whether the peer is a learner (non-voter) via the learner_node CRM
attribute. Without the fix, the code does not distinguish learners from voters when
comparing revisions — a learner with a cached higher revision could trick the voter into
joining as a learner itself, creating a two-learner deadlock.

The test uses standby/unstandby to trigger force_new_cluster recovery, then spoofs CRM
attributes (standalone_node, learner_node, revision) to simulate the bug condition.
It verifies the fix by checking for the "peer active but not a voter" log message in the
pacemaker log and confirming both members recover as voting members.

Test plan

  • Validated on a live two-node fencing cluster with spoofed CRM attributes
  • Confirmed "peer active but not a voter" log message appears in pacemaker log
  • Confirmed both nodes recover as voting etcd members with no deadlock

Note: This PR depends on #30880 and must be merged after it. It adds a new test to
the tnf_etcd_disruption.go file introduced in that PR.

lucaconsalvi and others added 9 commits March 13, 2026 15:12
Introduces the openshift/two-node-regression suite with 5 regression tests
that validate podman-etcd resource agent behavior under disruptive conditions:

- OCP-88178: learner_node CRM attribute cleanup during stop/start
- OCP-88179: active resource count excludes stopping resources
- OCP-88180: simultaneous stop delay prevents WAL corruption
- OCP-88181: coordinated recovery after etcd container kill
- OCP-88213: attribute retry during force-new-cluster recovery

Also adds shared pacemaker/CRM utilities to utils/common.go and updates
the openshift/two-node suite qualifier to exclude regression tests.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…tomatically skip these tests on SNO environments.
- Handle GetNodes error in AfterEach to prevent nil panic
- Scope verifyEtcdCloneStartedOnAllNodes to etcd-clone block only
- Add per-node pacemaker log baselines to prevent stale log matches
- Fail test on log retrieval errors instead of silently skipping
- Poll for learner_node attribute with Eventually for async monitor

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The podman kill command can fail if etcd is already stopped or
restarting. Append '; true' to tolerate non-zero exit codes since
the real assertion is whether the cluster recovers, not whether
the kill command succeeded.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…de suite

Move all 5 resilience tests from the separate tnf_resilience.go file
into the existing Describe block in tnf_recovery.go, eliminating the
need for a dedicated openshift/tnf-resilience suite. The tests now run
as part of the openshift/two-node suite alongside the existing recovery
tests.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Move the 5 etcd disruption tests from tnf_recovery.go into
tnf_etcd_disruption.go with their own Describe, BeforeEach,
and AfterEach blocks. This ensures the disruption cleanup
only runs for disruption tests, not for recovery tests.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Adds a test that verifies the podman-etcd resource agent correctly
identifies when the active peer is a learner (non-voter) and starts
normally instead of joining as a learner itself, preventing a
two-learner deadlock.

The test uses standby/unstandby to trigger force_new_cluster recovery,
then spoofs CRM attributes (standalone_node, learner_node, revision)
to simulate a learner with a higher cached revision than the voter.
The AfterEach cleanup is extended to handle the new attributes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@openshift-ci-robot
Copy link
Copy Markdown

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: automatic mode

@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Apr 1, 2026

Important

Review skipped

Auto reviews are limited based on label configuration.

🚫 Review skipped — only excluded labels are configured. (1)
  • do-not-merge/work-in-progress

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 73136f8a-9812-4b78-84e0-41d9c276af45

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 1, 2026
@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci bot commented Apr 1, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@lucaconsalvi lucaconsalvi changed the title Tnf is standalone test OCPEDGE-2434: Add is_standalone learner check test for two-node etcd disruption suite Apr 1, 2026
@openshift-ci-robot
Copy link
Copy Markdown

openshift-ci-robot commented Apr 1, 2026

@lucaconsalvi: This pull request references OCPEDGE-2434 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set.

Details

In response to this:

Summary

Test

Test What it validates
should identify standalone voter correctly when peer is a learner with higher revision is_standalone() correctly identifies that a learner peer with a higher revision should not cause the voter to join as learner, preventing a two-learner deadlock

Details

When a node restarts after force_new_cluster recovery and finds one active peer, the
startup logic checks whether the peer is a learner (non-voter) via the learner_node CRM
attribute. Without the fix, the code does not distinguish learners from voters when
comparing revisions — a learner with a cached higher revision could trick the voter into
joining as a learner itself, creating a two-learner deadlock.

The test uses standby/unstandby to trigger force_new_cluster recovery, then spoofs CRM
attributes (standalone_node, learner_node, revision) to simulate the bug condition.
It verifies the fix by checking for the "peer active but not a voter" log message in the
pacemaker log and confirming both members recover as voting members.

Test plan

  • Validated on a live two-node fencing cluster with spoofed CRM attributes
  • Confirmed "peer active but not a voter" log message appears in pacemaker log
  • Confirmed both nodes recover as voting etcd members with no deadlock

Note: This PR depends on #30880 and must be merged after it. It adds a new test to
the tnf_etcd_disruption.go file introduced in that PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Apr 1, 2026
@lucaconsalvi lucaconsalvi changed the title OCPEDGE-2434: Add is_standalone learner check test for two-node etcd disruption suite OCPEDGE-2436: Add is_standalone learner check test for two-node etcd disruption suite Apr 1, 2026
@openshift-ci-robot
Copy link
Copy Markdown

openshift-ci-robot commented Apr 1, 2026

@lucaconsalvi: This pull request references OCPEDGE-2436 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the sub-task to target the "4.22.0" version, but no target version was set.

Details

In response to this:

Summary

Test

Test What it validates
should identify standalone voter correctly when peer is a learner with higher revision is_standalone() correctly identifies that a learner peer with a higher revision should not cause the voter to join as learner, preventing a two-learner deadlock

Details

When a node restarts after force_new_cluster recovery and finds one active peer, the
startup logic checks whether the peer is a learner (non-voter) via the learner_node CRM
attribute. Without the fix, the code does not distinguish learners from voters when
comparing revisions — a learner with a cached higher revision could trick the voter into
joining as a learner itself, creating a two-learner deadlock.

The test uses standby/unstandby to trigger force_new_cluster recovery, then spoofs CRM
attributes (standalone_node, learner_node, revision) to simulate the bug condition.
It verifies the fix by checking for the "peer active but not a voter" log message in the
pacemaker log and confirming both members recover as voting members.

Test plan

  • Validated on a live two-node fencing cluster with spoofed CRM attributes
  • Confirmed "peer active but not a voter" log message appears in pacemaker log
  • Confirmed both nodes recover as voting etcd members with no deadlock

Note: This PR depends on #30880 and must be merged after it. It adds a new test to
the tnf_etcd_disruption.go file introduced in that PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci bot commented Apr 1, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: lucaconsalvi
Once this PR has been reviewed and has the lgtm label, please assign jaypoulz for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants