OCPEDGE-2436: Add is_standalone learner check test for two-node etcd disruption suite by lucaconsalvi · Pull Request #30950 · openshift/origin

lucaconsalvi · 2026-04-01T16:25:17Z

Summary

Add test for is_standalone() learner detection logic in the podman-etcd resource agent
(OCPEDGE-2434)
Depends on OCPEDGE-2386: Add two-node-resilience test suite with 5 etcd resilience tests. #30880 — must be merged after it

Test

Test	What it validates
should identify standalone voter correctly when peer is a learner with higher revision	`is_standalone()` correctly identifies that a learner peer with a higher revision should not cause the voter to join as learner, preventing a two-learner deadlock

Details

When a node restarts after force_new_cluster recovery and finds one active peer, the
startup logic checks whether the peer is a learner (non-voter) via the learner_node CRM
attribute. Without the fix, the code does not distinguish learners from voters when
comparing revisions — a learner with a cached higher revision could trick the voter into
joining as a learner itself, creating a two-learner deadlock.

The test uses standby/unstandby to trigger force_new_cluster recovery, then spoofs CRM
attributes (standalone_node, learner_node, revision) to simulate the bug condition.
It verifies the fix by checking for the "peer active but not a voter" log message in the
pacemaker log and confirming both members recover as voting members.

Test plan

Validated on a live two-node fencing cluster with spoofed CRM attributes
Confirmed "peer active but not a voter" log message appears in pacemaker log
Confirmed both nodes recover as voting etcd members with no deadlock

Note: This PR depends on #30880 and must be merged after it. It adds a new test to
the tnf_etcd_disruption.go file introduced in that PR.

Introduces the openshift/two-node-regression suite with 5 regression tests that validate podman-etcd resource agent behavior under disruptive conditions: - OCP-88178: learner_node CRM attribute cleanup during stop/start - OCP-88179: active resource count excludes stopping resources - OCP-88180: simultaneous stop delay prevents WAL corruption - OCP-88181: coordinated recovery after etcd container kill - OCP-88213: attribute retry during force-new-cluster recovery Also adds shared pacemaker/CRM utilities to utils/common.go and updates the openshift/two-node suite qualifier to exclude regression tests. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…tomatically skip these tests on SNO environments.

- Handle GetNodes error in AfterEach to prevent nil panic - Scope verifyEtcdCloneStartedOnAllNodes to etcd-clone block only - Add per-node pacemaker log baselines to prevent stale log matches - Fail test on log retrieval errors instead of silently skipping - Poll for learner_node attribute with Eventually for async monitor Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

The podman kill command can fail if etcd is already stopped or restarting. Append '; true' to tolerate non-zero exit codes since the real assertion is whether the cluster recovers, not whether the kill command succeeded. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…de suite Move all 5 resilience tests from the separate tnf_resilience.go file into the existing Describe block in tnf_recovery.go, eliminating the need for a dedicated openshift/tnf-resilience suite. The tests now run as part of the openshift/two-node suite alongside the existing recovery tests. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Move the 5 etcd disruption tests from tnf_recovery.go into tnf_etcd_disruption.go with their own Describe, BeforeEach, and AfterEach blocks. This ensures the disruption cleanup only runs for disruption tests, not for recovery tests. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Adds a test that verifies the podman-etcd resource agent correctly identifies when the active peer is a learner (non-voter) and starts normally instead of joining as a learner itself, preventing a two-learner deadlock. The test uses standby/unstandby to trigger force_new_cluster recovery, then spoofs CRM attributes (standalone_node, learner_node, revision) to simulate a learner with a higher cached revision than the voter. The AfterEach cleanup is extended to handle the new attributes. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

openshift-ci-robot · 2026-04-01T16:25:20Z

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: automatic mode

coderabbitai · 2026-04-01T16:25:25Z

Important

Review skipped

Auto reviews are limited based on label configuration.

🚫 Review skipped — only excluded labels are configured. (1)

do-not-merge/work-in-progress

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 73136f8a-9812-4b78-84e0-41d9c276af45

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

🔍 Trigger review

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

openshift-ci · 2026-04-01T16:25:38Z

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

openshift-ci-robot · 2026-04-01T16:26:02Z

@lucaconsalvi: This pull request references OCPEDGE-2434 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.22.0" version, but no target version was set.

Details

In response to this:

Summary

Add test for is_standalone() learner detection logic in the podman-etcd resource agent
(OCPEDGE-2434)

Depends on OCPEDGE-2386: Add two-node-resilience test suite with 5 etcd resilience tests. #30880 — must be merged after it

Test

Test What it validates

should identify standalone voter correctly when peer is a learner with higher revision is_standalone() correctly identifies that a learner peer with a higher revision should not cause the voter to join as learner, preventing a two-learner deadlock

Details

When a node restarts after force_new_cluster recovery and finds one active peer, the
startup logic checks whether the peer is a learner (non-voter) via the learner_node CRM
attribute. Without the fix, the code does not distinguish learners from voters when
comparing revisions — a learner with a cached higher revision could trick the voter into
joining as a learner itself, creating a two-learner deadlock.

The test uses standby/unstandby to trigger force_new_cluster recovery, then spoofs CRM
attributes (standalone_node, learner_node, revision) to simulate the bug condition.
It verifies the fix by checking for the "peer active but not a voter" log message in the
pacemaker log and confirming both members recover as voting members.

Test plan

Validated on a live two-node fencing cluster with spoofed CRM attributes

Confirmed "peer active but not a voter" log message appears in pacemaker log

Confirmed both nodes recover as voting etcd members with no deadlock

Note: This PR depends on #30880 and must be merged after it. It adds a new test to
the tnf_etcd_disruption.go file introduced in that PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

openshift-ci-robot · 2026-04-01T16:26:24Z

@lucaconsalvi: This pull request references OCPEDGE-2436 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the sub-task to target the "4.22.0" version, but no target version was set.

Details

In response to this:

Summary

Add test for is_standalone() learner detection logic in the podman-etcd resource agent
(OCPEDGE-2434)

Depends on OCPEDGE-2386: Add two-node-resilience test suite with 5 etcd resilience tests. #30880 — must be merged after it

Test

Test What it validates

should identify standalone voter correctly when peer is a learner with higher revision is_standalone() correctly identifies that a learner peer with a higher revision should not cause the voter to join as learner, preventing a two-learner deadlock

Details

When a node restarts after force_new_cluster recovery and finds one active peer, the
startup logic checks whether the peer is a learner (non-voter) via the learner_node CRM
attribute. Without the fix, the code does not distinguish learners from voters when
comparing revisions — a learner with a cached higher revision could trick the voter into
joining as a learner itself, creating a two-learner deadlock.

The test uses standby/unstandby to trigger force_new_cluster recovery, then spoofs CRM
attributes (standalone_node, learner_node, revision) to simulate the bug condition.
It verifies the fix by checking for the "peer active but not a voter" log message in the
pacemaker log and confirming both members recover as voting members.

Test plan

Validated on a live two-node fencing cluster with spoofed CRM attributes

Confirmed "peer active but not a voter" log message appears in pacemaker log

Confirmed both nodes recover as voting etcd members with no deadlock

Note: This PR depends on #30880 and must be merged after it. It adds a new test to
the tnf_etcd_disruption.go file introduced in that PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

openshift-ci · 2026-04-01T16:27:03Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: lucaconsalvi
Once this PR has been reviewed and has the lgtm label, please assign jaypoulz for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

test/extended/two_node/OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

lucaconsalvi and others added 9 commits March 13, 2026 15:12

Rename suite from two-node-regression to tnf-resilience

00cf165

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Add [Skipped:SingleReplicaTopology] label to the Describe block to au…

778acd8

…tomatically skip these tests on SNO environments.

Solve formatting problems.

c4b91d7

openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 1, 2026

lucaconsalvi changed the title ~~Tnf is standalone test~~ OCPEDGE-2434: Add is_standalone learner check test for two-node etcd disruption suite Apr 1, 2026

openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Apr 1, 2026

lucaconsalvi changed the title ~~OCPEDGE-2434: Add is_standalone learner check test for two-node etcd disruption suite~~ OCPEDGE-2436: Add is_standalone learner check test for two-node etcd disruption suite Apr 1, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

OCPEDGE-2436: Add is_standalone learner check test for two-node etcd disruption suite#30950

OCPEDGE-2436: Add is_standalone learner check test for two-node etcd disruption suite#30950
lucaconsalvi wants to merge 9 commits intoopenshift:mainfrom
lucaconsalvi:tnf-is-standalone-test

lucaconsalvi commented Apr 1, 2026

Uh oh!

openshift-ci-robot commented Apr 1, 2026

Uh oh!

coderabbitai bot commented Apr 1, 2026 •

edited

Loading

Review skipped

Uh oh!

openshift-ci bot commented Apr 1, 2026

Uh oh!

openshift-ci-robot commented Apr 1, 2026 •

edited by openshift-ci bot

Loading

Summary

Test

Details

Test plan

Uh oh!

openshift-ci-robot commented Apr 1, 2026 •

edited by openshift-ci bot

Loading

Summary

Test

Details

Test plan

Uh oh!

openshift-ci bot commented Apr 1, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

lucaconsalvi commented Apr 1, 2026

Summary

Test

Details

Test plan

Uh oh!

openshift-ci-robot commented Apr 1, 2026

Uh oh!

coderabbitai bot commented Apr 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review skipped

Uh oh!

openshift-ci bot commented Apr 1, 2026

Uh oh!

openshift-ci-robot commented Apr 1, 2026 • edited by openshift-ci bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test

Details

Test plan

Uh oh!

openshift-ci-robot commented Apr 1, 2026 • edited by openshift-ci bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Test

Details

Test plan

Uh oh!

openshift-ci bot commented Apr 1, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

coderabbitai bot commented Apr 1, 2026 •

edited

Loading

openshift-ci-robot commented Apr 1, 2026 •

edited by openshift-ci bot

Loading

openshift-ci-robot commented Apr 1, 2026 •

edited by openshift-ci bot

Loading