
OCPBUGS-77949 OCPBUGS-77948 OCPBUGS-78298: TNF node replacement test updates#30846

Open
jaypoulz wants to merge 4 commits into openshift:main from jaypoulz:tnf-node-replacement-fixes

Conversation

@jaypoulz
Contributor

@jaypoulz jaypoulz commented Mar 6, 2026

  • Tightens up timeouts in the core test loop
  • Fixes podman-etcd logging to produce human-readable output
  • Fixes a bug in IPv6 address formatting in URLs
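The IPv6 URL fix is worth illustrating: a bare host:port format string produces an invalid URL for an IPv6 literal, which must be bracketed. A minimal sketch using Go's standard library (the redfishAuthority helper name is illustrative, not taken from this PR):

```go
package main

import (
	"fmt"
	"net"
)

// redfishAuthority builds a host:port authority, bracketing IPv6
// literals as RFC 3986 requires. A naive fmt.Sprintf("%s:%s", host, port)
// would yield an unparsable URL for IPv6 addresses.
func redfishAuthority(host, port string) string {
	return net.JoinHostPort(host, port)
}

func main() {
	// IPv4: no brackets needed.
	fmt.Println(redfishAuthority("192.0.2.10", "8000"))
	// IPv6: JoinHostPort adds the brackets automatically.
	fmt.Println(redfishAuthority("fd2e:6f44:5dd8::5", "8000"))
	// Substituted into a Redfish-style BMC address (path is illustrative).
	fmt.Printf("https://%s/redfish/v1/Systems/%s\n",
		redfishAuthority("fd2e:6f44:5dd8::5", "8000"), "uuid-1234")
}
```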

@openshift-ci-robot

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after the lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will use /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To manually trigger all second-stage jobs, use the /pipeline required command.

This repository is configured in: automatic mode

@openshift-ci-robot added the jira/valid-reference label (indicates that this PR references a valid Jira ticket of any type) and the jira/invalid-bug label (indicates that a referenced Jira bug is invalid for the branch this PR is targeting) on Mar 6, 2026
@openshift-ci-robot

@jaypoulz: This pull request references Jira Issue OCPBUGS-77949, which is invalid:

  • expected the bug to target the "4.22.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

  • Tightens up timeouts in core test loop
  • Fixes podman-etcd logging to feature human-readable output
  • Fixes a bug with IPv6 IP address formatting in URL

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@coderabbitai

coderabbitai bot commented Mar 6, 2026

Important

Review skipped

Auto reviews are limited based on label configuration.

🚫 Review skipped — only excluded labels are configured. (1)
  • do-not-merge/work-in-progress

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: c12cda83-b335-4741-8f05-c89205679a45

You can disable this status message by setting reviews.review_status to false in the CodeRabbit configuration file.


Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


Walkthrough

Migrates two-node tests to dynamic API clients, parameterizes BareMetalHost Redfish authority, enhances node-replacement flows (CSR approval, readiness gating, finalizer/force-delete), adds job pod log dumping, pacemaker full-status debug, and UTF‑8‑aware SSH log truncation.

Changes

Cohort / File(s) — Summary

  • Test data / templates — test/extended/testdata/two_node/baremetalhost-template.yaml, test/extended/testdata/bindata.go: Updated the BareMetalHost bmc.address to use {REDFISH_AUTHORITY}/redfish/v1/Systems/{UUID} (replacing the IP:port form); regenerated bindata.
  • Node replacement test — test/extended/two_node/tnf_node_replacement.go: Large rewrite: migrated many oc-based CRUD operations to dynamic API client GVRs, added CSR approval/waiting, operator/webhook readiness gating, force-delete with finalizer removal and webhook cleanup, parallelized update-setup flows, extended timeouts, and added diagnostics/logging.
  • APIs — BareMetalHost — test/extended/two_node/utils/apis/baremetalhost.go: Added BMHGVR and dynamic-client get/list helpers (getBMHDynamic); converted BMH/secret lookups and state/error retrievals to API-based implementations.
  • APIs — Machine — test/extended/two_node/utils/apis/machine.go: New MachineGVR, MachineStatus type, and dynamic-client helpers: GetMachineStatus, MachineExists, GetMachineYAML.
  • APIs — CSR — test/extended/two_node/utils/apis/csr.go: Added NodeCSRUsernamePrefix and CSR utilities: LogNodeCSRStatus, HasApprovedNodeCSR, ApproveCSRs (monitoring/auto-approve with timeout).
  • Services — etcd / jobs — test/extended/two_node/utils/services/etcd.go: Added DumpJobPodLogs, label-based discovery for TNF update-setup jobs, and node-targeted and survivor-aware WaitFor*UpdateSetupJobCompletion variants; pod logs are now dumped after job completion.
  • Services — pacemaker — test/extended/two_node/utils/services/pacemaker.go: Added PcsStatusFullViaDebug; WaitForNodesOnline now tolerates transient errors during polling and logs full pacemaker status on timeout/failure.
  • Core SSH utilities — test/extended/two_node/utils/core/ssh.go: Introduced UTF‑8‑aware truncation (maxLogOutputBytes, truncateForLog) and applied it to stdout/stderr logging, with byte counts when truncated.
  • Common utilities — test/extended/two_node/utils/common.go: Switched TryPacemakerCleanup to the fuller PcsStatusFullViaDebug status retrieval.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Test Structure And Quality — ⚠️ Warning: getUpdateSetupJobNameForNode returns the first job from an unsorted LIST without sorting by CreationTimestamp, risking selection of stale jobs due to non-deterministic Kubernetes LIST ordering. Resolution: sort update-setup jobs by CreationTimestamp and select the newest job; add meaningful assertion failure messages in utility functions for better test diagnostics.
✅ Passed checks (4 passed)
  • Description Check — ✅ Passed: Check skipped; CodeRabbit’s high-level summary is enabled.
  • Title Check — ✅ Passed: The title directly summarizes the main changes: TNF node replacement test updates addressing multiple bugs, with specific Jira ticket references.
  • Docstring Coverage — ✅ Passed: Docstring coverage is 93.94%, which is sufficient; the required threshold is 80.00%.
  • Stable And Deterministic Test Names — ✅ Passed: The PR modifies test implementation files and utilities; the only test title added/modified is stable and deterministic, with no dynamic information.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

@jaypoulz
Contributor Author

jaypoulz commented Mar 6, 2026

/jira refresh

@openshift-ci-robot added the jira/valid-bug label (indicates that a referenced Jira bug is valid for the branch this PR is targeting) and removed the jira/invalid-bug label on Mar 6, 2026
@openshift-ci-robot

@jaypoulz: This pull request references Jira Issue OCPBUGS-77949, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.22.0) matches configured target version for branch (4.22.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)

No GitHub users were found matching the public email listed for the QA contact in Jira (dhensel@redhat.com), skipping review request.

Details

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot

@jaypoulz: This pull request references Jira Issue OCPBUGS-77949, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.22.0) matches configured target version for branch (4.22.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

No GitHub users were found matching the public email listed for the QA contact in Jira (dhensel@redhat.com), skipping review request.

Details

In response to this:

  • Tightens up timeouts in core test loop
  • Fixes podman-etcd logging to feature human-readable output
  • Fixes a bug with IPv6 IP address formatting in URL

Summary by CodeRabbit

Release Notes

  • Tests
  • Enhanced node replacement test reliability with improved concurrency and timeout handling for job waits
  • Strengthened resource cleanup logic with finalizer-based deletion support
  • Expanded debugging capabilities with verbose pacemaker status logging and extended error reporting
  • Improved SSH command output logging with automatic truncation to prevent excessive log sizes
  • Added automatic pod log capture for job completion monitoring

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/extended/two_node/utils/core/ssh.go`:
- Around line 136-141: The current log truncation slices the UTF-8 string bytes
with out[:maxLogOutputBytes], which can cut a multi-byte rune and produce
invalid UTF-8 in logs; update the truncation logic around stdout.String()/out
and the e2e.Logf call to perform rune-safe truncation (for example, convert to
runes or iterate runes until adding the next rune would exceed
maxLogOutputBytes) and then log the safely truncated string along with the total
byte length using the existing maxLogOutputBytes and e2e.Logf call sites.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 3ecbedc9-3c0f-4263-b6e6-a6a05314b2c2

📥 Commits

Reviewing files that changed from the base of the PR and between 35bab74 and 547c9eb.

📒 Files selected for processing (6)
  • test/extended/testdata/two_node/baremetalhost-template.yaml
  • test/extended/two_node/tnf_node_replacement.go
  • test/extended/two_node/utils/common.go
  • test/extended/two_node/utils/core/ssh.go
  • test/extended/two_node/utils/services/etcd.go
  • test/extended/two_node/utils/services/pacemaker.go

Contributor

@eggfoobar eggfoobar left a comment


Looking great, just had some small suggestions.

@jaypoulz force-pushed the tnf-node-replacement-fixes branch from 547c9eb to 0816c76 on March 10, 2026 at 18:55
@jaypoulz
Contributor Author

/payload-job periodic-ci-openshift-release-main-nightly-4.22-e2e-metal-ovn-two-node-fencing-recovery-techpreview

@openshift-ci
Contributor

openshift-ci bot commented Mar 10, 2026

@jaypoulz: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-main-nightly-4.22-e2e-metal-ovn-two-node-fencing-recovery-techpreview

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/c38d4d90-1cb2-11f1-8125-bec258656377-0


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/extended/two_node/tnf_node_replacement.go`:
- Around line 1273-1280: The wait uses a hardcoded minPodCreationTime
(time.Now().Add(-2 * time.Minute)) which can include pre-Ready stale pods;
change waitForNodeRecovery to return the node Ready timestamp (e.g., readyTime)
and use that exact timestamp here instead of time.Now().Add(...), passing the
returned readyTime as minPodCreationTime into
services.WaitForSurvivorUpdateSetupJobCompletion (and the symmetric
WaitForTargetUpdateSetupJobCompletion) so the waits are gated on the node Ready
time rather than an approximate clock offset.
- Around line 1337-1354: The current attempt spawns oc.AsAdmin().Run("delete")
in a goroutine and uses time.After(deleteAttemptTimeout), which leaves the
delete running if the timer fires; replace that pattern with a per-attempt
cancelable context so each delete is actually bounded: inside the
RetryWithOptions callback create ctx, cancel :=
context.WithTimeout(context.Background(), deleteAttemptTimeout) and defer
cancel(), then invoke the delete command with that context (e.g.,
oc.AsAdmin().Run("delete").Args(resourceType, resourceName, "-n",
namespace).WithContext(ctx).Output() or the project’s equivalent Run/Output
method that accepts a context), remove the extra goroutine and select, capture
and log the returned error from the cancelable delete call, and then use
ocResourceExists(oc, resourceType, resourceName, namespace) to decide
success/failure as before.
- Around line 1466-1468: The call to core.RetryOptions in
waitForEtcdResourceToStop is ignoring the function's timeout parameter and
hardcodes threeMinuteTimeout, preventing callers from controlling the deadline;
change the RetryOptions Timeout to use the function's timeout argument (the
timeout parameter of waitForEtcdResourceToStop) instead of threeMinuteTimeout,
and ensure any associated log message that references the timeout reflects the
passed-in timeout value so logs match behavior.
- Around line 1378-1393: The function forceDeleteOcResourceByRemovingFinalizers
currently returns nil even when the confirm loop times out, causing callers to
assume deletion succeeded; change the final branch so that after the timeout
(where it currently logs the WARNING) the function returns a non-nil error
(e.g., fmt.Errorf with context including resourceType, resourceName and
forceDeleteConfirmTimeout) instead of nil so callers see the failure and can
handle the retry/error path; update the log call in
forceDeleteOcResourceByRemovingFinalizers to include the same error context when
returning.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: fce9d0e8-a5c5-480a-b91a-17f85f92e721

📥 Commits

Reviewing files that changed from the base of the PR and between 547c9eb and 0816c76.

📒 Files selected for processing (6)
  • test/extended/testdata/two_node/baremetalhost-template.yaml
  • test/extended/two_node/tnf_node_replacement.go
  • test/extended/two_node/utils/common.go
  • test/extended/two_node/utils/core/ssh.go
  • test/extended/two_node/utils/services/etcd.go
  • test/extended/two_node/utils/services/pacemaker.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • test/extended/testdata/two_node/baremetalhost-template.yaml

@jaypoulz changed the title from "OCPBUGS-77949: OCPBUGS-77948: TNF node replacement test updates" to "OCPBUGS-77949 OCPBUGS-77948 OCPBUGS-78298: TNF node replacement test updates" on Mar 11, 2026
@openshift-ci-robot

@jaypoulz: This pull request references Jira Issue OCPBUGS-77949, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.22.0) matches configured target version for branch (4.22.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

No GitHub users were found matching the public email listed for the QA contact in Jira (dhensel@redhat.com), skipping review request.

The bug has been updated to refer to the pull request using the external bug tracker.

This pull request references Jira Issue OCPBUGS-77948, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.22.0) matches configured target version for branch (4.22.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

No GitHub users were found matching the public email listed for the QA contact in Jira (dhensel@redhat.com), skipping review request.

The bug has been updated to refer to the pull request using the external bug tracker.

This pull request references Jira Issue OCPBUGS-78298, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.22.0) matches configured target version for branch (4.22.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

No GitHub users were found matching the public email listed for the QA contact in Jira (dhensel@redhat.com), skipping review request.

The bug has been updated to refer to the pull request using the external bug tracker.

Details

In response to this:

  • Tightens up timeouts in core test loop
  • Fixes podman-etcd logging to feature human-readable output
  • Fixes a bug with IPv6 IP address formatting in URL

Summary by CodeRabbit

  • Tests
  • Improved node-replacement reliability with longer, per-operation timeouts and parallelized waits
  • Enhanced cleanup and force-delete resilience for test resources
  • Added automatic pod log capture after job completion and safer SSH output truncation to limit log size
  • Expanded pacemaker/status debugging with fuller status dumps on failures
  • Updated test templates to use a Redfish authority-style BMC address format

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@jaypoulz
Contributor Author

/payload-job periodic-ci-openshift-release-main-nightly-4.22-e2e-metal-ovn-two-node-fencing-recovery-techpreview

@openshift-ci
Contributor

openshift-ci bot commented Mar 11, 2026

@jaypoulz: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-main-nightly-4.22-e2e-metal-ovn-two-node-fencing-recovery-techpreview

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/215be940-1d8a-11f1-85b1-c121c92478a0-0

@openshift-ci-robot

@jaypoulz: This pull request references Jira Issue OCPBUGS-77949, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.22.0) matches configured target version for branch (4.22.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

No GitHub users were found matching the public email listed for the QA contact in Jira (dhensel@redhat.com), skipping review request.

This pull request references Jira Issue OCPBUGS-77948, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.22.0) matches configured target version for branch (4.22.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

No GitHub users were found matching the public email listed for the QA contact in Jira (dhensel@redhat.com), skipping review request.

This pull request references Jira Issue OCPBUGS-78298, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.22.0) matches configured target version for branch (4.22.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

No GitHub users were found matching the public email listed for the QA contact in Jira (dhensel@redhat.com), skipping review request.

Details

In response to this:

  • Tightens up timeouts in core test loop
  • Fixes podman-etcd logging to feature human-readable output
  • Fixes a bug with IPv6 IP address formatting in URL

Summary by CodeRabbit

  • Tests
  • Improved node-replacement reliability with longer, per-operation timeouts and parallelized waits
  • Enhanced cleanup and force-delete resilience for test resources, including finalizer-based force-delete helpers
  • Added automatic pod log capture after job completion and new job-by-node wait helpers
  • Safer SSH output truncation for logs to limit size while preserving UTF-8 boundaries
  • Expanded pacemaker/status debugging with fuller status dumps on failures
  • Updated test templates to use a Redfish authority-style BMC address format

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
test/extended/two_node/tnf_node_replacement.go (1)

589-598: ⚠️ Potential issue | 🟡 Minor

Timeout documentation is inconsistent with implementation.

Line 589 says “overall 30-minute timeout”, but Line 597 sets 20 * time.Minute. Please align the comment/value to avoid misleading recovery logs.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/extended/two_node/tnf_node_replacement.go` around lines 589 - 598, The
comment above recoverClusterFromBackup says "Has an overall 30-minute timeout"
but the implementation sets const recoveryTimeout = 20 * time.Minute; update
either the comment or the recoveryTimeout constant so they match (e.g., change
the comment to "20-minute timeout" or set recoveryTimeout = 30 * time.Minute)
and ensure the descriptive log/comment near recoverClusterFromBackup and the
recoveryTimeout constant stay consistent.
♻️ Duplicate comments (4)
test/extended/two_node/tnf_node_replacement.go (4)

1511-1513: ⚠️ Potential issue | 🟡 Minor

Honor the function timeout parameter.

Line 1512 hardcodes threeMinuteTimeout, so the timeout argument to waitForEtcdResourceToStop is ignored.

🛠️ Minimal fix
 	}, core.RetryOptions{
-		Timeout:      threeMinuteTimeout,
+		Timeout:      timeout,
 		PollInterval: utils.FiveSecondPollInterval,
 	}, fmt.Sprintf("etcd stop on %s", testConfig.SurvivingNode.Name))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/extended/two_node/tnf_node_replacement.go` around lines 1511 - 1513, The
call to configure core.RetryOptions in waitForEtcdResourceToStop is ignoring the
function's timeout parameter by hardcoding threeMinuteTimeout; change the
RetryOptions.Timeout field to use the function's timeout parameter (named
timeout) instead of threeMinuteTimeout so the provided timeout is honored (keep
PollInterval as utils.FiveSecondPollInterval and preserve the surrounding call
in waitForEtcdResourceToStop).

1277-1292: ⚠️ Potential issue | 🟠 Major

Gate survivor job timing on actual Ready time, not a fixed offset.

Line 1277 uses time.Now().Add(-2 * time.Minute), which can still include stale pre-Ready runs or exclude valid runs depending on timing drift. Use the exact replacement-node Ready timestamp captured from waitForNodeRecovery.

🛠️ Minimal direction
- minPodCreationTime := time.Now().Add(-2 * time.Minute)
+ minPodCreationTime := replacementNodeReadyTime
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/extended/two_node/tnf_node_replacement.go` around lines 1277 - 1292,
Replace the fixed minPodCreationTime (currently set to time.Now().Add(-2 *
time.Minute)) with the actual replacement-node Ready timestamp returned by
waitForNodeRecovery (use that Ready time as the min creation time); update the
variables passed into services.WaitForSurvivorUpdateSetupJobCompletionByNode and
services.WaitForUpdateSetupJobCompletionByNode to use that replacementReadyTime
(or equivalent field on testConfig.TargetNode) so both waits are gated on the
node's real Ready time rather than a hardcoded offset.

1411-1412: ⚠️ Potential issue | 🟠 Major

Do not return success while the resource still exists.

After force-delete confirmation times out, Line 1412 still returns nil, which lets callers proceed as if cleanup succeeded.

🛠️ Minimal fix
-	e2e.Logf("WARNING: %s %s still present after %v (patch was applied; it may disappear shortly)", resourceType, resourceName, forceDeleteConfirmTimeout)
-	return nil
+	return fmt.Errorf("%s %s still present after %v even after finalizer patch", resourceType, resourceName, forceDeleteConfirmTimeout)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/extended/two_node/tnf_node_replacement.go` around lines 1411 - 1412, The
current code logs a warning when the resource still exists after
forceDeleteConfirmTimeout but then returns nil, signaling success; change the
behavior in the function where this occurs (referencing resourceType,
resourceName, forceDeleteConfirmTimeout and e2e.Logf) to return a non-nil error
instead of nil (e.g., a formatted error describing the resource still present)
so callers do not treat cleanup as successful; ensure fmt (or errors) is
imported and use fmt.Errorf to construct the error message.

1342-1354: ⚠️ Potential issue | 🟠 Major

deleteAttemptTimeout is not truly enforced per attempt.

When Line 1352 times out, the goroutine running oc delete keeps executing in the background. Retries can overlap and race each other.

#!/bin/bash
set -euo pipefail

# Verify whether exutil CLI supports context-bound command execution
fd client.go
rg -n -C3 'type CLI struct' test/extended/util/client.go
rg -n -C4 'func \(.*CLI.*\) Run\(' test/extended/util/client.go
rg -n -C4 'func \(.*CLI.*\) Output\(' test/extended/util/client.go
rg -n -C4 'WithContext|context\.Context' test/extended/util/client.go

If context-bound command execution is unavailable, use a client-go delete path with context timeout where possible.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/extended/two_node/tnf_node_replacement.go` around lines 1342 - 1354, The
deleteAttemptTimeout isn't canceling the in-flight oc delete goroutine, so
retries can overlap; modify the deletion to be context-aware: create a context
with timeout (based on deleteAttemptTimeout) and use a context-bound API (either
pass ctx into oc.AsAdmin().Run(...).Output() if that method supports contexts,
or replace this path with a client-go delete call that accepts ctx) or execute
the command via an exec path that supports CommandContext so the process is
killed when the context times out; ensure the goroutine returns on context
cancellation and send the final error into the done channel only when the ctx
isn't canceled to avoid races between overlapping attempts.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/extended/two_node/utils/services/etcd.go`:
- Around line 356-369: The function getUpdateSetupJobNameForNode currently
returns the first job matching nodeName from the list (label selector
tnfUpdateSetupJobLabelSelector), which is nondeterministic; instead, filter
list.Items for Spec.Template.Spec.NodeName == nodeName, then choose the item
with the largest CreationTimestamp (newest) and return its Name; update the
function to iterate to collect matches, compare metav1.Time (or
.CreationTimestamp) to pick the latest job, and return that job's name (or
""/error if none).

---

Outside diff comments:
In `@test/extended/two_node/tnf_node_replacement.go`:
- Around line 589-598: The comment above recoverClusterFromBackup says "Has an
overall 30-minute timeout" but the implementation sets const recoveryTimeout =
20 * time.Minute; update either the comment or the recoveryTimeout constant so
they match (e.g., change the comment to "20-minute timeout" or set
recoveryTimeout = 30 * time.Minute) and ensure the descriptive log/comment near
recoverClusterFromBackup and the recoveryTimeout constant stay consistent.

---

Duplicate comments:
In `@test/extended/two_node/tnf_node_replacement.go`:
- Around line 1511-1513: The call to configure core.RetryOptions in
waitForEtcdResourceToStop is ignoring the function's timeout parameter by
hardcoding threeMinuteTimeout; change the RetryOptions.Timeout field to use the
function's timeout parameter (named timeout) instead of threeMinuteTimeout so
the provided timeout is honored (keep PollInterval as
utils.FiveSecondPollInterval and preserve the surrounding call in
waitForEtcdResourceToStop).
- Around line 1277-1292: Replace the fixed minPodCreationTime (currently set to
time.Now().Add(-2 * time.Minute)) with the actual replacement-node Ready
timestamp returned by waitForNodeRecovery (use that Ready time as the min
creation time); update the variables passed into
services.WaitForSurvivorUpdateSetupJobCompletionByNode and
services.WaitForUpdateSetupJobCompletionByNode to use that replacementReadyTime
(or equivalent field on testConfig.TargetNode) so both waits are gated on the
node's real Ready time rather than a hardcoded offset.
- Around line 1411-1412: The current code logs a warning when the resource still
exists after forceDeleteConfirmTimeout but then returns nil, signaling success;
change the behavior in the function where this occurs (referencing resourceType,
resourceName, forceDeleteConfirmTimeout and e2e.Logf) to return a non-nil error
instead of nil (e.g., a formatted error describing the resource still present)
so callers do not treat cleanup as successful; ensure fmt (or errors) is
imported and use fmt.Errorf to construct the error message.
- Around line 1342-1354: The deleteAttemptTimeout isn't canceling the in-flight
oc delete goroutine, so retries can overlap; modify the deletion to be
context-aware: create a context with timeout (based on deleteAttemptTimeout) and
use a context-bound API (either pass ctx into oc.AsAdmin().Run(...).Output() if
that method supports contexts, or replace this path with a client-go delete call
that accepts ctx) or execute the command via an exec path that supports
CommandContext so the process is killed when the context times out; ensure the
goroutine returns on context cancellation and send the final error into the done
channel only when the ctx isn't canceled to avoid races between overlapping
attempts.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 01a5ae69-4452-435c-803f-8baa3fa28e0a

📥 Commits

Reviewing files that changed from the base of the PR and between 0816c76 and 5cba2a7.

📒 Files selected for processing (2)
  • test/extended/two_node/tnf_node_replacement.go
  • test/extended/two_node/utils/services/etcd.go

@jaypoulz jaypoulz force-pushed the tnf-node-replacement-fixes branch from 5cba2a7 to f1c410d Compare March 13, 2026 17:52
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

♻️ Duplicate comments (4)
test/extended/two_node/tnf_node_replacement.go (3)

1443-1454: ⚠️ Potential issue | 🟠 Major

Use the actual Ready timestamp here.

time.Now().Add(-2 * time.Minute) can still admit a pre-Ready survivor job or miss a valid run that started immediately after the node became Ready. Plumb the exact Ready transition time out of waitForNodeRecovery and use that for minPodCreationTime.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/extended/two_node/tnf_node_replacement.go` around lines 1443 - 1454,
Replace the hardcoded minPodCreationTime (time.Now().Add(-2 * time.Minute)) with
the actual node Ready transition timestamp returned by waitForNodeRecovery:
modify waitForNodeRecovery to return the Ready time (e.g., readyTime time.Time),
propagate that return value to where minPodCreationTime is set, and pass that
readyTime into services.WaitForSurvivorUpdateSetupJobCompletionByNode (and any
other callers) so the poll uses the exact Ready timestamp instead of an
approximate Now()-based value.

1733-1735: ⚠️ Potential issue | 🟡 Minor

Honor the caller-provided timeout.

This function logs the timeout argument but still hardcodes threeMinuteTimeout, so callers cannot actually control the deadline.

Suggested fix
 	}, core.RetryOptions{
-		Timeout:      threeMinuteTimeout,
+		Timeout:      timeout,
 		PollInterval: utils.FiveSecondPollInterval,
 	}, fmt.Sprintf("etcd stop on %s", testConfig.SurvivingNode.Name))
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/extended/two_node/tnf_node_replacement.go` around lines 1733 - 1735, The
RetryOptions block is ignoring the caller-provided timeout by hardcoding
threeMinuteTimeout; replace the hardcoded value with the function's timeout
parameter (ensure the parameter named timeout is used for RetryOptions.Timeout
and its type matches time.Duration) in the RetryOptions passed where
RetryOptions{ Timeout: threeMinuteTimeout, PollInterval:
utils.FiveSecondPollInterval } is set so callers can control the deadline (also
remove any misleading log text if it still references a different value).

1571-1581: ⚠️ Potential issue | 🟠 Major

Don't return success while the old object is still present.

If this confirm loop expires, callers continue into recreate logic even though the old BMH/Machine may still exist. Return an error here so the retry/error path handles it instead of racing the stale object.

Suggested fix
 	for time.Now().Before(deadline) {
 		if !resourceExists(ctx, dyn, gvr, resourceName, namespace) {
 			e2e.Logf("Resource %s %s confirmed gone", resourceType, resourceName)
 			return nil
 		}
 		time.Sleep(forceDeleteConfirmInterval)
 	}
-	e2e.Logf("WARNING: %s %s still present after %v (patch was applied; it may disappear shortly)", resourceType, resourceName, forceDeleteConfirmTimeout)
-	return nil
+	err = fmt.Errorf("%s %s still present after %v even after finalizer patch", resourceType, resourceName, forceDeleteConfirmTimeout)
+	e2e.Logf("WARNING: %v", err)
+	return err
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/extended/two_node/tnf_node_replacement.go` around lines 1571 - 1581, The
confirm loop currently returns nil even when the forced-delete confirmation
timeout expires, causing callers to proceed while the old object may still
exist; modify the code in the function containing the loop to return a non-nil
error when resourceExists never becomes false by the deadline (use the same
context and include resourceType, resourceName, namespace and
forceDeleteConfirmTimeout in the error message for actionable logging), keeping
the successful-path return nil when resourceExists becomes false; reference the
resourceExists helper and the
forceDeleteConfirmTimeout/forceDeleteConfirmInterval constants to locate and
update the logic.
test/extended/two_node/utils/services/etcd.go (1)

356-369: ⚠️ Potential issue | 🟠 Major

Return the newest matching update-setup job.

Kubernetes LIST order is not stable. Returning the first node match can bind these waits to a stale hashed job and make the test watch the wrong run.

Suggested fix
 func getUpdateSetupJobNameForNode(oc *exutil.CLI, namespace, nodeName string) (string, error) {
 	list, err := oc.AdminKubeClient().BatchV1().Jobs(namespace).List(context.Background(), metav1.ListOptions{
 		LabelSelector: tnfUpdateSetupJobLabelSelector,
 	})
 	if err != nil {
 		return "", err
 	}
-	for i := range list.Items {
-		if list.Items[i].Spec.Template.Spec.NodeName == nodeName {
-			return list.Items[i].Name, nil
-		}
-	}
-	return "", nil
+	var newest *batchv1.Job
+	for i := range list.Items {
+		job := &list.Items[i]
+		if job.Spec.Template.Spec.NodeName != nodeName {
+			continue
+		}
+		if newest == nil || job.CreationTimestamp.Time.After(newest.CreationTimestamp.Time) {
+			newest = job
+		}
+	}
+	if newest == nil {
+		return "", nil
+	}
+	return newest.Name, nil
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@test/extended/two_node/utils/services/etcd.go` around lines 356 - 369, The
function getUpdateSetupJobNameForNode currently returns the first Job whose Pod
template NodeName matches, which is unstable; instead iterate all list.Items and
pick the newest matching Job by comparing each item's
ObjectMeta.CreationTimestamp (or .GetCreationTimestamp()) and keep the item with
the latest timestamp, then return that item's Name (still return "", nil if
none); update the loop in getUpdateSetupJobNameForNode to track and return the
most recent matching job rather than the first match.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/extended/two_node/tnf_node_replacement.go`:
- Around line 1393-1405: waitForCSRApproved currently relies on
HasApprovedNodeCSR which only filters by "system:node:<name>" and can match old
approved CSRs; change waitForCSRApproved to require a fresh CSR: list CSRs for
the node (instead of calling apis.HasApprovedNodeCSR), filter by the node
subject (system:node:<name>) and by creationTimestamp being after the
node-replacement start time (add/use a timestamp field on TNFTestConfig, e.g.
ReplacementStartTime or set a local start := time.Now() when initiating
replacement), then check that at least one of those recent CSRs has been
approved; keep apis.LogNodeCSRStatus for debugging but base the success
condition on an approved CSR whose creationTimestamp is >= the replacement start
time.
- Around line 834-876: The recoverBMHAndMachineFromBackup function currently
recreates only the BMC secret and the Machine; add the mirror logic to recreate
the BareMetalHost (BMH) from backup as well: after recreateBMCSecret and
before/alongside the Machine restore, check whether the BMH already exists
(using an apis.BareMetalHostExists helper or dynamic GET against
apis.BareMetalHostGVR) and if not, read the BMH YAML from the backup directory
(create a bmhFile variable similar to machineFile, e.g.,
testConfig.TargetNode.BMHName+"-bmh.yaml"), decode into an
unstructured.Unstructured, clear resourceVersion and UID, then create it via
dyn.Resource(apis.BareMetalHostGVR).Namespace(machineAPINamespace).Create with
the same core.RetryWithOptions retry pattern used for the Machine; skip creation
if it already exists and log success after recreation.

---

Duplicate comments:
In `@test/extended/two_node/tnf_node_replacement.go`:
- Around line 1443-1454: Replace the hardcoded minPodCreationTime
(time.Now().Add(-2 * time.Minute)) with the actual node Ready transition
timestamp returned by waitForNodeRecovery: modify waitForNodeRecovery to return
the Ready time (e.g., readyTime time.Time), propagate that return value to where
minPodCreationTime is set, and pass that readyTime into
services.WaitForSurvivorUpdateSetupJobCompletionByNode (and any other callers)
so the poll uses the exact Ready timestamp instead of an approximate Now()-based
value.
- Around line 1733-1735: The RetryOptions block is ignoring the caller-provided
timeout by hardcoding threeMinuteTimeout; replace the hardcoded value with the
function's timeout parameter (ensure the parameter named timeout is used for
RetryOptions.Timeout and its type matches time.Duration) in the RetryOptions
passed where RetryOptions{ Timeout: threeMinuteTimeout, PollInterval:
utils.FiveSecondPollInterval } is set so callers can control the deadline (also
remove any misleading log text if it still references a different value).
- Around line 1571-1581: The confirm loop currently returns nil even when the
forced-delete confirmation timeout expires, causing callers to proceed while the
old object may still exist; modify the code in the function containing the loop
to return a non-nil error when resourceExists never becomes false by the
deadline (use the same context and include resourceType, resourceName, namespace
and forceDeleteConfirmTimeout in the error message for actionable logging),
keeping the successful-path return nil when resourceExists becomes false;
reference the resourceExists helper and the
forceDeleteConfirmTimeout/forceDeleteConfirmInterval constants to locate and
update the logic.

In `@test/extended/two_node/utils/services/etcd.go`:
- Around line 356-369: The function getUpdateSetupJobNameForNode currently
returns the first Job whose Pod template NodeName matches, which is unstable;
instead iterate all list.Items and pick the newest matching Job by comparing
each item's ObjectMeta.CreationTimestamp (or .GetCreationTimestamp()) and keep
the item with the latest timestamp, then return that item's Name (still return
"", nil if none); update the loop in getUpdateSetupJobNameForNode to track and
return the most recent matching job rather than the first match.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 5873f4be-15c6-413b-ad57-72f475981dab

📥 Commits

Reviewing files that changed from the base of the PR and between 5cba2a7 and f1c410d.

📒 Files selected for processing (10)
  • test/extended/testdata/bindata.go
  • test/extended/testdata/two_node/baremetalhost-template.yaml
  • test/extended/two_node/tnf_node_replacement.go
  • test/extended/two_node/utils/apis/baremetalhost.go
  • test/extended/two_node/utils/apis/csr.go
  • test/extended/two_node/utils/apis/machine.go
  • test/extended/two_node/utils/common.go
  • test/extended/two_node/utils/core/ssh.go
  • test/extended/two_node/utils/services/etcd.go
  • test/extended/two_node/utils/services/pacemaker.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • test/extended/two_node/utils/core/ssh.go

@jaypoulz jaypoulz force-pushed the tnf-node-replacement-fixes branch from f1c410d to 6d7286a Compare March 13, 2026 18:17
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@test/extended/two_node/tnf_node_replacement.go`:
- Around line 1557-1560: resourceExists currently treats any GET error as "not
found" which hides transient failures; update callers (where Delete uses
deleteAttemptTimeout and where polling uses forceDeleteConfirmTimeout) to pass a
timeout-bounded context (use context.WithTimeout with the respective
deleteAttemptTimeout / forceDeleteConfirmTimeout) into resourceExists, and also
harden resourceExists itself: detect k8s NotFound errors (apierrors.IsNotFound)
and return false, but for other errors log the error (use the test logger or
klog) and conservatively return true (meaning "still exists/unknown") so
transient network/API errors don't masquerade as successful deletes. Ensure you
reference the resourceExists(ctx, dyn, gvr, name, namespace) function and the
deleteAttemptTimeout / forceDeleteConfirmTimeout timeouts when making these
changes.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: dfeefd16-08b3-49b0-9bd0-925d78caa090

📥 Commits

Reviewing files that changed from the base of the PR and between f1c410d and 6d7286a.

📒 Files selected for processing (5)
  • test/extended/two_node/tnf_node_replacement.go
  • test/extended/two_node/utils/apis/baremetalhost.go
  • test/extended/two_node/utils/apis/csr.go
  • test/extended/two_node/utils/apis/machine.go
  • test/extended/two_node/utils/services/etcd.go

@openshift-ci-robot

Scheduling required tests:
/test e2e-aws-csi
/test e2e-aws-ovn-fips
/test e2e-aws-ovn-microshift
/test e2e-aws-ovn-microshift-serial
/test e2e-aws-ovn-serial-1of2
/test e2e-aws-ovn-serial-2of2
/test e2e-gcp-csi
/test e2e-gcp-ovn
/test e2e-gcp-ovn-upgrade
/test e2e-metal-ipi-ovn-ipv6
/test e2e-vsphere-ovn
/test e2e-vsphere-ovn-upi

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters:
/test e2e-metal-ovn-two-node-fencing

…nHostPort for IPv6

- Use net.JoinHostPort(RedfishIP, port) so IPv6 addresses are bracketed (RFC 3986)
- BMH template placeholder {REDFISH_HOST_PORT} replaces {REDFISH_IP}; port 8000 in code

Made-with: Cursor
@jaypoulz jaypoulz force-pushed the tnf-node-replacement-fixes branch from 6d7286a to 92409ab Compare March 13, 2026 19:06
@jaypoulz jaypoulz force-pushed the tnf-node-replacement-fixes branch from 92409ab to 00438bd Compare March 13, 2026 19:46
@jaypoulz
Contributor Author

/payload-job periodic-ci-openshift-release-main-nightly-4.22-e2e-metal-ovn-two-node-fencing-recovery-techpreview periodic-ci-openshift-release-main-nightly-4.22-e2e-metal-ovn-two-node-fencing-dualstack-recovery-techpreview
periodic-ci-openshift-release-main-nightly-4.22-e2e-metal-ovn-two-node-fencing-ipv6-recovery-techpreview

@openshift-ci
Contributor

openshift-ci bot commented Mar 13, 2026

@jaypoulz: trigger 2 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-main-nightly-4.22-e2e-metal-ovn-two-node-fencing-recovery-techpreview
  • periodic-ci-openshift-release-main-nightly-4.22-e2e-metal-ovn-two-node-fencing-dualstack-recovery-techpreview

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/b0916950-1f15-11f1-803b-2ea3c92c5131-0

@jaypoulz
Contributor Author

/payload-job periodic-ci-openshift-release-main-nightly-4.22-e2e-metal-ovn-two-node-fencing-ipv6-recovery-techpreview

@openshift-ci
Contributor

openshift-ci bot commented Mar 13, 2026

@jaypoulz: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-main-nightly-4.22-e2e-metal-ovn-two-node-fencing-ipv6-recovery-techpreview

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/6a8db750-1f16-11f1-8bf3-ab20a3b78e4b-0

@jaypoulz jaypoulz force-pushed the tnf-node-replacement-fixes branch from 00438bd to 21e117c Compare March 13, 2026 20:35
@jaypoulz
Contributor Author

/payload-job periodic-ci-openshift-release-main-nightly-4.22-e2e-metal-ovn-two-node-fencing-recovery-techpreview periodic-ci-openshift-release-main-nightly-4.22-e2e-metal-ovn-two-node-fencing-dualstack-recovery-techpreview periodic-ci-openshift-release-main-nightly-4.22-e2e-metal-ovn-two-node-fencing-ipv6-recovery-techpreview

@openshift-ci
Contributor

openshift-ci bot commented Mar 13, 2026

@jaypoulz: trigger 3 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-main-nightly-4.22-e2e-metal-ovn-two-node-fencing-recovery-techpreview
  • periodic-ci-openshift-release-main-nightly-4.22-e2e-metal-ovn-two-node-fencing-dualstack-recovery-techpreview
  • periodic-ci-openshift-release-main-nightly-4.22-e2e-metal-ovn-two-node-fencing-ipv6-recovery-techpreview

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/f9792ed0-1f1c-11f1-84a5-56e1af9843b4-0

…H/Machine when stuck

- Wait for baremetal-operator deployment ready before starting BMH/Machine deletes
- Discover BMO deployment by trying metal3-baremetal-operator (metal3/dev-scripts) then baremetal-operator (standard OCP)
- deleteOcResourceWithRetry: per-attempt timeout, then force-delete by patching finalizers
- When force-deleting BMH and webhook is unavailable, remove webhook config and retry

Made-with: Cursor
- Discover update-setup jobs by node name (label + NodeName) instead of exact job name
- WaitForUpdateSetupJobCompletionByNode, WaitForSurvivorUpdateSetupJobCompletionByNode in etcd.go
- restorePacemakerCluster uses ByNode so test works when job names include hash suffix

Made-with: Cursor
@jaypoulz jaypoulz force-pushed the tnf-node-replacement-fixes branch from 21e117c to f3045ae Compare March 13, 2026 20:55
…ry, fresh CSR, delete-confirm

- Observability improvements; prefer cluster APIs; delete order; BMO/webhook readiness
- recoverBMHAndMachineFromBackup: recreate BareMetalHost from backup; apis.BareMetalHostExists
- waitForCSRApproved: require CSR after ReplacementStartTime; apis.HasApprovedFreshNodeCSR
- resourceExists: only NotFound = absent; timeout-bounded context; log other errors
- TestExecution.ReplacementStartTime; timeout renames; recovery path cleanup

Made-with: Cursor
@jaypoulz jaypoulz force-pushed the tnf-node-replacement-fixes branch from f3045ae to efe8d75 Compare March 13, 2026 21:01
@xueqzhan
Contributor

/approve

@openshift-ci
Contributor

openshift-ci bot commented Mar 13, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jaypoulz, xueqzhan

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 13, 2026

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.
