OCPBUGS-77949 OCPBUGS-77948 OCPBUGS-78298: TNF node replacement test updates #30846
jaypoulz wants to merge 4 commits into openshift:main
Conversation
Pipeline controller notification: this repository is configured in automatic mode.
@jaypoulz: This pull request references Jira Issue OCPBUGS-77949, which is invalid:
The bug has been updated to refer to the pull request using the external bug tracker.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
Walkthrough: Migrates two-node tests to dynamic API clients, parameterizes BareMetalHost Redfish authority, enhances node-replacement flows (CSR approval, readiness gating, finalizer/force-delete), adds job pod log dumping, pacemaker full-status debug, and UTF-8-aware SSH log truncation.
Estimated code review effort: 4 (Complex), ~45 minutes. Pre-merge checks: 4 passed, 1 warning.
/jira refresh
@jaypoulz: This pull request references Jira Issue OCPBUGS-77949, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
No GitHub users were found matching the public email listed for the QA contact in Jira (dhensel@redhat.com), skipping review request.
@jaypoulz: This pull request references Jira Issue OCPBUGS-77949, which is valid. 3 validation(s) were run on this bug
No GitHub users were found matching the public email listed for the QA contact in Jira (dhensel@redhat.com), skipping review request.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@test/extended/two_node/utils/core/ssh.go`:
- Around line 136-141: The current log truncation slices the UTF-8 string bytes
with out[:maxLogOutputBytes], which can cut a multi-byte rune and produce
invalid UTF-8 in logs; update the truncation logic around stdout.String()/out
and the e2e.Logf call to perform rune-safe truncation (for example, convert to
runes or iterate runes until adding the next rune would exceed
maxLogOutputBytes) and then log the safely truncated string along with the total
byte length using the existing maxLogOutputBytes and e2e.Logf call sites.
📒 Files selected for processing (6)
test/extended/testdata/two_node/baremetalhost-template.yaml, test/extended/two_node/tnf_node_replacement.go, test/extended/two_node/utils/common.go, test/extended/two_node/utils/core/ssh.go, test/extended/two_node/utils/services/etcd.go, test/extended/two_node/utils/services/pacemaker.go
eggfoobar left a comment:
Looking great, just had some small suggestions.
Force-pushed 547c9eb to 0816c76
/payload-job periodic-ci-openshift-release-main-nightly-4.22-e2e-metal-ovn-two-node-fencing-recovery-techpreview
@jaypoulz: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/c38d4d90-1cb2-11f1-8125-bec258656377-0
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@test/extended/two_node/tnf_node_replacement.go`:
- Around line 1273-1280: The wait uses a hardcoded minPodCreationTime
(time.Now().Add(-2 * time.Minute)) which can include pre-Ready stale pods;
change waitForNodeRecovery to return the node Ready timestamp (e.g., readyTime)
and use that exact timestamp here instead of time.Now().Add(...), passing the
returned readyTime as minPodCreationTime into
services.WaitForSurvivorUpdateSetupJobCompletion (and the symmetric
WaitForTargetUpdateSetupJobCompletion) so the waits are gated on the node Ready
time rather than an approximate clock offset.
- Around line 1337-1354: The current attempt spawns oc.AsAdmin().Run("delete")
in a goroutine and uses time.After(deleteAttemptTimeout), which leaves the
delete running if the timer fires; replace that pattern with a per-attempt
cancelable context so each delete is actually bounded: inside the
RetryWithOptions callback create ctx, cancel :=
context.WithTimeout(context.Background(), deleteAttemptTimeout) and defer
cancel(), then invoke the delete command with that context (e.g.,
oc.AsAdmin().Run("delete").Args(resourceType, resourceName, "-n",
namespace).WithContext(ctx).Output() or the project’s equivalent Run/Output
method that accepts a context), remove the extra goroutine and select, capture
and log the returned error from the cancelable delete call, and then use
ocResourceExists(oc, resourceType, resourceName, namespace) to decide
success/failure as before.
- Around line 1466-1468: The call to core.RetryOptions in
waitForEtcdResourceToStop is ignoring the function's timeout parameter and
hardcodes threeMinuteTimeout, preventing callers from controlling the deadline;
change the RetryOptions Timeout to use the function's timeout argument (the
timeout parameter of waitForEtcdResourceToStop) instead of threeMinuteTimeout,
and ensure any associated log message that references the timeout reflects the
passed-in timeout value so logs match behavior.
- Around line 1378-1393: The function forceDeleteOcResourceByRemovingFinalizers
currently returns nil even when the confirm loop times out, causing callers to
assume deletion succeeded; change the final branch so that after the timeout
(where it currently logs the WARNING) the function returns a non-nil error
(e.g., fmt.Errorf with context including resourceType, resourceName and
forceDeleteConfirmTimeout) instead of nil so callers see the failure and can
handle the retry/error path; update the log call in
forceDeleteOcResourceByRemovingFinalizers to include the same error context when
returning.
📒 Files selected for processing (6)
test/extended/testdata/two_node/baremetalhost-template.yaml, test/extended/two_node/tnf_node_replacement.go, test/extended/two_node/utils/common.go, test/extended/two_node/utils/core/ssh.go, test/extended/two_node/utils/services/etcd.go, test/extended/two_node/utils/services/pacemaker.go
🚧 Files skipped from review as they are similar to previous changes (1)
- test/extended/testdata/two_node/baremetalhost-template.yaml
@jaypoulz: This pull request references Jira Issue OCPBUGS-77949, which is valid. 3 validation(s) were run on this bug
No GitHub users were found matching the public email listed for the QA contact in Jira (dhensel@redhat.com), skipping review request. The bug has been updated to refer to the pull request using the external bug tracker. This pull request references Jira Issue OCPBUGS-77948, which is valid. 3 validation(s) were run on this bug
No GitHub users were found matching the public email listed for the QA contact in Jira (dhensel@redhat.com), skipping review request. The bug has been updated to refer to the pull request using the external bug tracker. This pull request references Jira Issue OCPBUGS-78298, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
No GitHub users were found matching the public email listed for the QA contact in Jira (dhensel@redhat.com), skipping review request. The bug has been updated to refer to the pull request using the external bug tracker.
/payload-job periodic-ci-openshift-release-main-nightly-4.22-e2e-metal-ovn-two-node-fencing-recovery-techpreview
@jaypoulz: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/215be940-1d8a-11f1-85b1-c121c92478a0-0
@jaypoulz: This pull request references Jira Issue OCPBUGS-77949, which is valid. 3 validation(s) were run on this bug
No GitHub users were found matching the public email listed for the QA contact in Jira (dhensel@redhat.com), skipping review request. This pull request references Jira Issue OCPBUGS-77948, which is valid. 3 validation(s) were run on this bug
No GitHub users were found matching the public email listed for the QA contact in Jira (dhensel@redhat.com), skipping review request. This pull request references Jira Issue OCPBUGS-78298, which is valid. 3 validation(s) were run on this bug
No GitHub users were found matching the public email listed for the QA contact in Jira (dhensel@redhat.com), skipping review request.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
test/extended/two_node/tnf_node_replacement.go (1)
589-598: ⚠️ Potential issue | 🟡 Minor — Timeout documentation is inconsistent with implementation.
Line 589 says “overall 30-minute timeout”, but Line 597 sets 20 * time.Minute. Please align the comment/value to avoid misleading recovery logs.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/extended/two_node/tnf_node_replacement.go` around lines 589 - 598, The comment above recoverClusterFromBackup says "Has an overall 30-minute timeout" but the implementation sets const recoveryTimeout = 20 * time.Minute; update either the comment or the recoveryTimeout constant so they match (e.g., change the comment to "20-minute timeout" or set recoveryTimeout = 30 * time.Minute) and ensure the descriptive log/comment near recoverClusterFromBackup and the recoveryTimeout constant stay consistent.
♻️ Duplicate comments (4)
test/extended/two_node/tnf_node_replacement.go (4)
1511-1513: ⚠️ Potential issue | 🟡 Minor — Honor the function `timeout` parameter.
Line 1512 hardcodes `threeMinuteTimeout`, so the `timeout` argument to `waitForEtcdResourceToStop` is ignored.
🛠️ Minimal fix
```diff
 }, core.RetryOptions{
-	Timeout:      threeMinuteTimeout,
+	Timeout:      timeout,
 	PollInterval: utils.FiveSecondPollInterval,
 }, fmt.Sprintf("etcd stop on %s", testConfig.SurvivingNode.Name))
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/extended/two_node/tnf_node_replacement.go` around lines 1511 - 1513, The call to configure core.RetryOptions in waitForEtcdResourceToStop is ignoring the function's timeout parameter by hardcoding threeMinuteTimeout; change the RetryOptions.Timeout field to use the function's timeout parameter (named timeout) instead of threeMinuteTimeout so the provided timeout is honored (keep PollInterval as utils.FiveSecondPollInterval and preserve the surrounding call in waitForEtcdResourceToStop).
1277-1292: ⚠️ Potential issue | 🟠 Major — Gate survivor job timing on actual Ready time, not a fixed offset.
Line 1277 uses `time.Now().Add(-2 * time.Minute)`, which can still include stale pre-Ready runs or exclude valid runs depending on timing drift. Use the exact replacement-node Ready timestamp captured from `waitForNodeRecovery`.
🛠️ Minimal direction
```diff
-	minPodCreationTime := time.Now().Add(-2 * time.Minute)
+	minPodCreationTime := replacementNodeReadyTime
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/extended/two_node/tnf_node_replacement.go` around lines 1277 - 1292, Replace the fixed minPodCreationTime (currently set to time.Now().Add(-2 * time.Minute)) with the actual replacement-node Ready timestamp returned by waitForNodeRecovery (use that Ready time as the min creation time); update the variables passed into services.WaitForSurvivorUpdateSetupJobCompletionByNode and services.WaitForUpdateSetupJobCompletionByNode to use that replacementReadyTime (or equivalent field on testConfig.TargetNode) so both waits are gated on the node's real Ready time rather than a hardcoded offset.
1411-1412: ⚠️ Potential issue | 🟠 Major — Do not return success while the resource still exists.
After force-delete confirmation times out, Line 1412 still returns `nil`, which lets callers proceed as if cleanup succeeded.
🛠️ Minimal fix
```diff
-	e2e.Logf("WARNING: %s %s still present after %v (patch was applied; it may disappear shortly)", resourceType, resourceName, forceDeleteConfirmTimeout)
-	return nil
+	return fmt.Errorf("%s %s still present after %v even after finalizer patch", resourceType, resourceName, forceDeleteConfirmTimeout)
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/extended/two_node/tnf_node_replacement.go` around lines 1411 - 1412, The current code logs a warning when the resource still exists after forceDeleteConfirmTimeout but then returns nil, signaling success; change the behavior in the function where this occurs (referencing resourceType, resourceName, forceDeleteConfirmTimeout and e2e.Logf) to return a non-nil error instead of nil (e.g., a formatted error describing the resource still present) so callers do not treat cleanup as successful; ensure fmt (or errors) is imported and use fmt.Errorf to construct the error message.
1342-1354: ⚠️ Potential issue | 🟠 Major — `deleteAttemptTimeout` is not truly enforced per attempt.
When Line 1352 times out, the goroutine running `oc delete` keeps executing in the background. Retries can overlap and race each other.
```bash
#!/bin/bash
set -euo pipefail
# Verify whether exutil CLI supports context-bound command execution
fd client.go
rg -n -C3 'type CLI struct' test/extended/util/client.go
rg -n -C4 'func \(.*CLI.*\) Run\(' test/extended/util/client.go
rg -n -C4 'func \(.*CLI.*\) Output\(' test/extended/util/client.go
rg -n -C4 'WithContext|context\.Context' test/extended/util/client.go
```
If context-bound command execution is unavailable, use a client-go delete path with context timeout where possible.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/extended/two_node/tnf_node_replacement.go` around lines 1342 - 1354, The deleteAttemptTimeout isn't canceling the in-flight oc delete goroutine, so retries can overlap; modify the deletion to be context-aware: create a context with timeout (based on deleteAttemptTimeout) and use a context-bound API (either pass ctx into oc.AsAdmin().Run(...).Output() if that method supports contexts, or replace this path with a client-go delete call that accepts ctx) or execute the command via an exec path that supports CommandContext so the process is killed when the context times out; ensure the goroutine returns on context cancellation and send the final error into the done channel only when the ctx isn't canceled to avoid races between overlapping attempts.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@test/extended/two_node/utils/services/etcd.go`:
- Around line 356-369: The function getUpdateSetupJobNameForNode currently
returns the first job matching nodeName from the list (label selector
tnfUpdateSetupJobLabelSelector), which is nondeterministic; instead, filter
list.Items for Spec.Template.Spec.NodeName == nodeName, then choose the item
with the largest CreationTimestamp (newest) and return its Name; update the
function to iterate to collect matches, compare metav1.Time (or
.CreationTimestamp) to pick the latest job, and return that job's name (or
""/error if none).
📒 Files selected for processing (2)
test/extended/two_node/tnf_node_replacement.go, test/extended/two_node/utils/services/etcd.go
Force-pushed 5cba2a7 to f1c410d
Actionable comments posted: 3
♻️ Duplicate comments (4)
test/extended/two_node/tnf_node_replacement.go (3)
1443-1454: ⚠️ Potential issue | 🟠 Major — Use the actual Ready timestamp here.
`time.Now().Add(-2 * time.Minute)` can still admit a pre-Ready survivor job or miss a valid run that started immediately after the node became Ready. Plumb the exact Ready transition time out of `waitForNodeRecovery` and use that for `minPodCreationTime`.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/extended/two_node/tnf_node_replacement.go` around lines 1443 - 1454, Replace the hardcoded minPodCreationTime (time.Now().Add(-2 * time.Minute)) with the actual node Ready transition timestamp returned by waitForNodeRecovery: modify waitForNodeRecovery to return the Ready time (e.g., readyTime time.Time), propagate that return value to where minPodCreationTime is set, and pass that readyTime into services.WaitForSurvivorUpdateSetupJobCompletionByNode (and any other callers) so the poll uses the exact Ready timestamp instead of an approximate Now()-based value.
1733-1735: ⚠️ Potential issue | 🟡 Minor — Honor the caller-provided timeout.
This function logs the `timeout` argument but still hardcodes `threeMinuteTimeout`, so callers cannot actually control the deadline.
Suggested fix
```diff
 }, core.RetryOptions{
-	Timeout:      threeMinuteTimeout,
+	Timeout:      timeout,
 	PollInterval: utils.FiveSecondPollInterval,
 }, fmt.Sprintf("etcd stop on %s", testConfig.SurvivingNode.Name))
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/extended/two_node/tnf_node_replacement.go` around lines 1733 - 1735, The RetryOptions block is ignoring the caller-provided timeout by hardcoding threeMinuteTimeout; replace the hardcoded value with the function's timeout parameter (ensure the parameter named timeout is used for RetryOptions.Timeout and its type matches time.Duration) in the RetryOptions passed where RetryOptions{ Timeout: threeMinuteTimeout, PollInterval: utils.FiveSecondPollInterval } is set so callers can control the deadline (also remove any misleading log text if it still references a different value).
1571-1581: ⚠️ Potential issue | 🟠 Major — Don't return success while the old object is still present.
If this confirm loop expires, callers continue into recreate logic even though the old BMH/Machine may still exist. Return an error here so the retry/error path handles it instead of racing the stale object.
Suggested fix
```diff
 for time.Now().Before(deadline) {
 	if !resourceExists(ctx, dyn, gvr, resourceName, namespace) {
 		e2e.Logf("Resource %s %s confirmed gone", resourceType, resourceName)
 		return nil
 	}
 	time.Sleep(forceDeleteConfirmInterval)
 }
-e2e.Logf("WARNING: %s %s still present after %v (patch was applied; it may disappear shortly)", resourceType, resourceName, forceDeleteConfirmTimeout)
-return nil
+err = fmt.Errorf("%s %s still present after %v even after finalizer patch", resourceType, resourceName, forceDeleteConfirmTimeout)
+e2e.Logf("WARNING: %v", err)
+return err
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/extended/two_node/tnf_node_replacement.go` around lines 1571 - 1581, The confirm loop currently returns nil even when the forced-delete confirmation timeout expires, causing callers to proceed while the old object may still exist; modify the code in the function containing the loop to return a non-nil error when resourceExists never becomes false by the deadline (use the same context and include resourceType, resourceName, namespace and forceDeleteConfirmTimeout in the error message for actionable logging), keeping the successful-path return nil when resourceExists becomes false; reference the resourceExists helper and the forceDeleteConfirmTimeout/forceDeleteConfirmInterval constants to locate and update the logic.
test/extended/two_node/utils/services/etcd.go (1)
356-369: ⚠️ Potential issue | 🟠 Major — Return the newest matching update-setup job.
Kubernetes LIST order is not stable. Returning the first node match can bind these waits to a stale hashed job and make the test watch the wrong run.
Suggested fix
```diff
 func getUpdateSetupJobNameForNode(oc *exutil.CLI, namespace, nodeName string) (string, error) {
 	list, err := oc.AdminKubeClient().BatchV1().Jobs(namespace).List(context.Background(), metav1.ListOptions{
 		LabelSelector: tnfUpdateSetupJobLabelSelector,
 	})
 	if err != nil {
 		return "", err
 	}
-	for i := range list.Items {
-		if list.Items[i].Spec.Template.Spec.NodeName == nodeName {
-			return list.Items[i].Name, nil
-		}
-	}
-	return "", nil
+	var newest *batchv1.Job
+	for i := range list.Items {
+		job := &list.Items[i]
+		if job.Spec.Template.Spec.NodeName != nodeName {
+			continue
+		}
+		if newest == nil || job.CreationTimestamp.Time.After(newest.CreationTimestamp.Time) {
+			newest = job
+		}
+	}
+	if newest == nil {
+		return "", nil
+	}
+	return newest.Name, nil
 }
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@test/extended/two_node/utils/services/etcd.go` around lines 356 - 369, The function getUpdateSetupJobNameForNode currently returns the first Job whose Pod template NodeName matches, which is unstable; instead iterate all list.Items and pick the newest matching Job by comparing each item's ObjectMeta.CreationTimestamp (or .GetCreationTimestamp()) and keep the item with the latest timestamp, then return that item's Name (still return "", nil if none); update the loop in getUpdateSetupJobNameForNode to track and return the most recent matching job rather than the first match.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@test/extended/two_node/tnf_node_replacement.go`:
- Around line 1393-1405: waitForCSRApproved currently relies on
HasApprovedNodeCSR which only filters by "system:node:<name>" and can match old
approved CSRs; change waitForCSRApproved to require a fresh CSR: list CSRs for
the node (instead of calling apis.HasApprovedNodeCSR), filter by the node
subject (system:node:<name>) and by creationTimestamp being after the
node-replacement start time (add/use a timestamp field on TNFTestConfig, e.g.
ReplacementStartTime or set a local start := time.Now() when initiating
replacement), then check that at least one of those recent CSRs has been
approved; keep apis.LogNodeCSRStatus for debugging but base the success
condition on an approved CSR whose creationTimestamp is >= the replacement start
time.
- Around line 834-876: The recoverBMHAndMachineFromBackup function currently
recreates only the BMC secret and the Machine; add the mirror logic to recreate
the BareMetalHost (BMH) from backup as well: after recreateBMCSecret and
before/alongside the Machine restore, check whether the BMH already exists
(using an apis.BareMetalHostExists helper or dynamic GET against
apis.BareMetalHostGVR) and if not, read the BMH YAML from the backup directory
(create a bmhFile variable similar to machineFile, e.g.,
testConfig.TargetNode.BMHName+"-bmh.yaml"), decode into an
unstructured.Unstructured, clear resourceVersion and UID, then create it via
dyn.Resource(apis.BareMetalHostGVR).Namespace(machineAPINamespace).Create with
the same core.RetryWithOptions retry pattern used for the Machine; skip creation
if it already exists and log success after recreation.
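The "clear resourceVersion and UID" step before re-creating the BMH can be sketched with a plain map standing in for `unstructured.Unstructured`'s content; stripping `creationTimestamp`, `generation`, `managedFields`, and `status` as well is a common extra precaution, not something the review mandates:

```go
package main

import "fmt"

// sanitizeForCreate removes server-populated metadata from a backed-up
// object so the API server will accept it in a Create call; the map plays
// the role of unstructured.Unstructured's Object field.
func sanitizeForCreate(obj map[string]any) {
	meta, ok := obj["metadata"].(map[string]any)
	if !ok {
		return
	}
	for _, field := range []string{"resourceVersion", "uid", "creationTimestamp", "generation", "managedFields"} {
		delete(meta, field)
	}
	delete(obj, "status") // status is owned by controllers, not the client
}

func main() {
	bmh := map[string]any{
		"apiVersion": "metal3.io/v1alpha1",
		"kind":       "BareMetalHost",
		"metadata": map[string]any{
			"name":            "worker-1", // illustrative name
			"resourceVersion": "12345",
			"uid":             "aaaa-bbbb",
		},
		"status": map[string]any{"provisioning": "provisioned"},
	}
	sanitizeForCreate(bmh)
	meta := bmh["metadata"].(map[string]any)
	fmt.Println(meta["resourceVersion"], bmh["status"]) // <nil> <nil>
}
```

The sanitized object would then be passed to `dyn.Resource(apis.BareMetalHostGVR).Namespace(...).Create(...)` under the same retry pattern used for the Machine.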
📒 Files selected for processing (10)
test/extended/testdata/bindata.go, test/extended/testdata/two_node/baremetalhost-template.yaml, test/extended/two_node/tnf_node_replacement.go, test/extended/two_node/utils/apis/baremetalhost.go, test/extended/two_node/utils/apis/csr.go, test/extended/two_node/utils/apis/machine.go, test/extended/two_node/utils/common.go, test/extended/two_node/utils/core/ssh.go, test/extended/two_node/utils/services/etcd.go, test/extended/two_node/utils/services/pacemaker.go
🚧 Files skipped from review as they are similar to previous changes (1)
- test/extended/two_node/utils/core/ssh.go
Force-pushed f1c410d to 6d7286a
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@test/extended/two_node/tnf_node_replacement.go`:
- Around line 1557-1560: resourceExists currently treats any GET error as "not
found" which hides transient failures; update callers (where Delete uses
deleteAttemptTimeout and where polling uses forceDeleteConfirmTimeout) to pass a
timeout-bounded context (use context.WithTimeout with the respective
deleteAttemptTimeout / forceDeleteConfirmTimeout) into resourceExists, and also
harden resourceExists itself: detect k8s NotFound errors (apierrors.IsNotFound)
and return false, but for other errors log the error (use the test logger or
klog) and conservatively return true (meaning "still exists/unknown") so
transient network/API errors don't masquerade as successful deletes. Ensure you
reference the resourceExists(ctx, dyn, gvr, name, namespace) function and the
deleteAttemptTimeout / forceDeleteConfirmTimeout timeouts when making these
changes.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: dfeefd16-08b3-49b0-9bd0-925d78caa090
📒 Files selected for processing (5)
- test/extended/two_node/tnf_node_replacement.go
- test/extended/two_node/utils/apis/baremetalhost.go
- test/extended/two_node/utils/apis/csr.go
- test/extended/two_node/utils/apis/machine.go
- test/extended/two_node/utils/services/etcd.go
Scheduling required tests: Scheduling tests matching the
…nHostPort for IPv6
- Use net.JoinHostPort(RedfishIP, port) so IPv6 addresses are bracketed (RFC 3986)
- BMH template placeholder {REDFISH_HOST_PORT} replaces {REDFISH_IP}; port 8000 in code
Made-with: Cursor
Force-pushed from 6d7286a to 92409ab
Scheduling required tests: Scheduling tests matching the
Force-pushed from 92409ab to 00438bd
/payload-job periodic-ci-openshift-release-main-nightly-4.22-e2e-metal-ovn-two-node-fencing-recovery-techpreview periodic-ci-openshift-release-main-nightly-4.22-e2e-metal-ovn-two-node-fencing-dualstack-recovery-techpreview
@jaypoulz: trigger 2 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/b0916950-1f15-11f1-803b-2ea3c92c5131-0
/payload-job periodic-ci-openshift-release-main-nightly-4.22-e2e-metal-ovn-two-node-fencing-ipv6-recovery-techpreview
@jaypoulz: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/6a8db750-1f16-11f1-8bf3-ab20a3b78e4b-0
Scheduling required tests: Scheduling tests matching the
Force-pushed from 00438bd to 21e117c
/payload-job periodic-ci-openshift-release-main-nightly-4.22-e2e-metal-ovn-two-node-fencing-recovery-techpreview periodic-ci-openshift-release-main-nightly-4.22-e2e-metal-ovn-two-node-fencing-dualstack-recovery-techpreview periodic-ci-openshift-release-main-nightly-4.22-e2e-metal-ovn-two-node-fencing-ipv6-recovery-techpreview
@jaypoulz: trigger 3 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/f9792ed0-1f1c-11f1-84a5-56e1af9843b4-0
…H/Machine when stuck
- Wait for baremetal-operator deployment ready before starting BMH/Machine deletes
- Discover BMO deployment by trying metal3-baremetal-operator (metal3/dev-scripts) then baremetal-operator (standard OCP)
- deleteOcResourceWithRetry: per-attempt timeout, then force-delete by patching finalizers
- When force-deleting BMH and webhook is unavailable, remove webhook config and retry
Made-with: Cursor
- Discover update-setup jobs by node name (label + NodeName) instead of exact job name
- WaitForUpdateSetupJobCompletionByNode, WaitForSurvivorUpdateSetupJobCompletionByNode in etcd.go
- restorePacemakerCluster uses ByNode so test works when job names include hash suffix
Made-with: Cursor
Force-pushed from 21e117c to f3045ae
…ry, fresh CSR, delete-confirm
- Observability improvements; prefer cluster APIs; delete order; BMO/webhook readiness
- recoverBMHAndMachineFromBackup: recreate BareMetalHost from backup; apis.BareMetalHostExists
- waitForCSRApproved: require CSR after ReplacementStartTime; apis.HasApprovedFreshNodeCSR
- resourceExists: only NotFound = absent; timeout-bounded context; log other errors
- TestExecution.ReplacementStartTime; timeout renames; recovery path cleanup
Made-with: Cursor
Force-pushed from f3045ae to efe8d75
/approve
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: jaypoulz, xueqzhan
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing