resourcewatch: harden watch loop to prevent thread exhaustion#30944
jcpowermac wants to merge 1 commit into openshift:main
Conversation
Pipeline controller notification: For optional jobs, comment. This repository is configured in automatic mode.
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: jcpowermac. The full list of commands accepted by this bot can be found here. Details: Needs approval from an approver in each of these files. Approvers can indicate their approval by writing
Walkthrough

Adds retry/backoff and stronger error handling to resource observation: sentinel errors, jittered exponential retry delays with special-casing for NotFound, context-aware wait logic, and stricter watch event processing. New unit tests exercise error-event handling, retry delay behavior, and watch lifecycle/cancellation.

Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
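The "stricter watch event processing" in the walkthrough refers to dispatching on the event type before type-asserting the payload, so `watch.Error` (which carries a status object, not a resource) never causes a failed cast. A minimal stdlib-only sketch: the `EventType`/`Event` stand-ins and `processEvent` are hypothetical simplifications of the `k8s.io/apimachinery` watch types, not the PR's actual code.

```go
package main

import "fmt"

// Minimal stand-ins for the k8s.io/apimachinery/pkg/watch types (assumed shapes).
type EventType string

const (
	Added    EventType = "ADDED"
	Modified EventType = "MODIFIED"
	Deleted  EventType = "DELETED"
	Bookmark EventType = "BOOKMARK"
	Error    EventType = "ERROR"
)

type Event struct {
	Type   EventType
	Object any
}

var errWatchErrorEvent = fmt.Errorf("watch returned an error event")

// processEvent dispatches on the event type *before* casting the payload,
// so Error events (which carry a status, not a resource) and Bookmark
// events never trigger a failed type assertion.
func processEvent(ev Event) (string, error) {
	switch ev.Type {
	case Error:
		return "", fmt.Errorf("%w: %v", errWatchErrorEvent, ev.Object)
	case Bookmark:
		return "bookmark ignored", nil
	case Added, Modified, Deleted:
		obj, ok := ev.Object.(string) // real code asserts *unstructured.Unstructured
		if !ok {
			return "", fmt.Errorf("unexpected object type %T", ev.Object)
		}
		return fmt.Sprintf("%s: %s", ev.Type, obj), nil
	default:
		return "unhandled event type", nil
	}
}

func main() {
	out, _ := processEvent(Event{Type: Added, Object: "pods/foo"})
	fmt.Println(out) // prints "ADDED: pods/foo"
	_, err := processEvent(Event{Type: Error, Object: "boom"})
	fmt.Println(err != nil) // prints "true"
}
```

The point of the ordering is that the type assertion only runs on event types whose contract guarantees a resource object.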
🧹 Nitpick comments (2)
pkg/resourcewatch/observe/observe.go (1)
181-189: Consider removing unreachable default case.

The `default` branch at lines 187-188 is now unreachable because the outer switch (lines 159-174) already handles unknown event types by logging and continuing. Only `Added`, `Modified`, and `Deleted` can reach this inner switch.

♻️ Suggested simplification

```diff
 switch observation.Type {
 case watch.Added:
 case watch.Modified:
 	emitUpdate(observedResources, gvr, object, resourceC)
 case watch.Deleted:
 	emitDelete(observedResources, gvr, object, resourceC)
-default:
-	log.Info("Unhandled watch event", "type", observation.Type)
 }
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/resourcewatch/observe/observe.go` around lines 181 - 189, The inner switch on observation.Type contains a redundant default branch because unknown types are already handled by the outer switch; remove the default case (the log.Info("Unhandled watch event", "type", observation.Type) branch) from the inner switch and keep only the case arms for watch.Added, watch.Modified (which calls emitUpdate(observedResources, gvr, object, resourceC)), and watch.Deleted (which calls emitDelete(observedResources, gvr, object, resourceC)) to simplify the logic and avoid unreachable code.

pkg/resourcewatch/observe/observe_test.go (1)
143-147: Minor: Backoff increase test could be flaky due to jitter.

With jitter applied, there's a small theoretical chance that `second <= first` (e.g., if first gets +25% jitter and second gets -25% jitter). In practice this is extremely unlikely given the 2x multiplier, but for deterministic tests, consider testing without jitter or testing the trend over multiple attempts.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/resourcewatch/observe/observe_test.go` around lines 143 - 147, The test around nextRetryDelay is flaky because jitter can make a later delay appear smaller; modify the test to either disable jitter or assert the exponential trend deterministically (e.g., call nextRetryDelay with a deterministic/no-jitter mode or run multiple sequential calls and check that the median/average increases). Locate the call site in the test using nextRetryDelay(...) and change it to invoke nextRetryDelay in a deterministic way (or add a parameter/flag to nextRetryDelay to turn off jitter for tests), then assert second > first reliably or assert an increasing trend over several attempts.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: e77c427c-59b6-46ed-a944-4e1ab724cfc5
📒 Files selected for processing (2)
- pkg/resourcewatch/observe/observe.go
- pkg/resourcewatch/observe/observe_test.go
Handle watch error/bookmark events safely, always stop watch streams, and add context-aware retry backoff so reconnects do not spin and accumulate threads.

- Dispatch on event type before casting to *unstructured.Unstructured so that watch.Error (which carries *metav1.Status) and watch.Bookmark events no longer trigger a cast failure and tight retry loop.
- Route watch.Added events through emitUpdate so resources created after the initial list are observed immediately.
- Add defer resourceWatch.Stop() so all return paths release the underlying watch stream.
- Replace the bare time.Sleep with bounded exponential backoff and jitter, clamped to [500ms, 30s], using context-aware waiting for prompt cancellation.
- Propagate NotFound from listAndWatchResource as an error so retry pacing is governed centrally by ObserveResource.
- Add unit tests for error-event handling, watch cleanup, backoff bounds, context cancellation, and Added-event observation.

Made-with: Cursor
Force-pushed from b38a45a to fd007d1.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pkg/resourcewatch/observe/observe.go`:
- Around line 53-66: The loop around listAndWatchResource currently retries on
all non-context errors; change it to treat decode/contract failures as terminal
by detecting those via retryReason(err) (or the specific error types your code
uses for decode/contract failures) and returning immediately instead of backing
off and retrying. Concretely, after the listAndWatchResource error check but
before computing nextRetryDelay/retrying, add a branch: if retryReason(err)
indicates a decode/contract failure (or errors.Is(err, ErrDecode) /
errors.Is(err, ErrContract)), log the terminal error and return; otherwise
continue with nextRetryDelay, waitForRetry, and retryAttempt increment as
before. Apply the same change to the other retry loops in this file that use
nextRetryDelay/retryAttempt (the other list/watch retry sites using
retryReason).
- Around line 58-70: The loop never resets retryAttempt because
listAndWatchResource currently only returns errors, so the retryAttempt++ on
error keeps growing; change the control flow so that when listAndWatchResource
finishes a clean watch cycle it returns nil (or another success signal) and the
caller sets retryAttempt = 0 before continuing; keep the existing logic that
increments retryAttempt only on error paths (when err != nil), call
nextRetryDelay/retryReason/waitForRetry only for errors, and ensure
waitForRetry(false) still returns early—update listAndWatchResource and the loop
around it (symbols: listAndWatchResource, retryAttempt, nextRetryDelay,
retryReason, waitForRetry) accordingly so healthy long-running periods reset the
retry back to base delay.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 4d90937b-24f5-49bb-b16c-ec4783748900
📒 Files selected for processing (2)
- pkg/resourcewatch/observe/observe.go
- pkg/resourcewatch/observe/observe_test.go
✅ Files skipped from review due to trivial changes (1)
- pkg/resourcewatch/observe/observe_test.go
```diff
 if err := listAndWatchResource(ctx, log, resourceClient, gvr, observedResources, resourceC); err != nil {
-	log.Error(err, "failed to list and watch resource")
 	if errors.Is(err, context.Canceled) || errors.Is(err, context.DeadlineExceeded) {
 		return
 	}
+
+	retryDelay := nextRetryDelay(err, retryAttempt)
+	log.Error(err, "failed to list and watch resource", "retryReason", retryReason(err), "retryDelay", retryDelay)
+
+	if !waitForRetry(ctx, retryDelay) {
+		return
+	}
+
+	retryAttempt++
+	continue
```
Don't retry terminal decode/contract failures forever.
This loop now backs off and retries every non-context error, including malformed watch payloads. Those won't self-heal, so the observer goroutine never returns; pkg/resourcewatch/observe/source.go:113-131 only closes `finished` after every observer exits. Keep retries for transient list/watch failures, but treat decode/contract errors as terminal.
Possible fix

```diff
 if err := listAndWatchResource(ctx, log, resourceClient, gvr, observedResources, resourceC); err != nil {
 	if errors.Is(err, context.Canceled) || errors.Is(err, context.DeadlineExceeded) {
 		return
 	}
+	if errors.Is(err, errUnexpectedObject) {
+		log.Error(err, "terminal resource watch failure")
+		return
+	}
 	retryDelay := nextRetryDelay(err, retryAttempt)
 	log.Error(err, "failed to list and watch resource", "retryReason", retryReason(err), "retryDelay", retryDelay)
```

```diff
 case watch.Error:
 	status, ok := observation.Object.(*metav1.Status)
 	if !ok {
-		return fmt.Errorf("%w: %T", errWatchErrorEvent, observation.Object)
+		return fmt.Errorf("%w: %T", errUnexpectedObject, observation.Object)
 	}
```

Also applies to: 166-170, 179-181
```go
	retryDelay := nextRetryDelay(err, retryAttempt)
	log.Error(err, "failed to list and watch resource", "retryReason", retryReason(err), "retryDelay", retryDelay)

	if !waitForRetry(ctx, retryDelay) {
		return
	}

	retryAttempt++
	continue
}

// If a watch cycle ends cleanly, start retries from the base delay.
retryAttempt = 0
```
retryAttempt never resets with the current control flow.
The increment at Line 65 runs on every failure, but the reset at Line 70 is unreachable now that listAndWatchResource only returns errors. After enough reconnects, later watch renewals stay pinned near maxRetryDelay even after long healthy periods.
Scheduling required tests:
Code Review Report

PR: resourcewatch: harden watch loop to prevent thread exhaustion

1. Files Reviewed

2. Unit Test Coverage
4 tests cover: error event handling with watch stop, backoff delay calculation (NotFound + attempts 0/1/50), Added event observation emission with cleanup, and context cancellation in waitForRetry.
Gaps:

3. Idiomatic Go Code
- Medium-High: Verify nil-watch guard before
- Medium: Verify no nil-return tight loop —
- Low: Simplify backoff doubling —
```go
for i := 0; i < retryAttempt; i++ {
	backoff = min(backoff*2, maxRetryDelay)
}
```
- Use Jitter floor clamp —

4. DRY Compliance

5. SOLID Compliance

6. Build Verification
Skipped — remote PR review without local checkout.

7. Overall Verdict: PASS WITH RECOMMENDATIONS
This is a well-targeted fix for a real production issue (thread exhaustion from unbounded watch retries).

Required Actions (blocking)

Recommended Improvements (non-blocking)

🤖 Generated with Claude Code
@jcpowermac: all tests passed! Full PR test history. Your PR dashboard. Details: Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.