CP-37198: Fix incorrect webhook service name in validator config #645
Open: evan-cz wants to merge 1 commit into develop from CP-37198
+71
−8
Customer reported pods entering CrashLoopBackOff with FailedPostStartHook events when deploying the CloudZero agent. Investigation revealed the validator's postStart hook was attempting to reach a webhook service that didn't exist, causing DNS lookup failures and ~70 second delays due to retry logic.

Functional Change:

Before: The validator ConfigMap referenced `cloudzero-agent-cz-webhook-svc` for the webhook service, but the actual service was named `cloudzero-agent-cz-webhook` (no `-svc` suffix). This caused the webhook_server_reachable check to fail on every deployment, blocking startup for ~70 seconds while retries exhausted.

After: The validator ConfigMap correctly references the webhook service using the same helper function as the service definition, ensuring names always match.

Root Cause:

The validator-cm.yaml template (line 45) was introduced in commit 90e1bce (April 2025) with a hardcoded `-svc` suffix that never matched the actual service name:

```yaml
insights_service: {{ include "cloudzero-agent.insightsController.server.webhookFullname" . }}-svc
```

The webhook service in webhook-service.yaml uses:

```yaml
name: {{ include "cloudzero-agent.serviceName" . }}
```

Both helpers resolve to the same base name (`release-cz-webhook`), but the validator template erroneously appended `-svc`, causing DNS lookup failures.

The bug went unnoticed because:

1. The enforce flag for post-start stage is `false`, so failures don't crash pods
2. The check eventually times out after ~70 seconds and returns nil
3. Federated mode deployments skip the webhook check entirely
4. Warning-level logs were easily missed

Solution:

1. Changed validator-cm.yaml line 45 to use the correct helper without suffix: `insights_service: {{ include "cloudzero-agent.serviceName" . }}`
2. Added regression test (helm/tests/validator_insights_service_test.yaml) with 5 test cases verifying:
   - insights_service matches expected pattern with default release name
   - webhook service name matches the same pattern
   - insights_service matches with custom release names
   - insights_service does NOT contain `-svc` suffix (regression guard)

Validation:

- All tests pass, including new ones.
- Manual verification: `helm template test-release ./helm --set apiKey=test-key` shows `insights_service: test-release-cz-webhook` (no `-svc` suffix)
- No new test failures introduced (pre-existing failures unrelated to this change)
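For reviewers who want to repeat the manual verification mechanically, the check can be sketched in a few lines of Python. This is a minimal sketch and not part of the PR: the function names `insights_service_name` and `check_insights_service` are hypothetical, and it assumes the rendered chart output contains a plain `insights_service: <name>` line and that the expected name is `<release>-cz-webhook`, as the description states for the default configuration.

```python
import re
from typing import List, Optional

# Assumption: the rendered output contains a line of the form
# `insights_service: <name>`. The real ConfigMap layout may differ.
_INSIGHTS_RE = re.compile(
    r"^[ \t]*insights_service:[ \t]*(\S+)[ \t]*$", re.MULTILINE
)


def insights_service_name(rendered: str) -> Optional[str]:
    """Return the first insights_service value in the rendered text, if any."""
    match = _INSIGHTS_RE.search(rendered)
    if match is None:
        return None
    # Tolerate a quoted YAML scalar such as "name" or 'name'.
    return match.group(1).strip("\"'")


def check_insights_service(rendered: str, release: str) -> List[str]:
    """Return a list of problems; an empty list means the name looks right."""
    # Assumption: both helpers resolve to `<release>-cz-webhook`.
    expected = f"{release}-cz-webhook"
    actual = insights_service_name(rendered)
    if actual is None:
        return ["insights_service key not found in rendered output"]
    problems = []
    if actual.endswith("-svc"):
        # Regression guard for the suffix this PR removes.
        problems.append(f"{actual!r} still carries the -svc suffix")
    if actual != expected:
        problems.append(f"expected {expected!r}, got {actual!r}")
    return problems
```

The input would typically be the text produced by `helm template test-release ./helm --set apiKey=test-key`, passed in with `release="test-release"`. The suffix check and the equality check are kept separate so that a reintroduced `-svc` is reported explicitly rather than only as a generic mismatch.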
amfelso approved these changes on Feb 2, 2026
github-merge-queue bot pushed a commit that referenced this pull request on Feb 2, 2026
dmepham approved these changes on Feb 2, 2026