Conversation

@evan-cz evan-cz commented Jan 28, 2026

The validator's pre-start diagnostic stage had a hardcoded `enforce: true` setting that wasn't actually wired up to affect behavior. This made it impossible for users to control whether diagnostic failures should block pod startup.

Additionally, when the diagnostic runner encountered errors (distinct from check failures), it would cause the validator to fail even when enforcement was disabled, effectively making `enforce: false` meaningless in error scenarios.

Functional Change:

Before: The `enforce` setting in the validator config was vestigial: diagnostic failures always logged warnings but never blocked pod startup. Runner errors would crash the validator regardless of the enforce setting.

After: When `enforce: true`, failing pre-start checks cause the validator to exit with error code 1, blocking pod startup via the lifecycle hook. When `enforce: false` (now the default), failures are logged and reported via telemetry but the pod starts normally. Runner errors are handled gracefully when enforcement is disabled.

Solution:

1. Added `components.validator.enforce` to values.yaml with default `false`
   - Includes documentation explaining behavior for both settings
   - Updated JSON schema with `enforce` boolean property

2. Updated helm/templates/validator-cm.yaml to use the configurable value
   - Pre-start stage now uses `{{ .Values.components.validator.enforce }}`
   - Other stages remain hardcoded to `enforce: false`

3. Enhanced diagnostic runner (app/domain/diagnostic/runner/runner.go):
   - Added `enforce` and `hasFailures` fields to the runner struct
   - `NewRunner()` captures the enforce setting from the stage config
   - Added `ShouldFail()` method: returns true only when enforce=true AND checks failed
   - Added `IsEnforced()` method: exposes the enforce state for error handling

4. Modified command.go to implement enforcement behavior:
   - After `Run()`, checks `engine.ShouldFail()` to determine exit behavior
   - When enforce=true and checks fail: logs failures and returns an error
   - When enforce=false and the runner errors: warns and continues (instead of Fatal)
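Steps 3 and 4 can be sketched as follows. The `NewRunner`/`ShouldFail`/`IsEnforced` names come from the PR; the `StageConfig` shape, the stubbed check, and the logging are simplified assumptions standing in for the real diagnostic suite and command.go wiring:

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// StageConfig is a simplified stand-in for the real stage config;
// only the Enforce field matters for this sketch.
type StageConfig struct {
	Enforce bool
}

// Runner tracks whether enforcement is on and whether any check failed.
type Runner struct {
	enforce     bool
	hasFailures bool
}

// NewRunner captures the enforce setting from the stage config.
func NewRunner(cfg StageConfig) *Runner {
	return &Runner{enforce: cfg.Enforce}
}

// Run executes the checks; a single stubbed check stands in for the
// real diagnostics. A failing check is recorded, not returned as error.
func (r *Runner) Run(checkPasses bool) error {
	if !checkPasses {
		r.hasFailures = true
	}
	return nil
}

// ShouldFail is true only when enforcement is on AND a check failed.
func (r *Runner) ShouldFail() bool { return r.enforce && r.hasFailures }

// IsEnforced exposes the enforce state for error handling.
func (r *Runner) IsEnforced() bool { return r.enforce }

// runPreStart mirrors the command.go flow: runner errors only abort
// when enforced; failed checks only abort when enforce=true.
func runPreStart(cfg StageConfig, checkPasses bool) error {
	r := NewRunner(cfg)
	if err := r.Run(checkPasses); err != nil {
		if r.IsEnforced() {
			return err // enforced: runner errors block startup
		}
		log.Printf("warn: runner error ignored (enforce=false): %v", err)
	}
	if r.ShouldFail() {
		return errors.New("pre-start checks failed and enforce=true")
	}
	return nil
}

func main() {
	fmt.Println(runPreStart(StageConfig{Enforce: true}, false) != nil)  // true: blocks startup
	fmt.Println(runPreStart(StageConfig{Enforce: false}, false) != nil) // false: pod starts
}
```

The key design point is that check failures and runner errors stay on separate paths, so `enforce: false` disarms both.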

Validation:

- Added 5 new test functions to runner_test.go (coverage: 69.5% -> 86.7%):
  - TestRunner_ShouldFail: verifies the enforce+hasFailures logic
  - TestRunner_EnforceSetFromStageConfig: verifies config parsing
  - TestRunner_HasFailuresTracking: verifies failure detection
  - TestRunner_ShouldFailIntegration: end-to-end config->behavior test
  - TestRunner_IsEnforced: verifies the IsEnforced() method

- Added helm/tests/validator_enforce_test.yaml with 6 test cases:
  - Default value (false), explicit true/false, other stages unaffected

- Added 3 schema validation test files:
  - components.validator.enforce.true.pass.yaml
  - components.validator.enforce.false.pass.yaml
  - components.validator.enforce.invalid.fail.yaml

- Deployed to the Brahms cluster and verified:
  - enforce=true + invalid API key: pod enters Init:CrashLoopBackOff (expected)
  - enforce=false + invalid API key: pod starts normally with warnings (expected)
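Helm-side cases like these are commonly written with the helm-unittest plugin; a sketch of two of the six cases might look like the following (the suite layout, the `config.yaml` ConfigMap key, and the rendered `enforce:` line are assumptions, not the PR's actual test file):

```yaml
suite: validator enforce
templates:
  - templates/validator-cm.yaml
tests:
  - it: defaults the pre-start stage to enforce false
    asserts:
      - matchRegex:
          path: data["config.yaml"]
          pattern: "enforce: false"
  - it: renders enforce true when set
    set:
      components.validator.enforce: true
    asserts:
      - matchRegex:
          path: data["config.yaml"]
          pattern: "enforce: true"
```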

@evan-cz evan-cz requested a review from a team as a code owner January 28, 2026 16:29
@evan-cz evan-cz marked this pull request as draft January 28, 2026 16:41