
Add configurable database health check with automatic restart on failure #38

Open
rositsa-popova wants to merge 2 commits into cloudfoundry:main from rositsa-popova:add-locket-db-health-check

@rositsa-popova

Summary

Implements a configurable database health check for Locket that monitors database connectivity and automatically restarts the process when failures are detected. Follows the same pattern as BBS (cloudfoundry/bbs#134).

Resolves: cloudfoundry/diego-release#1105

Problem: Locket can enter a degraded state when the database becomes unresponsive, with no automatic recovery mechanism.

Solution:

  • Periodic health check (UPSERT + SELECT on a dedicated table)
  • Configurable interval, timeout, and failure threshold
  • Automatic process exit on sustained failures, allowing BOSH to restart
  • Disabled by default for backward compatibility

Test Results

All tests passed on dev landscape with PostgreSQL backend:

Test 1 - Backward Compatibility (Health Check Disabled): ✅

  • Verified no health check activity when enable_db_health_check: false (default)
  • Locket functions normally without any behavior changes

Test 2 - Health Check Enabled and Working: ✅

  • Health check runner started successfully
  • 54+ consecutive successful health checks observed
  • Checks occurring every 10 seconds as configured
  • Logs at INFO level: locket.db-health-check-runner.health-check-succeeded

Test 3 - Database Failure Detection: ✅

  • Blocked database connectivity using iptables (PostgreSQL port 5432)
  • Health check detected failure in 12 seconds
  • Three consecutive timeouts recorded (5s each)
  • Log message: "database-failure-detected-restarting-locket"
  • Locket process exited and was restarted by BOSH monit
  • Health checks resumed successfully after recovery

Test 4 - Timeout Protection: ✅

  • Added 10 second network delay to database traffic
  • Health check timeout (5s) triggered correctly: the check did not wait for the full 10s delay
  • Three consecutive timeouts detected
  • Locket restarted as expected
  • System recovered after removing delay
  • Confirms health checks don't hang on slow database

Test 5 - Configuration Parameters: ✅

  • Verified all configuration values applied correctly:
    • enable_db_health_check: true
    • health_check_interval: 10s
    • health_check_timeout: 5s
    • health_check_failure_threshold: 3
  • Measured actual interval: exactly 10 seconds between checks

Database Support

  • ✅ MySQL 8.0 (tested in docker)
  • ✅ PostgreSQL (tested in docker and on dev landscape)

Backward Compatibility

Breaking Change? No

This feature is disabled by default and requires explicit operator opt-in via the enable_db_health_check BOSH property. When disabled (default), Locket behaves exactly as before with no changes to functionality or performance.

When enabled:

  • Locket creates an additional table locket_health_check (simple 2-column table)
  • Minimal performance overhead: one UPSERT + SELECT per check interval (10 seconds in the tested configuration)
  • Only restarts on sustained database failures (the configured threshold of consecutive failures, 3 in the tested configuration)
  • No breaking changes to existing APIs, interfaces, or behaviors

@rositsa-popova requested a review from a team as a code owner on March 4, 2026, 08:31
@linux-foundation-easycla

CLA Not Signed
