Implement node status synchronization in Controller, including downtime notifications and online/offline management#388
Paragrf wants to merge 2 commits into apache:unstable from
Conversation
Codecov Report ❌ Patch coverage is
Additional details and impacted files

```diff
@@            Coverage Diff            @@
##           unstable     #388   +/-  ##
============================================
+ Coverage     43.38%   50.09%   +6.70%
============================================
  Files            37       45       +8
  Lines          2971     3885     +914
============================================
+ Hits           1289     1946     +657
- Misses         1544     1724     +180
- Partials        138      215      +77
```
```go
ErrShardIsServicing      = errors.New("shard is servicing")
ErrShardSlotIsMigrating  = errors.New("shard slot is migrating")
ErrShardNoMatchNewMaster = errors.New("no match new master in shard")
ErrCannotOfflineMaster   = errors.New("cannot take master node offline, failover first")
```

Suggested change:

```diff
- ErrCannotOfflineMaster = errors.New("cannot take master node offline, failover first")
+ ErrCannotOfflineMaster = errors.New("cannot offline the master node")
```
```go
role      string
password  string
createdAt int64
failed    bool
```

Could we change `failed` to `status`, so that we can add more statuses in the future?
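The reviewer's suggestion could look like the following sketch. The `NodeStatus` type and its constant names are illustrative assumptions, not part of the PR; the point is that an enum-style type leaves room for states beyond a boolean `failed`:

```go
package main

import "fmt"

// NodeStatus is a hypothetical replacement for the boolean `failed`
// field, leaving room for more states later (e.g. draining).
type NodeStatus int

const (
	StatusOnline   NodeStatus = iota // serving traffic normally
	StatusFailed                     // heartbeat timeout or crash detected
	StatusDraining                   // manually taken offline for maintenance
)

// String makes NodeStatus implement fmt.Stringer for readable logs.
func (s NodeStatus) String() string {
	switch s {
	case StatusOnline:
		return "online"
	case StatusFailed:
		return "failed"
	case StatusDraining:
		return "draining"
	default:
		return "unknown"
	}
}

func main() {
	s := StatusDraining
	fmt.Println(s) // draining
}
```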
```go
func (cluster *Cluster) SetNodesOffline(addrs []string) error {
	nodes := make([]Node, 0, len(addrs))
	for _, addr := range addrs {
		node := cluster.findNodeByAddr(addr)
		if node == nil {
			return fmt.Errorf("node %s: %w", addr, consts.ErrNotFound)
		}
		if node.IsMaster() {
			return fmt.Errorf("node %s: %w", addr, consts.ErrCannotOfflineMaster)
		}
		nodes = append(nodes, node)
	}
	for _, node := range nodes {
		node.SetFailed(true)
	}
	return nil
}

func (cluster *Cluster) SetNodesOnline(addrs []string) error {
	nodes := make([]Node, 0, len(addrs))
	for _, addr := range addrs {
		node := cluster.findNodeByAddr(addr)
		if node == nil {
			return fmt.Errorf("node %s: %w", addr, consts.ErrNotFound)
		}
		nodes = append(nodes, node)
	}
	for _, node := range nodes {
		node.SetFailed(false)
	}
	return nil
}
```
Can we merge these into a single `SetNodeStatusByAddr`?
```go
	return nil
}

func (cluster *Cluster) SetNodeFailedByID(nodeID string, failed bool) error {
```

This could be renamed to `SetNodeStatusByID` for consistency.
@Paragrf Sorry for the late review due to the heavy traffic at work.
Background
To improve cluster reliability and operational efficiency, the controller needs to bridge the status gap between slave nodes and the server. Beyond reporting unexpected failures, it is essential to support proactive maintenance workflows, allowing operators to safely remove traffic before performing node updates or hardware swaps.
Key Changes
- Downtime Push: The controller now monitors slave node health and proactively pushes "Downtime" alerts to the server upon detection of a crash or heartbeat timeout.
- Manual Offline (Traffic Draining): Supports a proactive "Offline" command. This allows the server to drain traffic away from a specific slave node before any maintenance work begins, ensuring zero-impact operations.
- Manual Online: Supports a "Ready-to-Serve" notification when a slave is back online and fully synchronized, allowing the server to safely re-enable traffic.
Related Issues
Fixes #385