RS: Active-Active disaster recovery strategies #2436
Open: rrelledge wants to merge 12 commits into main from DOC-5860
51905bc DOC-5860 RS: Disaster recovery strategies for Active-Active databases (rrelledge)
ad03b8d DOC-5860 Style fixes, added definitions and links (rrelledge)
678dcc1 DOC-5860 Added A-A disaster recovery diagrams (rrelledge)
14593f8 DOC-5860 Trimmed svg view boxes (rrelledge)
eafb01c DOC-5860 Added diagram intros and alt text (rrelledge)
c4593c4 DOC-5860 Added redis-py failover link to A-A disaster recovery doc (rrelledge)
e019f4a DOC-5860 Added some missing commas (rrelledge)
e2a1c72 DOC-5860 Copy edits and fixed diagram (rrelledge)
e0f0d1c Update content/operate/rs/databases/active-active/disaster-recovery.md (rrelledge)
e3bd20c Merge branch 'main' into DOC-5860 (rrelledge)
db423b5 DOC-5860 Feedback updates for structure and diagram size (rrelledge)
20f905c DOC-5860 Fixed typo in centralized proxy diagram (rrelledge)
68 changes: 68 additions & 0 deletions
content/operate/rs/databases/active-active/disaster-recovery/_index.md
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,68 @@ | ||
| --- | ||
| Title: Disaster recovery strategies for Active-Active databases | ||
| alwaysopen: false | ||
| categories: | ||
| - docs | ||
| - operate | ||
| - rs | ||
| - rc | ||
| description: Disaster recovery strategies for Active-Active databases using network, proxy, client library, and application-based approaches. | ||
| linkTitle: Disaster recovery | ||
| weight: 50 | ||
| --- | ||
|
|
||
| An application deployed with an Active-Active database connects to a database member that is geographically nearby. If that database member becomes unavailable, the application can fail over to a secondary Active-Active database member, and fail back to the original database member after it recovers. | ||
|
|
||
| However, Active-Active Redis databases do not have a built-in [failover](https://en.wikipedia.org/wiki/Failover) or failback mechanism for application connections. To implement failover and failback, you can use one of the following disaster recovery strategies: | ||
|
|
||
| - [Network-based]({{<relref "/operate/rs/databases/active-active/disaster-recovery/network-based">}}): Global traffic managers and load balancers for routing. | ||
|
|
||
| - [Proxy-based]({{<relref "/operate/rs/databases/active-active/disaster-recovery/proxy-based">}}): Software proxies handle detection and routing logic. | ||
|
|
||
| - [Client library-based]({{<relref "/operate/rs/databases/active-active/disaster-recovery/client-library-based">}}): Database client libraries with built-in failover logic. | ||
|
|
||
| - [Application-based]({{<relref "/operate/rs/databases/active-active/disaster-recovery/application-based">}}): Custom application-level monitoring and connectivity management. | ||
|
|
||
| ## Detect failures with health checks | ||
|
|
||
| You can use the following health checks to help detect Active-Active database failures and determine when to fail over to a secondary Active-Active member or fail back to the primary member: | ||
|
|
||
| - [`PING`]({{<relref "/commands/ping">}}) or [`ECHO`]({{<relref "/commands/echo">}}). | ||
|
|
||
| - Connection timeouts or Redis errors. | ||
|
|
||
| - [Lag-aware database availability requests]({{<relref "/operate/rs/monitoring/db-availability#lag-aware">}}). | ||
|
|
||
| - Probing the keyspace with [`SET`]({{<relref "/commands/set">}}) or [`GET`]({{<relref "/commands/get">}}) commands to cover all available shards. | ||
|
|
||
| - A custom health check. | ||
|
|
||
| ## Considerations for disaster recovery | ||
|
|
||
| When implementing a disaster recovery strategy for an Active-Active database, consider the following: | ||
|
|
||
| - Is the Active-Active database an on-premises, cloud, multi-cloud, or hybrid-cloud deployment? | ||
|
|
||
| - Number of regions and availability zones. | ||
|
|
||
| - Application server redundancy and deployment locations. | ||
|
|
||
| - Acceptable values for the maximum amount of data that can be lost during a failure (Recovery Point Objective) and the maximum acceptable time to restore service after a failure (Recovery Time Objective). | ||
|
|
||
| - Latency and throughput requirements. | ||
|
|
||
| - Number of application errors that can be tolerated during a failure. | ||
|
|
||
| - Tolerance for reading stale but eventually consistent data during a failover scenario. | ||
|
|
||
| - Is concurrent access, in which different application servers can read from or write to different Active-Active database members, acceptable? | ||
|
|
||
| - Are there any regulatory or policy requirements for disaster recovery? | ||
|
|
||
| - Does the application connect to the Active-Active database using a Redis client library or through a development framework or ecosystem? | ||
|
|
||
| - Does the Active-Active database use DNS, the [OSS Cluster API]({{<relref "/operate/rs/clusters/optimize/oss-cluster-api">}}), or the [discovery service]({{<relref "/operate/rs/databases/durability-ha/discovery-service">}})? | ||
|
|
||
| - Is rate-limiting control needed? | ||
|
|
||
| - Can you modify the existing codebase or introduce new components, such as load balancers or proxies? |
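The health checks listed in `_index.md` (`PING`, connection timeouts or Redis errors, and `SET`/`GET` keyspace probes) could be combined in application code roughly as follows. This is a minimal sketch and not part of the pull request: `RedisLike`, `ProbeResult`, `probe_member`, and the `healthcheck:` key names are hypothetical. The client only needs `ping`, `set`, and `get` methods shaped like those of a redis-py client, and choosing probe keys that actually land on every shard is left to the caller.

```python
from dataclasses import dataclass
from typing import Optional, Protocol, Sequence


class RedisLike(Protocol):
    """Subset of client methods the probe uses (names match redis-py)."""

    def ping(self) -> bool: ...
    def set(self, name: str, value: str, ex: Optional[int] = None) -> Optional[bool]: ...
    def get(self, name: str) -> Optional[bytes]: ...


@dataclass
class ProbeResult:
    healthy: bool
    reason: str


def probe_member(client: RedisLike, probe_keys: Sequence[str]) -> ProbeResult:
    """Run PING, then a SET/GET round trip on each probe key.

    Connection timeouts and Redis errors surface as exceptions, so any
    exception marks the member as unhealthy.
    """
    try:
        if not client.ping():
            return ProbeResult(False, "PING returned a falsy reply")
        for key in probe_keys:
            # Short expiry so probe keys do not accumulate in the database.
            client.set(key, "ok", ex=10)
            # Accept bytes or str, depending on the client's decoding settings.
            if client.get(key) not in (b"ok", "ok"):
                return ProbeResult(False, f"unexpected value for {key}")
    except Exception as exc:  # timeouts, connection errors, Redis errors
        return ProbeResult(False, f"{type(exc).__name__}: {exc}")
    return ProbeResult(True, "all probes passed")
```

A caller might run `probe_member(client, ["healthcheck:a", "healthcheck:b"])` on a timer and feed the `healthy` flag into whichever failover strategy is in use. Because probe writes in an Active-Active database replicate to other members, a dedicated key prefix per member may be worth considering.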
16 changes: 16 additions & 0 deletions
content/operate/rs/databases/active-active/disaster-recovery/application-based.md
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,16 @@ | ||
| --- | ||
| Title: Application-based disaster recovery | ||
| alwaysopen: false | ||
| categories: | ||
| - docs | ||
| - operate | ||
| - rs | ||
| - rc | ||
| description: Application-based disaster recovery for Active-Active databases using custom application-level monitoring and connectivity management. | ||
| linkTitle: Application-based | ||
| weight: 40 | ||
| --- | ||
|
|
||
| For complete control over failover and failback, you can implement disaster recovery mechanisms directly in the application server. | ||
|
|
||
| For more information, see [Application failover with Active-Active databases]({{<relref "/operate/rs/databases/active-active/develop/app-failover-active-active">}}). |
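As an illustration of what failover and failback logic in the application server can look like, the following sketch tracks consecutive health check results for the primary member and switches endpoints once a threshold is reached. It is an assumption-laden example rather than the approach described in the linked guide: `FailoverController`, its thresholds, and the endpoint strings are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class FailoverController:
    """Chooses between a primary and a secondary Active-Active member."""

    primary: str
    secondary: str
    failure_threshold: int = 3   # consecutive failed checks before failover
    recovery_threshold: int = 5  # consecutive passed checks before failback
    active: str = ""
    _failures: int = 0
    _recoveries: int = 0

    def __post_init__(self) -> None:
        # Start on the primary (geographically nearest) member.
        self.active = self.primary

    def record_primary_check(self, healthy: bool) -> str:
        """Feed one health check result for the primary; return the endpoint to use."""
        if healthy:
            self._failures = 0
            self._recoveries += 1
            if self.active == self.secondary and self._recoveries >= self.recovery_threshold:
                self.active = self.primary  # fail back
        else:
            self._recoveries = 0
            self._failures += 1
            if self.active == self.primary and self._failures >= self.failure_threshold:
                self.active = self.secondary  # fail over
        return self.active
```

Requiring several consecutive results in each direction is one way to avoid flapping between members on a single transient error. The right thresholds depend on the Recovery Time Objective and on how many application errors can be tolerated.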
54 changes: 54 additions & 0 deletions
...nt/operate/rs/databases/active-active/disaster-recovery/client-library-based.md
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,54 @@ | ||
| --- | ||
| Title: Client library-based disaster recovery | ||
| alwaysopen: false | ||
| categories: | ||
| - docs | ||
| - operate | ||
| - rs | ||
| - rc | ||
| description: Client library-based disaster recovery for Active-Active databases using Redis client libraries with built-in failover logic. | ||
| linkTitle: Client library-based | ||
| weight: 30 | ||
| --- | ||
|
|
||
| Some Redis client libraries support geographic failover and failback. These client libraries monitor all Active-Active database members and instantiate connections for all endpoints in advance to allow faster failover and failback. | ||
|
|
||
| Advantages: | ||
|
|
||
| - No additional hardware or software components required. | ||
|
|
||
| - No high availability considerations. | ||
|
|
||
| - No scalability concerns. | ||
|
|
||
| - Tighter control over connectivity, such as timeouts, connection retries, and dynamic reconfiguration. | ||
|
|
||
| - OSS Cluster API support. | ||
|
|
||
| - Low latency. | ||
|
|
||
| Considerations: | ||
|
|
||
| - Requires code changes for failover and failback logic. | ||
|
|
||
| - Concurrent access across replicas is possible, but can be mitigated using the distributed health status provided by the database availability API requests. | ||
|
|
||
| - When a development framework uses Redis transparently, failover and failback might not be easy to configure. | ||
|
|
||
| The following diagram shows a client library-based disaster recovery approach: | ||
|
|
||
| <div class="flex justify-center"> | ||
| <img src="../../../../../../images/active-active-disaster-recovery/client-library.svg" alt="Diagram of client libraries routing traffic to Active-Active database members" width="50%"> | ||
| </div> | ||
|
|
||
| The following diagram shows a client library-based disaster recovery approach that also uses [connection pooling]({{<relref "/develop/clients/pools-and-muxing#connection-pooling">}}): | ||
|
|
||
| <div class="flex justify-center"> | ||
| <img src="../../../../../../images/active-active-disaster-recovery/client-library-connection-pool.svg" alt="Diagram of client libraries with connection pooling routing traffic to Active-Active database members" width="50%"> | ||
| </div> | ||
|
|
||
| For additional information, see the following client library guides for failover and failback: | ||
|
|
||
| - [Jedis (Java)]({{<relref "/develop/clients/jedis/failover">}}) | ||
|
|
||
| - [redis-py (Python)]({{<relref "/develop/clients/redis-py/failover">}}) |
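The idea of instantiating connections for all endpoints in advance can be sketched independently of any particular client library. The example below is not the Jedis or redis-py failover API (see the linked guides for those): `PreconnectedMembers`, `select`, and the `connect` factory are hypothetical names, and the caller is assumed to supply the set of currently healthy endpoints from its own health checks.

```python
from typing import Callable, Dict, Generic, Iterable, Sequence, TypeVar

C = TypeVar("C")


class PreconnectedMembers(Generic[C]):
    """Holds one pre-created client per Active-Active member endpoint."""

    def __init__(self, endpoints: Sequence[str], connect: Callable[[str], C]) -> None:
        # Endpoints are listed in priority order (nearest member first).
        self._order = list(endpoints)
        # Create every client up front so failover skips client setup.
        self._clients: Dict[str, C] = {ep: connect(ep) for ep in self._order}

    def select(self, healthy: Iterable[str]) -> C:
        """Return the client for the highest-priority endpoint that is healthy."""
        healthy_set = set(healthy)
        for endpoint in self._order:
            if endpoint in healthy_set:
                return self._clients[endpoint]
        raise RuntimeError("no healthy Active-Active member available")
```

Because the priority order is fixed, failback happens implicitly: as soon as the first endpoint reappears in the healthy set, `select` returns its client again. If the client library opens connections lazily, the `connect` factory may also need to send a command such as `PING` to establish the connection in advance.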
86 changes: 86 additions & 0 deletions
content/operate/rs/databases/active-active/disaster-recovery/network-based.md
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,86 @@ | ||
| --- | ||
| Title: Network-based disaster recovery | ||
| alwaysopen: false | ||
| categories: | ||
| - docs | ||
| - operate | ||
| - rs | ||
| - rc | ||
| description: Network-based disaster recovery for Active-Active databases using global traffic managers and load balancing solutions. | ||
| linkTitle: Network-based | ||
| weight: 10 | ||
| --- | ||
|
|
||
| Network-based solutions use DNS or load balancing to route traffic across regions without application changes. | ||
|
|
||
| Advantages: | ||
|
|
||
| - Because routing happens at the network level: | ||
|
|
||
|   - No application code changes are needed. | ||
|
|
||
|   - Development frameworks are agnostic and can connect to a single Active-Active database member's endpoint. | ||
|
|
||
| ## Cross-region availability | ||
|
|
||
| For cross-region availability, you can use a global traffic manager or a global load balancer. | ||
|
|
||
| Advantages: | ||
|
|
||
| - If DNS routing is available at the application level, no additional load balancer is required between the application and the data tier to resolve the Active-Active database member's FQDN, reducing latency. | ||
|
|
||
| - Protects against data center failure since failure in one region should not affect services running in another region. | ||
|
|
||
| ### Global traffic manager | ||
|
|
||
| A global traffic manager acts as an intelligent DNS server that directs clients to healthy endpoints based on distance, latency, or availability. You should configure the traffic manager to route to the local region first and fail over to other regions if an issue occurs. | ||
|
|
||
| Advantages: | ||
|
|
||
| - High availability. | ||
|
|
||
| - Latency optimization. | ||
|
|
||
| - Seamless disaster recovery. | ||
|
|
||
| Considerations: | ||
|
|
||
| - DNS propagation delays affect failover time. | ||
|
|
||
| - DNS caches can keep clients resolving to a failed member until the cached records expire. | ||
|
|
||
| - Limited custom health check support. | ||
|
|
||
| - May route traffic during CRDT synchronization, causing stale data reads. | ||
|
|
||
| The following diagram shows how a global traffic manager with DNS resolution routes traffic: | ||
|
|
||
| <div class="flex justify-center"> | ||
| <img src="../../../../../../images/active-active-disaster-recovery/gtm-with-DNS.svg" alt="Diagram of a global traffic manager routing applications to Active-Active database members across regions" width="50%"> | ||
| </div> | ||
|
|
||
| If the environment does not allow DNS resolution, you can use a load balancer to direct traffic to the cluster nodes: | ||
|
|
||
| <div class="flex justify-center"> | ||
| <img src="../../../../../../images/active-active-disaster-recovery/gtm-with-load-balancer.svg" alt="Diagram of a global traffic manager with a load balancer directing traffic to Active-Active database members across regions" width="50%"> | ||
| </div> | ||
|
|
||
| ### Global load balancer | ||
|
|
||
| For real-time traffic control and more advanced routing logic for cross-region failover and failback, you can use a global load balancer. However, this solution can have higher latency than a global traffic manager. | ||
|
|
||
| The following diagram shows how a global load balancer routes traffic between regions: | ||
|
|
||
| <div class="flex justify-center"> | ||
| <img src="../../../../../../images/active-active-disaster-recovery/global-load-balancer.svg" alt="Diagram of a global load balancer routing traffic between Active-Active database members in different regions" width="50%"> | ||
| </div> | ||
|
|
||
| ## Cross-zone availability | ||
|
|
||
| If your deployment does not require cross-region availability, you can use a regional load balancer to route requests to a healthy Active-Active database member in a different availability zone within the same region. | ||
|
|
||
| The following diagram shows how a regional load balancer routes traffic across availability zones: | ||
|
|
||
| <div class="flex justify-center"> | ||
| <img src="../../../../../../images/active-active-disaster-recovery/regional-load-balancer.svg" alt="Diagram of a regional load balancer routing traffic across availability zones within a single region" width="50%"> | ||
| </div> |
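To see why DNS propagation delays and caches matter for a global traffic manager, a rough upper bound on failover time can be estimated from the health check settings and the DNS record TTL. The model below is a simplifying assumption for illustration only: the function name and parameters are hypothetical, and real traffic managers, resolvers, and runtimes that cache lookups beyond the TTL can behave differently.

```python
def worst_case_failover_seconds(
    check_interval_s: float,
    failure_threshold: int,
    dns_ttl_s: float,
) -> float:
    """Estimate the worst-case time before clients reach a healthy member.

    Assumes the traffic manager marks a member unhealthy after
    `failure_threshold` consecutive failed checks spaced `check_interval_s`
    apart, and that a client can keep using a cached DNS answer for up to
    `dns_ttl_s` after the record changes.
    """
    detection_s = check_interval_s * failure_threshold
    return detection_s + dns_ttl_s


# Example: checks every 10 s, 3 failures to trip, 60 s TTL.
estimate = worst_case_failover_seconds(10, 3, 60)
print(f"worst-case failover: about {estimate} s")
```

With these example numbers, detection takes up to 30 seconds and cached DNS answers add up to 60 more, for about 90 seconds in total. Comparing this figure against the Recovery Time Objective can help decide whether a lower TTL or a load balancer is needed.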
98 changes: 98 additions & 0 deletions
content/operate/rs/databases/active-active/disaster-recovery/proxy-based.md
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,98 @@ | ||
| --- | ||
| Title: Proxy-based disaster recovery | ||
| alwaysopen: false | ||
| categories: | ||
| - docs | ||
| - operate | ||
| - rs | ||
| - rc | ||
| description: Proxy-based disaster recovery for Active-Active databases. | ||
| linkTitle: Proxy-based | ||
| weight: 20 | ||
| --- | ||
|
|
||
| If you add a lightweight proxy software component between the clients and the Active-Active database, applications can dynamically route requests to the optimal endpoint. | ||
|
|
||
| Advantages: | ||
|
|
||
| - Proxies provide proactive and reactive health check methods, such as polling target health periodically using either a TCP connection or an HTTP request, or monitoring live operations for errors. | ||
|
|
||
| - Proxies can be configured to run Active-Active health checks, such as the lag-aware database availability requests. | ||
|
|
||
| - If an Active-Active database member fails, a proxy can automatically detect the issue and redirect traffic to a healthy Active-Active database member without DNS propagation delays or client disconnections. This enables fast, controlled failover and minimizes downtime. | ||
|
|
||
| Considerations: | ||
|
|
||
| - If you do not use DNS to resolve the Active-Active database members' FQDNs: | ||
|
|
||
|   - The proxies must have static IPs. | ||
|
|
||
|   - If you add a new node to the cluster, you must configure the proxy with the new endpoint. | ||
|
|
||
| - A configuration syncer component is required to discover topology changes and reconfigure the proxy. | ||
|
|
||
| - Proxies introduce latency. | ||
|
|
||
| - Proxy failures can disconnect clients and cause disruptions. | ||
|
|
||
| ## Avoid concurrent access across replicas | ||
|
|
||
| If concurrent access across replicas must be avoided in every scenario, you can use a centralized proxy with a standby proxy instance for high availability. | ||
|
|
||
| Advantages: | ||
|
|
||
| - Prevents concurrent access across replicas. | ||
|
|
||
| - Failover and failback happen simultaneously for all application servers, regardless of the Active-Active health check policy. | ||
|
|
||
| Considerations: | ||
|
|
||
| - Although the proxy can be monitored with a watchdog and restarted in case of failure, this setup does not provide high availability for the proxy. | ||
|
|
||
| - Limited scalability. | ||
|
|
||
| The following diagram shows a centralized proxy architecture with a standby proxy instance: | ||
|
|
||
| <div class="flex justify-center"> | ||
| <img src="../../../../../../images/active-active-disaster-recovery/centralized-proxy.svg" alt="Diagram of a centralized proxy architecture with active and standby proxy instances routing to Active-Active database members" width="50%"> | ||
| </div> | ||
|
|
||
| ## Co-locate to reduce latency and improve scalability | ||
|
|
||
| To reduce latency and improve scalability, you can use a proxy co-located in the application server. | ||
|
|
||
| Advantages: | ||
|
|
||
| - Reduced latency. | ||
|
|
||
| - Better scalability. | ||
|
|
||
| Considerations: | ||
|
|
||
| - Failover and failback might not be simultaneous depending on the Active-Active health check policy. | ||
|
|
||
| The following diagram shows a co-located proxy architecture where each application server has its own proxy: | ||
|
|
||
| <div class="flex justify-center"> | ||
| <img src="../../../../../../images/active-active-disaster-recovery/co-located-proxy-and-app.svg" alt="Diagram of co-located proxy architecture where each application server has its own proxy instance" width="50%"> | ||
| </div> | ||
|
|
||
| ## Pool proxies for scalability | ||
|
|
||
| You can use a pool of active proxies to scale the routing layer. Application servers can balance new connections to the pool of proxies using a round-robin distribution algorithm, such as DNS-based round robin. | ||
|
|
||
| Advantages: | ||
|
|
||
| - High availability without complex monitoring and failover solutions. | ||
|
|
||
| - Flexible scalability of the routing layer. | ||
|
|
||
| Considerations: | ||
|
|
||
| - Concurrent access across replicas is possible, but can be mitigated using database availability API requests. | ||
|
|
||
| The following diagram shows a pool of proxies: | ||
|
|
||
| <div class="flex justify-center"> | ||
| <img src="../../../../../../images/active-active-disaster-recovery/proxy-pool.svg" alt="Diagram of a pool of active proxy instances" width="50%"> | ||
| </div> |
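The round-robin distribution of new connections across a pool of proxies can be sketched in a few lines. This example is an illustration under stated assumptions, not a recommended implementation: `ProxyPool`, `next_proxy`, and the proxy addresses are hypothetical, and in a DNS-based round robin the rotation would be done by the DNS server rather than by application code.

```python
from typing import AbstractSet, Sequence


class ProxyPool:
    """Hands out proxy addresses in round-robin order for new connections."""

    def __init__(self, proxies: Sequence[str]) -> None:
        if not proxies:
            raise ValueError("at least one proxy is required")
        self._proxies = list(proxies)
        self._next = 0

    def next_proxy(self, unavailable: AbstractSet[str] = frozenset()) -> str:
        """Return the next proxy in rotation, skipping any known to be down."""
        # Try each proxy at most once per call.
        for _ in range(len(self._proxies)):
            proxy = self._proxies[self._next]
            self._next = (self._next + 1) % len(self._proxies)
            if proxy not in unavailable:
                return proxy
        raise RuntimeError("no proxy available")
```

Each application server keeps its own rotation, so different servers can be connected through different proxies at the same time. That is why concurrent access across replicas remains possible with this layout unless the proxies share a common view of database availability.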
1 change: 1 addition & 0 deletions
static/images/active-active-disaster-recovery/centralized-proxy.svg
1 change: 1 addition & 0 deletions
static/images/active-active-disaster-recovery/client-library-connection-pool.svg
1 change: 1 addition & 0 deletions
static/images/active-active-disaster-recovery/client-library.svg
1 change: 1 addition & 0 deletions
static/images/active-active-disaster-recovery/co-located-proxy-and-app.svg
1 change: 1 addition & 0 deletions
static/images/active-active-disaster-recovery/global-load-balancer.svg
1 change: 1 addition & 0 deletions
static/images/active-active-disaster-recovery/gtm-with-load-balancer.svg
1 change: 1 addition & 0 deletions
static/images/active-active-disaster-recovery/regional-load-balancer.svg