[doc][ybm] DR for Aeon #26325

Open
wants to merge 3 commits into
base: master
@@ -0,0 +1,86 @@
---
title: Configure disaster recovery for an Aeon cluster
headerTitle: Disaster Recovery
linkTitle: Disaster recovery
description: Enable disaster recovery for clusters
headContent: Fail over to a replica cluster in case of unplanned outages
tags:
  feature: early-access
menu:
  preview_yugabyte-cloud:
    parent: cloud-clusters
    identifier: disaster-recovery-aeon
    weight: 500
type: indexpage
showRightNav: true
---

Use xCluster Disaster Recovery (DR) to recover from an unplanned outage (failover) or to perform a planned switchover. Planned switchover is commonly used for business continuity and disaster recovery testing, and failback after a failover.

A DR configuration consists of the following:

- a DR primary cluster, which serves both reads and writes.
- a DR replica cluster, which can also serve reads.

## RPO and RTO for failover and switchover

Data from the DR primary is replicated asynchronously to the DR replica, which is read-only. Because replication is asynchronous, DR failover results in a non-zero recovery point objective (RPO). In other words, data not yet committed on the DR replica _can be lost_ during a failover. The amount of data loss depends on the replication lag, which in turn depends on the network characteristics between the clusters. By contrast, during a switchover the RPO is zero and no data is lost, because the switchover waits for all data to be committed on the DR replica _before_ switching over.
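As a rough illustration of how replication lag translates into potential data loss, the following Python sketch multiplies lag by write throughput. The function name, the lag value, and the write rate are all hypothetical; the actual figure to rely on is the one the DR configuration reports for your clusters.

```python
def estimate_writes_at_risk(replication_lag_s: float, writes_per_s: float) -> float:
    """Rough estimate of writes not yet on the DR replica (illustrative only)."""
    if replication_lag_s < 0 or writes_per_s < 0:
        raise ValueError("lag and write rate must be non-negative")
    return replication_lag_s * writes_per_s


# Failover with an assumed 1.5 s of lag at an assumed 2,000 writes/s.
failover_risk = estimate_writes_at_risk(1.5, 2000)

# Switchover waits for the replica to catch up, so the effective lag is zero.
switchover_risk = estimate_writes_at_risk(0.0, 2000)
```

With these assumed numbers, a failover puts about 3,000 writes at risk, while a switchover puts none at risk.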

The recovery time objective (RTO) for failover or switchover is very low, and is determined by how long it takes applications to switch their connections from one cluster to the other. Design your applications so that this switch happens as quickly as possible.
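One way to keep the connection switch short is to have the application look up the current DR primary in a single place instead of hard-coding a host. The following Python sketch assumes a hypothetical `DrEndpoints` helper and placeholder host names; it is not part of any YugabyteDB client library.

```python
from dataclasses import dataclass


@dataclass
class DrEndpoints:
    """Tracks which host currently plays which DR role (hypothetical helper)."""

    primary: str
    replica: str

    def swap_roles(self) -> None:
        # After a failover or switchover, the former replica becomes the primary.
        self.primary, self.replica = self.replica, self.primary

    def write_host(self) -> str:
        # Writes always go to whichever cluster is currently the DR primary.
        return self.primary


# Placeholder host names, for illustration only.
endpoints = DrEndpoints(primary="primary.example.com", replica="replica.example.com")
endpoints.swap_roles()
new_write_host = endpoints.write_host()
```

Because every new connection asks `write_host()` for its target, swapping roles in one place is enough to redirect the application.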

DR also allows the clusters to switch roles during planned switchover and unplanned failover scenarios.

![Disaster recovery](/images/yb-platform/disaster-recovery/disaster-recovery.png)

{{<lead link="../../../yugabyte-platform/back-up-restore-universes/disaster-recovery/#xcluster-dr-vs-xcluster-replication">}}
[xCluster DR vs xCluster Replication](../../../yugabyte-platform/back-up-restore-universes/disaster-recovery/#xcluster-dr-vs-xcluster-replication)
{{</lead>}}

&nbsp;

{{<index/block>}}

{{<index/item
title="Set up Disaster Recovery"
body="Designate a cluster to act as a DR replica."
href="disaster-recovery-setup/"
icon="fa-thin fa-umbrella">}}

{{<index/item
title="Unplanned failover"
body="Fail over to the DR replica in case of an unplanned outage."
href="disaster-recovery-failover/"
icon="fa-thin fa-cloud-bolt-sun">}}

{{<index/item
title="Planned switchover"
body="Switch over to the DR replica for planned testing and failback."
href="disaster-recovery-switchover/"
icon="fa-thin fa-toggle-on">}}

{{<index/item
title="Add and remove tables and indexes"
body="Perform DDL changes to databases in replication."
href="disaster-recovery-tables/"
icon="fa-thin fa-plus-minus">}}

{{</index/block>}}

## Schema changes

Table- and index-level schema changes must be performed on the clusters in the following order:

1. The DR primary cluster.
2. The DR replica cluster.

You don't need to make any changes to the DR configuration.

{{<lead link="./disaster-recovery-tables/">}}
To learn more, refer to [Manage tables and indexes](./disaster-recovery-tables/)
{{</lead>}}
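The ordering above can be sketched in Python as a small wrapper that runs the same DDL statement on the DR primary and then on the DR replica. The `apply_schema_change` function, the executor callables, and the `orders` table are assumptions for illustration; in practice each executor would run the statement through your database driver against the corresponding cluster.

```python
from typing import Callable, List, Tuple


def apply_schema_change(
    ddl: str,
    run_on_primary: Callable[[str], None],
    run_on_replica: Callable[[str], None],
) -> None:
    """Run one DDL statement on the DR primary first, then on the DR replica."""
    run_on_primary(ddl)  # If this raises, the replica is left untouched.
    run_on_replica(ddl)


# Stand-in executors that only record the order of execution.
executed: List[Tuple[str, str]] = []
apply_schema_change(
    "ALTER TABLE orders ADD COLUMN note TEXT",  # hypothetical table and column
    run_on_primary=lambda sql: executed.append(("primary", sql)),
    run_on_replica=lambda sql: executed.append(("replica", sql)),
)
```

Routing every schema change through one function makes it harder to apply a change to the replica before the primary by accident.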

## Limitations

- Currently, automatic replication of DDL (SQL-level changes such as creating or dropping tables or indexes) is not supported. For more details on how to propagate DDL changes from the DR primary to the DR replica, see [Manage tables and indexes](./disaster-recovery-tables/).

- If a database operation requires a full copy, any application sessions on the database on the DR replica are interrupted while the database is dropped and recreated. Your application should either retry connections or redirect reads to the DR primary.
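A minimal Python sketch of the retry-or-redirect approach follows. It assumes hypothetical `read_from_replica` and `read_from_primary` callables that wrap your driver's query calls, and it uses the built-in `ConnectionError` as a stand-in for whatever exception your driver raises when a session is dropped.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def read_with_fallback(
    read_from_replica: Callable[[], T],
    read_from_primary: Callable[[], T],
    attempts: int = 3,
    base_delay_s: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry a read on the DR replica, then fall back to the DR primary."""
    for attempt in range(attempts):
        try:
            return read_from_replica()
        except ConnectionError:
            # Back off exponentially between retries; skip the wait after the last attempt.
            if attempt < attempts - 1:
                sleep(base_delay_s * 2 ** attempt)
    # The replica is still unavailable (for example, during a full copy): use the primary.
    return read_from_primary()
```

The retry count and delays are illustrative defaults; tune them to how long a full copy typically takes in your environment, or skip the retries and redirect reads to the DR primary immediately.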
@@ -0,0 +1,71 @@
---
title: Unplanned failover to a target Aeon cluster
headerTitle: Unplanned failover
linkTitle: Failover
description: Unplanned failover to a target cluster
headContent: Fail over application traffic to the DR replica
menu:
  preview_yugabyte-cloud:
    parent: disaster-recovery-aeon
    identifier: disaster-recovery-failover-aeon
    weight: 30
type: docs
---

Unplanned failover is the process of switching application traffic to the DR replica cluster when the DR primary cluster becomes unavailable. A common cause is an outage of the primary region.

## Perform failover

Use the following procedure to perform an unplanned failover to the DR replica and resume applications.

If the DR primary is terminated or otherwise becomes unavailable, do the following:

1. Stop the application traffic to ensure no more updates are attempted.

1. Navigate to the **Disaster Recovery** tab of your DR primary cluster and select the replication configuration.

1. Note the **Potential data loss on failover** to understand the extent of possible data loss as a result of the outage, and determine if the extent of data loss is acceptable for your situation.

    - The potential data loss is computed as the safe time lag that existed at the current safe time on the DR replica.
    - Use the **Tables** tab to understand which specific tables have the highest safe time lag and replication lag.

    For more information on replication metrics, refer to [Replication](../../../../launch-and-manage/monitor-and-alert/metrics/replication/).

1. To proceed, click **Actions** and choose **Failover**.

1. Enter the name of the DR replica and click **Initiate Failover**.

1. Resume the application traffic on the new DR primary.

At this point, the DR configuration is halted and needs to be repaired.

![Disaster recovery failed](/images/yb-platform/disaster-recovery/disaster-recovery-failed.png)

## Repair DR after failover

There are two options for repairing a DR configuration that has failed over:

- If the original DR primary has recovered and is fully functional with no active alerts, you can configure DR to use this cluster as the DR replica.
- If the original DR primary cannot be recovered, create a new cluster and configure it as the DR replica (see [Prerequisites](../disaster-recovery-setup/#prerequisites)).

In both cases, repairing DR involves making a full copy of the databases through the backup-restore process.

To repair DR, do the following:

1. Navigate to the **Disaster Recovery** tab of your (new) DR primary cluster and select the replication configuration.

1. Click **Repair DR** to display the **Repair DR** dialog.

    ![Repair DR](/images/yb-platform/disaster-recovery/disaster-recovery-repair.png)

1. If the current DR replica (formerly the DR primary) has recovered and is fully functional with no active alerts, choose **Reuse the current DR replica**.

    To use a new cluster as the DR replica, choose **Select a new cluster as DR replica** and select the cluster.

1. Click **Initiate Repair**.

After the repair is complete, if you want the DR replica (that is, the former primary if you chose to reuse it, or the new cluster you added as the DR replica) to become the DR primary, follow the steps for [Planned switchover](../disaster-recovery-switchover/).

{{< warning title="Important" >}}
Do not attempt a switchover until you have repaired DR.
{{< /warning >}}