How To: Restoring from backup #1081
If you have any questions on the process, please feel free to add to the discussion.
What would be different if the default Riak active anti-entropy had been used rather than tictac AAE? e.g.:
In this case Stage 5 would be more expensive, assuming the anti-entropy directory was not also backed up at the same point in time. Backing up the anti-entropy directory was not suggested even in the cold backup documentation - though it is not clear why this would not be a good idea in a cold backup scenario. The difference in Stage 5 would likely be an order of magnitude. The trade-off is that Stages 6 and 7 would be much more efficient and aggressive at closing the delta. The volume of keys repaired by each exchange can be seen in the logs, and will be much greater than with tictac_aae (by default it should try to fix the whole delta for each partition on each exchange). This can lead to some pressure on the cluster when such exchanges occur. When developing tictac AAE it was determined that this was a good trade-off, even though the performance difference in terms of time to resolve deltas is very high. Ultimately AAE is a backup to other anti-entropy mechanisms - not running AAE is considered to be passive rather than disabled anti_entropy, as read_repair and hinted handoff are always active in resolving entropy issues.
Overview
Instructions are available on restoring Riak nodes from backups - https://docs.riak.com/riak/kv/latest/using/cluster-operations/backing-up/index.html#restoring-a-node.
These instructions primarily focus on how to attach a new node to a cluster as a replacement for an old node. In this discussion I want to go into more detail on:
The focus is going to be on an updated Riak configuration, one using the new approaches introduced in Riak 2.9, and new features up to and including those in the Riak 3.0.8 release. In particular a cluster where:
AAE is run in native mode using tictac_aae, rather than the standard Riak active anti-entropy solution.
Stage 0 - Starting Point
The particular scenario we have is:
This cluster has been pre-loaded with some data. As we have tictac_aae, it is possible to use the aae_fold feature to find out how much, using the remote_console - see the sketch below.
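A minimal sketch of such a key count, assuming the riak_client:aae_fold/2 interface and illustrative query tuples (the bucket name is hypothetical):

```erlang
%% Run from `riak remote_console`. The query tuples are assumptions - check
%% riak_client:aae_fold/2 in your riak_kv release.
{ok, C} = riak:local_client(),
%% List the buckets known to AAE (cost scales with bucket count, not key count):
{ok, Buckets} = riak_client:aae_fold({list_buckets, 3}, C),
%% Count the objects in one (hypothetical) bucket:
{ok, Stats} = riak_client:aae_fold({object_stats, <<"testBucket">>, all, all}, C).
```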
So overall, there are nearly 500M keys in the cluster. The cluster has a ring size of 256 and an n_val of 3, so about 180M keys per node.
Note that this query is safe to run in Riak using aae_fold (whereas historically listing buckets and counting keys was not recommended). The complexity of the aae_fold list_buckets query scales with the number of buckets, not the number of keys - and the impact of the key counting is restricted by the size of the AF4 worker pool.
Stage 1 - Taking a Hot Backup
As of Riak 3.0.7 there is no standard external API for running a backup. The documented solution is to stop the node and take a backup. The leveldb backend has a hot backup solution which depends on providing a trigger via the filesystem.
The leveled hot backup solution can only be triggered at present by attaching to a node (via riak attach in Riak 2.x or riak remote_console in Riak 3.0) to prompt a cluster-wide hot backup; a sketch of such a call is given below. The inputs to the backup function are:
Do not attempt a backup unless all the nodes have enabled participate_in_coverage, as results in this state may be unexpected.
The backup function should quickly provide a snapshot across the cluster, without having to stop the cluster. It does this by sending a signal to all primary vnodes to hard-link their Journal files into the backup folder, before returning {ok, true} to the caller of the function.
All the files in the Journal are immutable - so the hard link will not increase the space on disk consumed by Riak. At some stage, the linked files may be re-written due to journal compaction (a compaction that re-writes the contents of a Journal file once a significant proportion of those contents has been replaced/deleted). If the backup still exists at the point of compaction, then it will increase the disk footprint, as the file system will then require an actual copy of the file to be made to the backup folder to permit the real file to be deleted.
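As a rough illustration, prompting the backup from remote_console might look like the following - the riak_client:hotbackup call, its arity and its arguments are assumptions here, so verify the exported function in your release first:

```erlang
%% Sketch only: prompt a cluster-wide leveled hot backup from remote_console.
{ok, C} = riak:local_client(),
%% The backup path is hypothetical; the two n_val-related arguments are assumptions.
riak_client:hotbackup("/var/backups/riak/hot_backup/", 3, 3, C).
```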
The process of transferring the data from the backup of the node (e.g. via rsync) is not governed by Riak.
In taking a snapshot in this way, the aim is to be non-disruptive to application traffic. In taking only a snapshot of the Journal, write amplification is minimised - so the delta size between successive backups should be the same order of magnitude as the changes within the store at that stage.
Note that the backup of cluster information is not included within the hot_backup function - only the vnode leveled data is copied.
Stage 2 - Recovering the Node from Backup
After the crash the node is recovered, with its previous configuration and the previous cluster metadata. The data has been recovered back to the backup folder (not to the standard $PLATFORM_DATA_DIR/leveled folder).
In this state the node is almost ready to restart. Note though that:
A node starting up with missing data does not necessarily represent an issue in Riak. All GET and PUT operations will respect the n_val, and so as long as the r and w values are at least quorum, the correct response will still be returned from the cluster even if a request is made on an object last touched in the period between the backup and the recovery. Read repair will then fix the gap on the recovering server. Follow the read_repairs_total metric to track this action.
However, secondary index queries are r=1 operations, and once the recovered node rejoins the cluster it will potentially contribute a portion of the answers - so 2i query results may in the short term be incomplete. When a node is known not to be in a good state (perhaps following a crash), it can be rejoined to the cluster but made ineligible for coverage plans by using the participate_in_coverage configuration option.
The participate_in_coverage option can also be controlled via the riak remote_console, as sketched below.
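A minimal sketch of those calls - the function names are described just below, but the riak_client module used here is an assumption:

```erlang
%% Run from `riak remote_console` on the recovering node. The module is an
%% assumption - check where these functions are exported in your release.
riak_client:remove_node_from_coverage(),  % equivalent of participate_in_coverage = false
riak_client:reset_node_for_coverage().    % revert to the setting in riak.conf
```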
The remove_node_from_coverage function will push the local node out of any coverage plans being generated within the cluster (the equivalent of setting participate_in_coverage to false). The reset_node_for_coverage function will return the node to the configured setting loaded at startup from the riak.conf file.
Note that participate_in_coverage is respected by all coverage queries - so that includes aae_folds as well as 2i queries.
Stage 3 - Node Startup (Recovering the Ledger)
For a Riak node to start up and announce the riak_kv service as available, Riak must first start each and every vnode. Until all the vnodes are started, logs such as this will be generated:
Until the riak_kv service is started the node is not fully up and participating in the cluster - so fallback vnodes will remain active, and the node will not be involved in coverage queries (regardless of the participate_in_coverage setting).
Note that all vnodes have to start initially, not just those expected to be primary for this node. At startup the node knows nothing about the health of the rest of the cluster, so it starts ready to be in a partition of one node with all 256 (if this is the ring-size) vnodes active.
The startup of each vnode will not be complete until the leveled backend for that vnode can rebuild the ledger by rolling over the Journal, accumulating the object history. As the ledger is being loaded, the vnode will periodically report the SQN it has reached:
The vnodes are started in batches, to prevent too much concurrent work on the cluster. The default batch size is 16, controlled by the riak_core environment variable vnode_rolling_start (see the sketch below).
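As a sketch, this might be overridden in advanced.config along these lines - the assumption being that the setting is read from the riak_core application environment:

```erlang
%% advanced.config sketch - increase the vnode start batch size with caution,
%% as ledger rebuilds are CPU heavy.
[
 {riak_core, [
   {vnode_rolling_start, 16}  % default shown; raise to start more vnodes concurrently
 ]}
].
```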
This process may take a long time; in the case of the 500M key cluster it took over 8 hours to recover all the ledgers and start the node. This may be improved upon by increasing the startup concurrency, but caution is advised as the process is CPU heavy. Here is the CPU utilisation on a 12-CPU node during node start with ledger recovery:
Stage 4 - Hinted Handoff
Once the node is started and connected to the cluster, any data received since the node went down will be sent from the fallback vnodes back to the primary vnodes on the recovered server:
These transfers are relatively quick, and will include any data PUT in the period when the node was down - and also any data that was the subject of a GET in that period. The last part isn't intuitive, and is because of read repair. Fallback vnodes are subject to read_repair from fetches, and so any object that is fetched for a preflist covered by a fallback vnode during the outage will also be added to that fallback vnode by read repair, and then sent back to the primary via hinted handoff when the primary is recovered.
Note that there will be 3 fallback vnodes elected for every primary on the failed node, so there will be a significant number of handoffs to complete. In this test there were 5M objects that needed to be handed off, and this took 45 minutes.
The speed of handoffs can be influenced by the cluster transfer limit - configured as transfer_limit in riak.conf, and also changeable dynamically via riak admin transfer-limit. Generally a limit greater than 2 can be tolerated; however, if resource limits are hit, increasing the limit can have negative side effects as transfers hit timeouts and then need to be restarted (repeating work already undertaken).
The progress of transfers can also be monitored from the command line (e.g. via riak admin transfers).
Stage 5 - Rebuild AAE Status
Stage 5 will happen concurrently with Stage 4, and overlap with Stage 3. When each individual vnode on the recovered node is up, and there is no persisted AAE status, a rebuild of the AAE status for that vnode will be prompted. This may occur before all vnodes on the node are up. Until the recovered node has the correct AAE status cached, it will fail to correctly exchange any remaining delta (i.e. the data received between the backup being taken and the failure occurring) - exchanges will not occur before all vnodes are up and the node has fully started.
The completion of the rebuild requires a fold over the newly built (and cached) ledger, not the journal - and so this is much faster than the ledger rebuild in Stage 3. In this test case the rebuild time varied between 5 and 15 minutes per vnode, depending on what concurrent CPU-heavy activity was occurring.
By default, only one vnode can rebuild AAE Status at a time. The concurrency is controlled via the best endeavours worker pool.
The completion of each AAE status rebuild is logged:
Stage 6 - Repairing the Delta
There now exists a remaining delta, representing all the PUTs related to the failed node which occurred between the point of the backup and the point of the failure (where the PUT was not overwritten following the failure).
The repair of this delta is managed via AAE. However, the AAE process is designed to repair such deltas slowly. Because the existence of a delta can be safely managed (via use of quorum r and w values, and participate_in_coverage), the default configuration minimises the overheads of AAE repair by severe rate limiting. The rate limiting is severe because a trade-off with tictac AAE in native mode (when compared with standard active anti-entropy) is that the fetching of clocks to compare between mismatched segments is very CPU intensive.
The number of repairs committed each cycle is controlled using the tictac AAE max results control (tictacaae_maxresults). This can be increased; however, increasing it by an order of magnitude will commonly cause cycles to overrun the tick timer between cycles - and this will cause cycles to be skipped, slowing the process once again.
From remote_console, the max results can be altered at run-time e.g.:
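A hedged sketch of doing so - applying the riak_kv tictacaae_maxresults environment variable via application:set_env is an assumption here, and the value is purely illustrative:

```erlang
%% Run from `riak remote_console`; applies to this node and all connected nodes.
rpc:multicall(application, set_env, [riak_kv, tictacaae_maxresults, 2048]).
```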
The outcome of each exchange is logged:
The mismatched_segments value is an indicator of how many keys are in need of repair for this exchange (there are exchanges for each preflist, n_val and partition combination). The keys_passed_for_repair value indicates how many have been repaired this loop (this will normally be close to the tictacaae_maxresults). There are only 1M segments, so mismatched_segments approaching this level may indicate that many more keys are in need of repair (or that AAE trees are still to be rebuilt).
If increasing tictacaae_maxresults leads to the skipping of exchanges, this is logged:
Although increasing it can speed up the process, repairing a large delta may still take more than 24 hours. This time can be brought down by running the aae_fold repair_keys_range (scheduled for release in Riak 3.0.8). This fold takes a Bucket, an optional Key Range, and an optional range of last_modified dates - and will prompt the fetching of all keys in that range. The fetching of the keys will then prompt read_repair when necessary.
If we know the timestamp of the backup, as well as the timestamp of the node failure, the exact delta can be targeted - for example, by running something like the sketch below from remote_console.
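A hedged sketch of such a fold - the exact shape of the repair_keys_range query tuple is an assumption (check the Riak 3.0.8 release notes), and the bucket name and timestamps are hypothetical:

```erlang
{ok, C} = riak:local_client(),
BackupTS  = 1618824000,   % hypothetical unix timestamp of the backup
FailureTS = 1618915200,   % hypothetical unix timestamp of the failure
riak_client:aae_fold({repair_keys_range,
                      <<"testBucket">>,             % hypothetical bucket
                      all,                          % no key range restriction
                      {date, BackupTS, FailureTS},  % last_modified range (assumed format)
                      all},
                     C).
```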
The pace of these read repairs is limited by the concurrency controls on the AF4 worker queue. Note that it is important to ensure that the recovering node does not participate_in_coverage when running this query - or a section of required repairs will be missed.
If the time ranges are not known, it is possible to do some investigative work into the likely time range and impacted bucket and key ranges by logging the repairs which AAE is prompting - by enabling the logging of read repairs via remote_console on the recovering node (see the sketch below). Each individual repair will then be logged with the Bucket, Key and mismatched clocks (which contain the timestamps of updates).
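A sketch of enabling that logging - the riak_kv environment variable name used here (log_readrepair) is an assumption, so check the read repair logging option available in your release:

```erlang
%% Run from `riak remote_console` on the recovering node only.
application:set_env(riak_kv, log_readrepair, true).
```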
The timestamps in the clocks are in gregorian seconds, which is 62167219200 seconds more than the unix epoch time.
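As a worked example using the standard OTP calendar module (the timestamp value is hypothetical):

```erlang
GregSecs = 63786757200,                            % value taken from a logged clock
calendar:gregorian_seconds_to_datetime(GregSecs),  % returns {{2021,4,27},{15,40,0}}
UnixSecs = GregSecs - 62167219200.                 % the equivalent unix epoch time
```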
Stage 7 - Re-repairing the Delta (Key Amnesia)
In theory, running Stage 6 should reduce the recovery time for the delta from the order of a day to the order of an hour. However, when recovering from a backup, running this repair will fix no more than two thirds of the delta - at least one third will remain un-repaired due to key amnesia.
This can be resolved simply by re-running the repair_keys_range query.
The reason why this occurs, and why a second run of the query is required, can be confusing. It is a result of a series of safety checks in Riak to prevent data loss in more complicated failure scenarios. Further details are available.
The vector clock for each object contains a counter of how many times each vnode has coordinated an update for that object. Every time the vnode reads an object before updating it, it will compare the vector clock on the stored object with the vector clock on the inbound object - and in particular the counter for this vnode ID. The counter for the stored object should be at least as large as the counter for the inbound object; otherwise the vnode backend has dropped data.
In this case, the vnode backend has dropped data! Any PUT coordinated on the node which failed, between the backup and the failure, will show key amnesia. This is logged:
When key amnesia occurs, the vector clock is changed on the inbound object to reflect a history of a change by this vnode at a new epoch, as well as the previous change. This does not change the object itself: the stored object will be the inbound (and in this case correct) object, with the correct value and the original last-modified date. Only the vector clock will change on the object - but this is a sufficient change to cause a new AAE mismatch.
Although the objects are now the same, as the vector clocks differ a second repair is required to bring the reported state to an equilibrium within the cluster. This second repair is necessary whether an AAE exchange or a repair_keys_range query prompted the repair.
In the test run, the graph below shows the slow repair of keys through normal AAE, and then step changes in repairs as the repair_keys_range fold is used to address both the delta, and the proportion of the delta re-formed due to key amnesia:
Once all keys are repaired, the pending_state in the exchange logs (EX003) will switch from clock_compare to root_compare:
At this stage it is safe to permit the node to participate_in_coverage.
Outstanding Issues
Some issues are still to be addressed, which may improve this process. Some of these issues may be resolved in the release of Riak 3.0.8: