How To: Restoring from backup #1081
If you have any questions on the process, please feel free to add to the discussion.
What would be different if the default Riak active anti-entropy had been used rather than tictac AAE? e.g.:
In this case Stage 5 would be more expensive, assuming the anti-entropy directory was not also backed up at the same point in time. Backing up the anti-entropy directory was not suggested even in the cold backup documentation - though it is not clear why this would not be a good idea in a cold backup scenario. The difference in Stage 5 would likely be an order of magnitude. The trade-off is that Stages 6 and 7 would be much more efficient and aggressive at closing the delta. The volume of keys repaired by each exchange can be seen in the logs, and will be much greater than with tictac_aae (by default it should try to fix the whole delta for each partition on each exchange). This can lead to some pressure on the cluster when such exchanges occur. When developing tictac AAE it was determined that this was a good trade-off, even though the performance difference in terms of time to resolve deltas is very high. Ultimately AAE is a backup to other anti-entropy mechanisms - not running AAE is considered to be passive rather than disabled anti_entropy, as read_repair and hinted handoff are always active in resolving entropy issues.
Overview
Instructions are available on restoring Riak nodes from backups - https://docs.riak.com/riak/kv/latest/using/cluster-operations/backing-up/index.html#restoring-a-node.
These instructions primarily focus on how to attach a new node to a cluster as a replacement for an old node. In this discussion I want to go into more detail on:
The focus is going to be on an updated Riak configuration, one using the new approaches introduced in Riak 2.9, and new features up to and including those in the Riak 3.0.8 release. In particular a cluster where:
AAE is run in native mode using tictac_aae, rather than the standard Riak active anti-entropy solution.
Stage 0 - Starting Point
The particular scenario we have is:
This cluster has been pre-loaded with some data. As we have tictac_aae, it is possible to use the aae_fold feature to find out how much, using the remote_console - see the sketch below.
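A minimal sketch of such a key count, assuming the riak_client:aae_fold/2 interface and illustrative query tuples (the bucket name is hypothetical):

```erlang
%% Run from `riak remote_console`. The query tuples are assumptions - check
%% riak_client:aae_fold/2 in your riak_kv release.
{ok, C} = riak:local_client(),
%% List the buckets known to AAE (cost scales with bucket count, not key count):
{ok, Buckets} = riak_client:aae_fold({list_buckets, 3}, C),
%% Count the objects in one (hypothetical) bucket:
{ok, Stats} = riak_client:aae_fold({object_stats, <<"testBucket">>, all, all}, C).
```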
So overall, there are nearly 500M keys in the cluster. The cluster has a ring size of 256 and an n_val of 3, so about 180M keys per node.
Note that this query is safe to run in Riak using aae_fold (whereas historically listing buckets and counting keys was not recommended). The complexity of the aae_fold list_buckets query scales with the number of buckets, not the number of keys - and the impact of the key counting is restricted by the size of the AF4 worker pool.
Stage 1 - Taking a Hot Backup
As of Riak 3.0.7 there is no standard external API for running a backup. The documented solution is to stop the node and take a backup. The leveldb backend has a hot backup solution which depends on providing a trigger via the filesystem.
The leveled hot backup solution can only be triggered at present by attaching to a node (via riak attach in Riak 2.x or riak remote_console in Riak 3.0) to prompt a cluster-wide hot backup; a sketch of such a call is given below. The inputs to the backup function are:
Do not attempt a backup unless all the nodes have enabled participate_in_coverage, as results in this state may be unexpected.
The backup function should quickly provide a snapshot across the cluster, without having to stop the cluster. It does this by sending a signal to all primary vnodes to hard-link their Journal files into the backup folder, before returning {ok, true} to the caller of the function.
All the files in the Journal are immutable - so the hard link will not increase the space on disk consumed by Riak. At some stage, the linked files may be re-written due to journal compaction (a compaction that re-writes the contents of a Journal file once a significant proportion of those contents has been replaced/deleted). If the backup still exists at the point of compaction, then it will increase the disk footprint, as the file system will then require an actual copy of the file to be made to the backup folder to permit the real file to be deleted.
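As a rough illustration, prompting the backup from remote_console might look like the following - the riak_client:hotbackup call, its arity and its arguments are assumptions here, so verify the exported function in your release first:

```erlang
%% Sketch only: prompt a cluster-wide leveled hot backup from remote_console.
{ok, C} = riak:local_client(),
%% The backup path is hypothetical; the two n_val-related arguments are assumptions.
riak_client:hotbackup("/var/backups/riak/hot_backup/", 3, 3, C).
```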
The process of transferring the data from the backup of the node (e.g. via rsync) is not governed by Riak.
In taking a snapshot in this way, the aim is to be non-disruptive to application traffic. In taking only a snapshot of the Journal, write amplification is minimised - so the delta size between successive backups should be the same order of magnitude as the changes within the store at that stage.
Note that the backup of cluster information is not included within the hot_backup function - only the vnode leveled data is copied.
Stage 2 - Recovering the Node from Backup
After the crash the node is recovered, with its previous configuration and the previous cluster metadata. The data has been recovered back to the backup folder (not to the standard $PLATFORM_DATA_DIR/leveled folder).
In this state the node is almost ready to restart. Note though that:
A node starting up with missing data does not necessarily represent an issue in Riak. All GET and PUT operations will respect the n_val, and so as long as the r and w values are at least quorum, the correct response will still be returned from the cluster even if a request is made on an object last touched in the period between the backup and the recovery. Read repair will then fix the gap on the recovering server. Follow the read_repairs_total metric to track this action.
However, secondary index queries are r=1 operations, and once the recovered node rejoins the cluster it will potentially contribute a portion of the answers - so 2i query results may in the short term be incomplete. When a node is known not to be in a good state (perhaps following a crash), it can be rejoined to the cluster but made ineligible for coverage plans by using the participate_in_coverage configuration option.
The participate_in_coverage option can also be controlled via the riak remote_console, as sketched below.
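A minimal sketch of those calls - the function names are described just below, but the riak_client module used here is an assumption:

```erlang
%% Run from `riak remote_console` on the recovering node. The module is an
%% assumption - check where these functions are exported in your release.
riak_client:remove_node_from_coverage(),  % equivalent of participate_in_coverage = false
riak_client:reset_node_for_coverage().    % revert to the setting in riak.conf
```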
The remove_node_from_coverage function will push the local node out of any coverage plans being generated within the cluster (the equivalent of setting participate_in_coverage to false). The reset_node_for_coverage function will return the node to the configured setting loaded at startup from the riak.conf file.
Note that participate_in_coverage is respected by all coverage queries - so that includes aae_folds as well as 2i queries.
Stage 3 - Node Startup (Recovering the Ledger)
For a Riak node to start up and announce the riak_kv service as available, Riak must first start each and every vnode. Until all the vnodes are started, logs such as this will be generated:
Until the riak_kv service is started the node is not fully up and participating in the cluster - so fallback vnodes will remain active, and the node will not be involved in coverage queries (regardless of the participate_in_coverage setting).
Note that all vnodes have to start initially, not just those expected to be primary for this node. At startup the node knows nothing about the health of the rest of the cluster, so it starts ready to be in a partition of one node with all 256 (if this is the ring-size) vnodes active.
The startup of each vnode will not be complete until the leveled backend for that vnode can rebuild the ledger by rolling over the Journal, accumulating the object history. As the ledger is being loaded, the vnode will periodically report the SQN it has reached:
The vnodes are started in batches, to prevent too much concurrent work on the cluster. The default batch size is 16, controlled by the riak_core environment variable vnode_rolling_start (see the sketch below).
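As a sketch, this might be overridden in advanced.config along these lines - the assumption being that the setting is read from the riak_core application environment:

```erlang
%% advanced.config sketch - increase the vnode start batch size with caution,
%% as ledger rebuilds are CPU heavy.
[
 {riak_core, [
   {vnode_rolling_start, 16}  % default shown; raise to start more vnodes concurrently
 ]}
].
```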
This process may take a long time; in the case of the 500M key cluster it took over 8 hours to recover all the ledgers and start the node. This may be improved upon by increasing the startup concurrency, but caution is advised as the process is CPU heavy. Here is the CPU utilisation on a 12-CPU node during node start with ledger recovery:
Stage 4 - Hinted Handoff
Once the node is started and connected to the cluster, any data received since the node went down will be sent from the fallback vnodes back to the primary vnodes on the recovered server:
These transfers are relatively quick, and will include any data PUT in the period when the node was down - and also any data that was the subject of a GET in that period. The last part isn't intuitive, and is because of read repair. Fallback vnodes are subject to read_repair from fetches, and so any object that is fetched for a preflist covered by a fallback vnode during the outage will also be added to that fallback vnode by read repair, and then sent back to the primary via hinted handoff when the primary is recovered.
Note that there will be 3 fallback vnodes elected for every primary on the failed node, so there will be a significant number of handoffs to complete. In this test there were 5M objects that needed to be handed off, and this took 45 minutes.
The speed of handoffs can be influenced by the cluster transfer limit - configured as transfer_limit in riak.conf, and also changeable dynamically via riak admin transfer-limit. Generally a limit greater than 2 can be tolerated; however, if resource limits are hit, increasing the limit can have negative side effects as transfers hit timeouts and then need to be restarted (repeating work already undertaken).
The progress of transfers can also be monitored from the command line (e.g. via riak admin transfers).
Stage 5 - Rebuild AAE Status
Stage 5 will happen concurrently with Stage 4, and overlap with Stage 3. When each individual vnode on the recovered node is up, and there is no persisted AAE status, a rebuild of the AAE status for that vnode will be prompted. This may occur before all vnodes on the node are up. Until the recovered node has the correct AAE status cached, it will fail to correctly exchange any remaining delta (i.e. the data received between the backup being taken and the failure occurring) - exchanges will not occur before all vnodes are up and the node has fully started.
The completion of the rebuild requires a fold over the newly built (and cached) ledger, not the journal - and so this is much faster than the ledger rebuild in Stage 3. In this test case the rebuild time varied between 5 and 15 minutes per vnode, depending on what concurrent CPU-heavy activity was occurring.
By default, only one vnode can rebuild AAE Status at a time. The concurrency is controlled via the best endeavours worker pool.
The completion of each AAE status rebuild is logged:
Stage 6 - Repairing the Delta
There now exists a remaining delta, representing all the PUTs related to the failed node which occurred between the point of the backup and the point of the failure (where the PUT was not overwritten following the failure).
The repair of this delta is managed via AAE. However, the AAE process is designed to repair such deltas slowly. Because the existence of a delta can be safely managed (via use of quorum r and w values, and participate_in_coverage), the default configuration minimises the overheads of AAE repair by severe rate limiting. The rate limiting is severe because a trade-off with tictac AAE in native mode (when compared with standard active anti-entropy) is that the fetching of clocks to compare between mismatched segments is very CPU intensive.
The number of repairs committed each cycle is controlled using the tictac AAE max results control (tictacaae_maxresults). This can be increased; however, increasing it by an order of magnitude will commonly cause cycles to overrun the tick timer between cycles - and this will cause cycles to be skipped, slowing the process once again.
From remote_console, the max results can be altered at run-time e.g.:
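A hedged sketch of doing so - applying the riak_kv tictacaae_maxresults environment variable via application:set_env is an assumption here, and the value is purely illustrative:

```erlang
%% Run from `riak remote_console`; applies to this node and all connected nodes.
rpc:multicall(application, set_env, [riak_kv, tictacaae_maxresults, 2048]).
```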
The outcome of each exchange is logged:
The mismatched_segments value is an indicator of how many keys are in need of repair for this exchange (there are exchanges for each preflist, n_val and partition combination). The keys_passed_for_repair value indicates how many have been repaired this loop (this will normally be close to the tictacaae_maxresults). There are only 1M segments, so mismatched_segments approaching this level may indicate that many more keys are in need of repair (or that AAE trees are still to be rebuilt).
If increasing tictacaae_maxresults leads to the skipping of exchanges, this is logged:
Although increasing it can speed up the process, repairing a large delta may still take more than 24 hours. This time can be brought down by running the aae_fold repair_keys_range (scheduled for release in Riak 3.0.8). This fold takes a Bucket, an optional Key Range, and an optional range of last_modified dates - and will prompt the fetching of all keys in that range. The fetching of the keys will then prompt read_repair when necessary.
If we know the timestamp of the backup, as well as the timestamp of the node failure, the exact delta can be targeted - for example, by running something like the sketch below from remote_console.
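A hedged sketch of such a fold - the exact shape of the repair_keys_range query tuple is an assumption (check the Riak 3.0.8 release notes), and the bucket name and timestamps are hypothetical:

```erlang
{ok, C} = riak:local_client(),
BackupTS  = 1618824000,   % hypothetical unix timestamp of the backup
FailureTS = 1618915200,   % hypothetical unix timestamp of the failure
riak_client:aae_fold({repair_keys_range,
                      <<"testBucket">>,             % hypothetical bucket
                      all,                          % no key range restriction
                      {date, BackupTS, FailureTS},  % last_modified range (assumed format)
                      all},
                     C).
```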
The pace of these read repairs is limited by the concurrency controls on the AF4 worker queue. Note that it is important to ensure that the recovering node does not participate_in_coverage when running this query - or a section of required repairs will be missed.
If the time ranges are not known, it is possible to do some investigative work into the likely time range and impacted bucket and key ranges by logging the repairs which AAE is prompting - by enabling the logging of read repairs via remote_console on the recovering node (see the sketch below). Each individual repair will then be logged with the Bucket, Key and mismatched clocks (which contain the timestamps of updates).
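A sketch of enabling that logging - the riak_kv environment variable name used here (log_readrepair) is an assumption, so check the read repair logging option available in your release:

```erlang
%% Run from `riak remote_console` on the recovering node only.
application:set_env(riak_kv, log_readrepair, true).
```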
The timestamps in the clocks are in gregorian seconds, which is 62167219200 seconds more than the unix epoch time.
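As a worked example using the standard OTP calendar module (the timestamp value is hypothetical):

```erlang
GregSecs = 63786757200,                            % value taken from a logged clock
calendar:gregorian_seconds_to_datetime(GregSecs),  % returns {{2021,4,27},{15,40,0}}
UnixSecs = GregSecs - 62167219200.                 % the equivalent unix epoch time
```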
Stage 7 - Re-repairing the Delta (Key Amnesia)
In theory, running Stage 6 should reduce the recovery time for the delta from the order of a day to the order of an hour. However, when recovering from a backup, running this repair will fix no more than two thirds of the delta - at least one third will remain un-repaired due to key amnesia.
This can be resolved simply by re-running the repair_keys_range query.
The reason why this occurs, and why a second run of the query is required, can be confusing. It is a result of a series of safety checks in Riak to prevent data loss in more complicated failure scenarios. Further details are available.
The vector clock for each object contains a counter of how many times each vnode has coordinated an update for that object. Every time the vnode reads an object before updating it, it will compare the vector clock on the stored object with the vector clock on the inbound object - and in particular the counter for this vnode ID. The counter for the stored object should be at least as large as the counter for the inbound object; otherwise the vnode backend has dropped data.
In this case, the vnode backend has dropped data! Any PUT coordinated on the node which failed, between the backup and the failure, will show key amnesia. This is logged:
When key amnesia occurs, the vector clock is changed on the inbound object to reflect a history of a change by this vnode at a new epoch, as well as the previous change. This does not change the object itself: the stored object will be the inbound (and in this case correct) object, with the correct value and the original last-modified date. Only the vector clock will change on the object - but this is a sufficient change to cause a new AAE mismatch.
Although the objects are now the same, as the vector clocks differ a second repair is required to bring the reported state to an equilibrium within the cluster. This second repair is necessary whether an AAE exchange or a repair_keys_range query prompted the repair.
In the test run, the graph below shows the slow repair of keys through normal AAE, and then step changes in repairs as the repair_keys_range fold is used to address both the delta, and the proportion of the delta re-formed due to key amnesia:
Once all keys are repaired, the pending_state in the exchange logs (EX003) will switch from clock_compare to root_compare:
At this stage it is safe to permit the node to participate_in_coverage.
Outstanding Issues
Some issues are still to be addressed, which may improve this process. Some of these issues may be resolved in the release of Riak 3.0.8: