Description
In an ownership change (i.e. join, leave, replace or location change), once a transfer commences, and assuming vnodes are sufficiently busy to prevent the `vnode_inactivity_timeout` from triggering, all transfers will be prompted by the vnode manager's scheduled tick (default every 10s).
If there are pending transfers, this takes the list of transfers from `riak_core_ring:pending_changes/1` and applies the `forced_ownership_handoff` threshold (default 8), examining only the first threshold items in the list.
The vnode manager looks for the existence of itself (as a sender) in that sublist, and if there is a valid potential transfer candidate matching, the transfer will then be attempted (and may or may not succeed, depending on the per-node `handoff_concurrency` transfer limit).
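The selection step described above can be sketched roughly as follows. This is a Python approximation of the Erlang behaviour, not the actual implementation; `transfers_to_attempt` and the tuple shape are hypothetical, while `pending_changes` and `forced_ownership_handoff` mirror the identifiers above:

```python
# Sketch (Python, illustrative only) of how the vnode manager's scheduled
# tick selects transfers. The real code is Erlang (riak_core_vnode_manager).

FORCED_OWNERSHIP_HANDOFF = 8  # default threshold

def transfers_to_attempt(pending_changes, this_node):
    """pending_changes: (partition, owner, next_owner) tuples,
    integer-ordered by partition, as from riak_core_ring:pending_changes/1."""
    # Only the first `forced_ownership_handoff` pending changes are examined.
    window = pending_changes[:FORCED_OWNERSHIP_HANDOFF]
    # A node only attempts transfers for which it is the sender (owner).
    return [(p, owner, nxt) for (p, owner, nxt) in window if owner == this_node]

pending = [(0, "n1", "n4"), (1, "n2", "n4"), (2, "n1", "n5"),
           (3, "n3", "n4"), (4, "n1", "n4"), (5, "n2", "n5"),
           (6, "n3", "n5"), (7, "n1", "n5"), (8, "n2", "n4")]

print(transfers_to_attempt(pending, "n1"))
# n1 sees only its own transfers within the first 8 entries; partition 8
# lies outside the window even if its sender (n2) is otherwise idle.
```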
There are aspects of this which are potentially confusing:
- The net effect of `forced_ownership_handoff` is to act as a cluster-wide limit on the number of concurrent transfers. However, this is not documented, nor referred to in the documentation of the `handoff_concurrency` transfer limit. So an operator following the configuration guidance may see unexpected behaviour; in particular, a much reduced level of concurrency on transfers. Note, though, that the expected concurrency would be achieved should the cluster be sufficiently quiet to trigger vnode inactivity (see Refactor riak_core vnode management (part 2) #120 (comment)).
- The pending transfers from `riak_core_ring:pending_changes(Ring)` are integer-ordered by Partition (i.e. position in the ring). It is not clear that this is a deliberate decision, i.e. that the combination of `forced_ownership_handoff` and partition-ordered ring changes should lead to transitions happening in partition order. There have been cases whereby this situation has led to nodes gaining a large number of partitions via transfer before they release any (leading to risks of disk-space growth). The use of location awareness makes it more likely that joins to address capacity issues may also require intra-cluster shuffling.
- When using controls such as `riak admin handoff disable inbound`, it is possible to block all progress across the ring if the first `forced_ownership_handoff` number of pending changes are all blocked.
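The last point can be illustrated with a small sketch (Python, hypothetical; `actionable` and the tuple layout are assumptions, not riak_core APIs): if every entry in the first `forced_ownership_handoff` pending changes targets a node whose inbound handoff is disabled, no sender anywhere in the cluster finds an actionable transfer.

```python
# Illustration: disabling inbound handoff on one node can stall the
# whole ring when all of the first N pending changes target that node.

FORCED_OWNERSHIP_HANDOFF = 8

def actionable(pending_changes, inbound_disabled):
    """Changes inside the forced_ownership_handoff window whose target
    node is still accepting inbound handoff."""
    window = pending_changes[:FORCED_OWNERSHIP_HANDOFF]
    return [c for c in window if c[2] not in inbound_disabled]

# The first 8 pending changes all happen to target the joining node n4.
pending = [(p, "n%d" % (p % 3 + 1), "n4") for p in range(8)]
pending += [(8, "n1", "n5"), (9, "n2", "n5")]  # viable, but outside the window

print(actionable(pending, inbound_disabled={"n4"}))  # -> []: all progress blocked
```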
As a minimum, there is a need to update the riak_core_schema to enable the configuration of `forced_ownership_handoff`, and also to clarify in the transfer limit configuration stanza that the limit may not be reached without also re-configuring `forced_ownership_handoff`.
It also may be worth considering whether using partition order in `riak_core_ring:pending_changes/1` is correct. Should the order instead be deterministically set so that nodes receiving transfers are evenly distributed through the list, so that they don't over-grow on an interim basis by first receiving and then offloading? However, there may be some data safety in the existing ring order: by keeping changes within the same preflist close together in the ordering, it reduces any window whereby that preflist may temporarily be in an unsafe state (i.e. if the partition in position M is moving to node X, and partition M + 1 is moving away from X, and these two moves happened at the beginning and end of the list of changes, the window of unsafety when `target_n_val` is not met would be unnecessarily extended).
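A toy sketch (Python, purely illustrative; `unsafety_window` is a hypothetical helper, not riak_core code) of the ordering argument: the window of unsafety for a preflist spans from the first to the last pending change touching its partitions, so adjacent-in-list changes keep that window short while scattered ones stretch it.

```python
# Toy model: measure how far apart two related changes sit in the
# transfer order. A larger span means a longer interim period in which
# the affected preflist may not meet target_n_val.

def unsafety_window(order, related):
    """Span, in list positions, between the first and last of `related`."""
    positions = [order.index(c) for c in related]
    return max(positions) - min(positions)

move_in = ("M", None, "X")     # partition M moves to node X
move_out = ("M+1", "X", None)  # partition M+1 moves away from X
others = [("p%d" % i, None, None) for i in range(6)]

ring_order = [move_in, move_out] + others   # adjacent, as ring order keeps them
scattered = [move_in] + others + [move_out] # at opposite ends of the list

print(unsafety_window(ring_order, [move_in, move_out]))  # -> 1 (short window)
print(unsafety_window(scattered, [move_in, move_out]))   # -> 7 (extended window)
```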
A possible re-ordering of the output of `riak_core_ring:pending_changes/1` from within `riak_core_vnode_manager:trigger_ownership_handoff/4` might be to place all pending changes for joining nodes at the top of the list, and leave the remainder ordered by ring position. This would minimise the risk of interim states during joins overloading disk space on existing nodes, by first maximising use of fresh capacity.