Description
In an ownership change (i.e. join, leave, replace or location change), once a transfer commences, and assuming vnodes are sufficiently busy to prevent the `vnode_inactivity_timeout` from triggering, all transfers will be prompted by the vnode manager's scheduled tick (default every 10s).
If there are pending transfers, this takes the list of transfers from `riak_core_ring:pending_changes/1` and applies the `forced_ownership_handoff` threshold (default 8), examining only the first threshold items in the list.
The vnode manager looks for the existence of itself (as a sender) in that sublist, and if there is a valid potential transfer candidate matching, the transfer will then be attempted (and may or may not succeed, depending on the per-node `handoff_concurrency` transfer limit).
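The selection step described above can be sketched roughly as follows. This is a Python approximation of the Erlang behaviour, not the actual implementation; `transfers_to_attempt` and the tuple shape are hypothetical, while `pending_changes` and `forced_ownership_handoff` mirror the identifiers above:

```python
# Sketch (Python, illustrative only) of how the vnode manager's scheduled
# tick selects transfers. The real code is Erlang (riak_core_vnode_manager).

FORCED_OWNERSHIP_HANDOFF = 8  # default threshold

def transfers_to_attempt(pending_changes, this_node):
    """pending_changes: (partition, owner, next_owner) tuples,
    integer-ordered by partition, as from riak_core_ring:pending_changes/1."""
    # Only the first `forced_ownership_handoff` pending changes are examined.
    window = pending_changes[:FORCED_OWNERSHIP_HANDOFF]
    # A node only attempts transfers for which it is the sender (owner).
    return [(p, owner, nxt) for (p, owner, nxt) in window if owner == this_node]

pending = [(0, "n1", "n4"), (1, "n2", "n4"), (2, "n1", "n5"),
           (3, "n3", "n4"), (4, "n1", "n4"), (5, "n2", "n5"),
           (6, "n3", "n5"), (7, "n1", "n5"), (8, "n2", "n4")]

print(transfers_to_attempt(pending, "n1"))
# n1 sees only its own transfers within the first 8 entries; partition 8
# lies outside the window even if its sender (n2) is otherwise idle.
```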
There are aspects of this which are potentially confusing:
- The net effect of `forced_ownership_handoff` is to act as a cluster-wide limit on the number of concurrent transfers. However, this is not documented, nor referred to in the documentation of the `handoff_concurrency` transfer limit. So an operator following the configuration guidance may see unexpected behaviour; in particular, a much reduced level of concurrency on transfers. Note, though, that the expected concurrency would be achieved should the cluster be sufficiently quiet to trigger vnode inactivity (see Refactor riak_core vnode management (part 2) #120 (comment)).
- The pending transfers from `riak_core_ring:pending_changes(Ring)` are integer-ordered by Partition (i.e. position in the ring). It is not clear that this is a deliberate decision, i.e. that the combination of `forced_ownership_handoff` and partition-ordered ring changes should lead to transitions happening in partition order. There have been cases whereby this situation has led to nodes gaining a large number of partitions via transfer before they release any (leading to risks of disk-space growth). The use of location awareness makes it more likely that joins to address capacity issues may also require intra-cluster shuffling.
- When using controls such as `riak admin handoff disable inbound`, it is possible to block all progress across the ring if the first `forced_ownership_handoff` number of pending changes are all blocked.
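The last point can be illustrated with a small sketch (Python, hypothetical; `actionable` and the tuple layout are assumptions, not riak_core APIs): if every entry in the first `forced_ownership_handoff` pending changes targets a node whose inbound handoff is disabled, no sender anywhere in the cluster finds an actionable transfer.

```python
# Illustration: disabling inbound handoff on one node can stall the
# whole ring when all of the first N pending changes target that node.

FORCED_OWNERSHIP_HANDOFF = 8

def actionable(pending_changes, inbound_disabled):
    """Changes inside the forced_ownership_handoff window whose target
    node is still accepting inbound handoff."""
    window = pending_changes[:FORCED_OWNERSHIP_HANDOFF]
    return [c for c in window if c[2] not in inbound_disabled]

# The first 8 pending changes all happen to target the joining node n4.
pending = [(p, "n%d" % (p % 3 + 1), "n4") for p in range(8)]
pending += [(8, "n1", "n5"), (9, "n2", "n5")]  # viable, but outside the window

print(actionable(pending, inbound_disabled={"n4"}))  # -> []: all progress blocked
```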
As a minimum, there is a need to update the riak_core_schema to enable the configuration of `forced_ownership_handoff`, and also to clarify in the transfer limit configuration stanza that the limit may not be reached without also re-configuring `forced_ownership_handoff`.
It also may be worth considering whether using partition order in `riak_core_ring:pending_changes/1` is correct. Should the order instead be deterministically set so that nodes receiving transfers are evenly distributed through the list, so that they don't over-grow on an interim basis by first receiving and then offloading? However, there may be some data safety in the existing ring order: by keeping changes within the same preflist close together in the ordering, it reduces any window whereby that preflist may temporarily be in an unsafe state (i.e. if the partition in position M is moving to node X, and partition M + 1 is moving away from X, and these two moves happened at the beginning and end of the list of changes, the window of unsafety when `target_n_val` is not met would be unnecessarily extended).
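A toy sketch (Python, purely illustrative; `unsafety_window` is a hypothetical helper, not riak_core code) of the ordering argument: the window of unsafety for a preflist spans from the first to the last pending change touching its partitions, so adjacent-in-list changes keep that window short while scattered ones stretch it.

```python
# Toy model: measure how far apart two related changes sit in the
# transfer order. A larger span means a longer interim period in which
# the affected preflist may not meet target_n_val.

def unsafety_window(order, related):
    """Span, in list positions, between the first and last of `related`."""
    positions = [order.index(c) for c in related]
    return max(positions) - min(positions)

move_in = ("M", None, "X")     # partition M moves to node X
move_out = ("M+1", "X", None)  # partition M+1 moves away from X
others = [("p%d" % i, None, None) for i in range(6)]

ring_order = [move_in, move_out] + others   # adjacent, as ring order keeps them
scattered = [move_in] + others + [move_out] # at opposite ends of the list

print(unsafety_window(ring_order, [move_in, move_out]))  # -> 1 (short window)
print(unsafety_window(scattered, [move_in, move_out]))   # -> 7 (extended window)
```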
A possible re-ordering of the output of `riak_core_ring:pending_changes/1` from within `riak_core_vnode_manager:trigger_ownership_handoff/4` might be to place all pending changes for joining nodes at the top of the list, and leave the remainder ordered by ring position. This would minimise the risk of interim states during joins overloading disk space on existing nodes, by first maximising use of fresh capacity.