Hinted handoffs do not happen when using multiple bitcask backends

(As posted on the [Riak Slack](https://postriak.slack.com/archives/C6R0LPH4N/p1618492738114900) with some modifications.)

I think I have found an interesting issue where hinted handoffs do not happen if you have a couple of bitcask backends. (Assuming default configuration for the related things below.)

Hinted handoffs are triggered when the `riak_core_vnode` (`gen_fsm` implementation) / `riak_kv_node` (the callback module we're interested of here) has been inactive for a period of 60 to 120 seconds **[1]**. It then sends an `inactive` event to the `riak_core_vnode_manager` which will trigger the handoffs **[2]**.

Thus, if the `riak_core_vnode`/`riak_kv_node` receives an event at least once per minute the hinted handoffs are never triggered.

At first I though this was related to AAE activity (we're not using Tictac AAE though) or our Telegraf plugin that extracts Riak stats every 10 seconds (after reading this issue: https://github.com/basho/riak_core/issues/715). However, I inspected a process that I expected to do a hinted handoff with `sys:log/2` and found something interesting.

With an interval of about ten seconds it received a `riak_vnode_req_v1` message with `{backend_callback, reference(), merge_check}` as the `request` **[3]**. That `merge_check` event is scheduled by `riak_kv_bitcask_backend` with an interval of 180 seconds +/- 30% **[4]**. And we currently have 15 backends (different backends for different TTL etc). So that means that an event like this is handled every 12 seconds on average and the `riak_core_vnode`/`riak_kv_node` never hits that inactivity timeout that's needed to trigger the handoffs.

This issue has been brought up before by @bipthelin but we did not know the reason back then (https://postriak.slack.com/archives/C6R0LPH4N/p1570609054000900). @martinsumner linked this issue (https://github.com/basho/riak_kv/issues/1706) as a good explanation of how handoffs are triggered but the issue itself was related to Tictac AAE (https://postriak.slack.com/archives/C6R0LPH4N/p1570618826012100).

We're currently running Riak 2.9.3 but it seems like my findings are relevant for the latest version as well.

Is there something that I'm missing here? As I see it, this is maybe not technically a bug but still an issue for what I guess is not an uncommon setup.

An easy workaround for us is to set `vnode_inactivity_timeout` to 5000 (5 to 10 seconds inactivity timeout) but that's to step around the idea of only doing handoffs when the vnode is not busy. Perhaps one should use the background manager (`riak_core_bg_manager`) for handoffs to achieve that kind of safety but I have not looked further into that. Anyway, the background manager is not used for handoffs by default.

---

I received a reply from @martinsumner containing the following:

>  I suspect your analysis is correct though, I don't think the consequences of having this many backends on things like the inactivity timeout have been considered. I would be worried about reducing the inactivity timeout, I think it would be better to change the config of the `bitcask_merge_check_interval`.  This should be changeable now in the `riak.conf` (although it currently hidden so that isn't obvious - https://github.com/basho/riak_kv/blob/develop-3.0/priv/riak_kv.schema#L749-L762)

He also asked about the reasons of having so many backends and if it's just for the TTL. Here's my reply to that:

> This is an old setup that stretches back a few years and I can't justify all the reasons but I can see that we have been using separate backends for specific buckets for full flexibility. Some examples of TTLs are 1 h, 3 h, 24 h and 4 w. We don't even use all of these ones today. After that we have added generic ones like `expiry_1h` and `expiry_24h`. Our default backend is LevelDB (we only have one there).
>
> If there would be an easy way to have either bucket or object specific TTL with a single backend there would be nothing stopping us from using that except some time and effort to migrate. I can understand the complexity of testing these multi backend combinations so what you're writing there seems like good news.

### References

#### [1] `riak_core_vnode` inactivity timeout

`riak_core_vnode` will send an `inactive` event to the `riak_core_vnode_manager` if it has not received any event for a period of 60 to 120 seconds.

Snippets from [riak_core_vnode.erl](https://github.com/basho/riak_core/blob/develop-3.0/src/riak_core_vnode.erl):

```erl
-define(DEFAULT_TIMEOUT, 60000).

Timeout = app_helper:get_env(riak_core, vnode_inactivity_timeout, ?DEFAULT_TIMEOUT),
Timeout2 = Timeout + rand:uniform(Timeout),
State#state{... inactivity_timeout=Timeout2 ...}

continue(State) ->
    {next_state, active, State, State#state.inactivity_timeout}.

active(timeout, State=#state{mod=Mod, index=Idx}) ->
    riak_core_vnode_manager:vnode_event(Mod, Idx, self(), inactive),
    continue(State);
```

#### [2] `riak_core_vnode_manager`

Snippets from [riak_core_vnode_manager.erl](https://github.com/basho/riak_core/blob/develop-3.0/src/riak_core_vnode_manager.erl):

```erl
handle_vnode_event(inactive, Mod, Idx, Pid, State) ->
    maybe_trigger_handoff(Mod, Idx, Pid, State),
    {noreply, State};
    
maybe_trigger_handoff(Mod, Idx, Pid, _State=#state{handoff=HO}) ->
    case dict:find({Mod, Idx}, HO) of
        {ok, {resize, _Direction, '$resize'}} ->
            ...
        {ok, {delete, local, '$delete'}} ->
            ...
        {ok, {_Type, _Direction, TargetNode}} ->
            riak_core_vnode:trigger_handoff(Pid, TargetNode),
            ok;
        error ->
            ok
    end.
```

#### [3] Using `sys:log/2` to see messages received by a vnode process

```erl
> riak_core_vnode_manager:all_vnodes().
[{riak_kv_vnode,867766597165223607683437869425293042920709947392,
                <0.1059.0>},
 {riak_kv_vnode,1301649895747835411525156804137939564381064921088,
                <0.1060.0>},
 {riak_kv_vnode,1278813932664540053428224228626747642198940975104,
                <0.6896.3>},
 {riak_pipe_vnode,867766597165223607683437869425293042920709947392,
                  <0.3588.0>},
 {riak_pipe_vnode,1301649895747835411525156804137939564381064921088,
                  <0.3597.0>}]
```

I expected `riak_kv_node` 1278813932664540053428224228626747642198940975104 to do its handoff and stop.

```erl
> Pid = pid(0, 6896, 3).
<0.6896.3>
> sys:log(Pid, true).
ok

% After 30 seconds or so

> sys:log(Pid, get).
{ok,[

     {{in,{timeout,#Ref<0.0.1345.165403>,
                   {'$gen_event',{riak_vnode_req_v1,undefined,ignore,
                                                    {backend_callback,#Ref<0.0.2.217209>,merge_check}}}}},
      {<0.6896.3>,active},
      #Fun<gen_fsm.0.131903270>},
     {return,{<0.6896.3>,active},#Fun<gen_fsm.2.131903270>},
     
     
     {{in,{timeout,#Ref<0.0.1363.129046>,
                   {'$gen_event',{riak_vnode_req_v1,undefined,ignore,
                                                    {backend_callback,#Ref<0.0.2.217575>,merge_check}}}}},
      {<0.6896.3>,active},
      #Fun<gen_fsm.0.131903270>},
     {return,{<0.6896.3>,active},#Fun<gen_fsm.2.131903270>},
     

     {{in,{timeout,#Ref<0.0.1370.78858>,
                   {'$gen_event',{riak_vnode_req_v1,undefined,ignore,
                                                    {backend_callback,#Ref<0.0.2.215724>,merge_check}}}}},
      {<0.6896.3>,active},
      #Fun<gen_fsm.0.131903270>},
     {return,{<0.6896.3>,active},#Fun<gen_fsm.2.131903270>}

    ]}

```

#### [4] Scheduling of the `merge_check` event

Snippets from [riak_kv_bitcask_backend.erl](https://github.com/basho/riak_kv/blob/develop-3.0/src/riak_kv_bitcask_backend.erl):

```erl
-define(MERGE_CHECK_INTERVAL, timer:minutes(3)).
-define(MERGE_CHECK_JITTER, 0.3).

schedule_merge(Ref) when is_reference(Ref) ->
    Interval = app_helper:get_env(riak_kv, bitcask_merge_check_interval,
                                  ?MERGE_CHECK_INTERVAL),
    JitterPerc = app_helper:get_env(riak_kv, bitcask_merge_check_jitter,
                                    ?MERGE_CHECK_JITTER),
    Jitter = Interval * JitterPerc,
    FinalInterval = Interval + trunc(2 * rand:uniform() * Jitter - Jitter),
    lager:debug("Scheduling Bitcask merge check in ~pms", [FinalInterval]),
    riak_kv_backend:callback_after(FinalInterval, Ref, merge_check).
```

Snippet from [riak_kv_backend.erl](https://github.com/basho/riak_kv/blob/develop-3.0/src/riak_kv_backend.erl):

```erl
%% Queue a callback for the backend after Time ms.
-spec callback_after(integer(), reference(), term()) -> reference().
callback_after(Time, Ref, Msg) when is_integer(Time), is_reference(Ref) ->
    riak_core_vnode:send_command_after(Time, {backend_callback, Ref, Msg}).
```

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Hinted handoffs do not happen when using multiple bitcask backends #1789

References

[1] `riak_core_vnode` inactivity timeout

[2] `riak_core_vnode_manager`

[3] Using `sys:log/2` to see messages received by a vnode process

[4] Scheduling of the `merge_check` event

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Hinted handoffs do not happen when using multiple bitcask backends #1789

Description

References

[1] riak_core_vnode inactivity timeout

[2] riak_core_vnode_manager

[3] Using sys:log/2 to see messages received by a vnode process

[4] Scheduling of the merge_check event

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions

[1] `riak_core_vnode` inactivity timeout

[2] `riak_core_vnode_manager`

[3] Using `sys:log/2` to see messages received by a vnode process

[4] Scheduling of the `merge_check` event