Skip to content

Hinted handoffs do not happen when using multiple bitcask backends #1789

Open
@zigge-zagge

Description

@zigge-zagge

(As posted on the Riak Slack with some modifications.)

I think I have found an interesting issue where hinted handoffs do not happen if you have a couple of bitcask backends. (Assuming default configuration for the related things below.)

Hinted handoffs are triggered when the riak_core_vnode (gen_fsm implementation) / riak_kv_node (the callback module we're interested of here) has been inactive for a period of 60 to 120 seconds [1]. It then sends an inactive event to the riak_core_vnode_manager which will trigger the handoffs [2].

Thus, if the riak_core_vnode/riak_kv_node receives an event at least once per minute the hinted handoffs are never triggered.

At first I though this was related to AAE activity (we're not using Tictac AAE though) or our Telegraf plugin that extracts Riak stats every 10 seconds (after reading this issue: basho/riak_core#715). However, I inspected a process that I expected to do a hinted handoff with sys:log/2 and found something interesting.

With an interval of about ten seconds it received a riak_vnode_req_v1 message with {backend_callback, reference(), merge_check} as the request [3]. That merge_check event is scheduled by riak_kv_bitcask_backend with an interval of 180 seconds +/- 30% [4]. And we currently have 15 backends (different backends for different TTL etc). So that means that an event like this is handled every 12 seconds on average and the riak_core_vnode/riak_kv_node never hits that inactivity timeout that's needed to trigger the handoffs.

This issue has been brought up before by @bipthelin but we did not know the reason back then (https://postriak.slack.com/archives/C6R0LPH4N/p1570609054000900). @martinsumner linked this issue (#1706) as a good explanation of how handoffs are triggered but the issue itself was related to Tictac AAE (https://postriak.slack.com/archives/C6R0LPH4N/p1570618826012100).

We're currently running Riak 2.9.3 but it seems like my findings are relevant for the latest version as well.

Is there something that I'm missing here? As I see it, this is maybe not technically a bug but still an issue for what I guess is not an uncommon setup.

An easy workaround for us is to set vnode_inactivity_timeout to 5000 (5 to 10 seconds inactivity timeout) but that's to step around the idea of only doing handoffs when the vnode is not busy. Perhaps one should use the background manager (riak_core_bg_manager) for handoffs to achieve that kind of safety but I have not looked further into that. Anyway, the background manager is not used for handoffs by default.


I received a reply from @martinsumner containing the following:

I suspect your analysis is correct though, I don't think the consequences of having this many backends on things like the inactivity timeout have been considered. I would be worried about reducing the inactivity timeout, I think it would be better to change the config of the bitcask_merge_check_interval. This should be changeable now in the riak.conf (although it currently hidden so that isn't obvious - https://github.com/basho/riak_kv/blob/develop-3.0/priv/riak_kv.schema#L749-L762)

He also asked about the reasons of having so many backends and if it's just for the TTL. Here's my reply to that:

This is an old setup that stretches back a few years and I can't justify all the reasons but I can see that we have been using separate backends for specific buckets for full flexibility. Some examples of TTLs are 1 h, 3 h, 24 h and 4 w. We don't even use all of these ones today. After that we have added generic ones like expiry_1h and expiry_24h. Our default backend is LevelDB (we only have one there).

If there would be an easy way to have either bucket or object specific TTL with a single backend there would be nothing stopping us from using that except some time and effort to migrate. I can understand the complexity of testing these multi backend combinations so what you're writing there seems like good news.

References

[1] riak_core_vnode inactivity timeout

riak_core_vnode will send an inactive event to the riak_core_vnode_manager if it has not received any event for a period of 60 to 120 seconds.

Snippets from riak_core_vnode.erl:

-define(DEFAULT_TIMEOUT, 60000).

Timeout = app_helper:get_env(riak_core, vnode_inactivity_timeout, ?DEFAULT_TIMEOUT),
Timeout2 = Timeout + rand:uniform(Timeout),
State#state{... inactivity_timeout=Timeout2 ...}

continue(State) ->
    {next_state, active, State, State#state.inactivity_timeout}.

active(timeout, State=#state{mod=Mod, index=Idx}) ->
    riak_core_vnode_manager:vnode_event(Mod, Idx, self(), inactive),
    continue(State);

[2] riak_core_vnode_manager

Snippets from riak_core_vnode_manager.erl:

handle_vnode_event(inactive, Mod, Idx, Pid, State) ->
    maybe_trigger_handoff(Mod, Idx, Pid, State),
    {noreply, State};
    
maybe_trigger_handoff(Mod, Idx, Pid, _State=#state{handoff=HO}) ->
    case dict:find({Mod, Idx}, HO) of
        {ok, {resize, _Direction, '$resize'}} ->
            ...
        {ok, {delete, local, '$delete'}} ->
            ...
        {ok, {_Type, _Direction, TargetNode}} ->
            riak_core_vnode:trigger_handoff(Pid, TargetNode),
            ok;
        error ->
            ok
    end.

[3] Using sys:log/2 to see messages received by a vnode process

> riak_core_vnode_manager:all_vnodes().
[{riak_kv_vnode,867766597165223607683437869425293042920709947392,
                <0.1059.0>},
 {riak_kv_vnode,1301649895747835411525156804137939564381064921088,
                <0.1060.0>},
 {riak_kv_vnode,1278813932664540053428224228626747642198940975104,
                <0.6896.3>},
 {riak_pipe_vnode,867766597165223607683437869425293042920709947392,
                  <0.3588.0>},
 {riak_pipe_vnode,1301649895747835411525156804137939564381064921088,
                  <0.3597.0>}]

I expected riak_kv_node 1278813932664540053428224228626747642198940975104 to do its handoff and stop.

> Pid = pid(0, 6896, 3).
<0.6896.3>
> sys:log(Pid, true).
ok

% After 30 seconds or so

> sys:log(Pid, get).
{ok,[

     {{in,{timeout,#Ref<0.0.1345.165403>,
                   {'$gen_event',{riak_vnode_req_v1,undefined,ignore,
                                                    {backend_callback,#Ref<0.0.2.217209>,merge_check}}}}},
      {<0.6896.3>,active},
      #Fun<gen_fsm.0.131903270>},
     {return,{<0.6896.3>,active},#Fun<gen_fsm.2.131903270>},
     
     
     {{in,{timeout,#Ref<0.0.1363.129046>,
                   {'$gen_event',{riak_vnode_req_v1,undefined,ignore,
                                                    {backend_callback,#Ref<0.0.2.217575>,merge_check}}}}},
      {<0.6896.3>,active},
      #Fun<gen_fsm.0.131903270>},
     {return,{<0.6896.3>,active},#Fun<gen_fsm.2.131903270>},
     

     {{in,{timeout,#Ref<0.0.1370.78858>,
                   {'$gen_event',{riak_vnode_req_v1,undefined,ignore,
                                                    {backend_callback,#Ref<0.0.2.215724>,merge_check}}}}},
      {<0.6896.3>,active},
      #Fun<gen_fsm.0.131903270>},
     {return,{<0.6896.3>,active},#Fun<gen_fsm.2.131903270>}

    ]}

[4] Scheduling of the merge_check event

Snippets from riak_kv_bitcask_backend.erl:

-define(MERGE_CHECK_INTERVAL, timer:minutes(3)).
-define(MERGE_CHECK_JITTER, 0.3).

schedule_merge(Ref) when is_reference(Ref) ->
    Interval = app_helper:get_env(riak_kv, bitcask_merge_check_interval,
                                  ?MERGE_CHECK_INTERVAL),
    JitterPerc = app_helper:get_env(riak_kv, bitcask_merge_check_jitter,
                                    ?MERGE_CHECK_JITTER),
    Jitter = Interval * JitterPerc,
    FinalInterval = Interval + trunc(2 * rand:uniform() * Jitter - Jitter),
    lager:debug("Scheduling Bitcask merge check in ~pms", [FinalInterval]),
    riak_kv_backend:callback_after(FinalInterval, Ref, merge_check).

Snippet from riak_kv_backend.erl:

%% Queue a callback for the backend after Time ms.
-spec callback_after(integer(), reference(), term()) -> reference().
callback_after(Time, Ref, Msg) when is_integer(Time), is_reference(Ref) ->
    riak_core_vnode:send_command_after(Time, {backend_callback, Ref, Msg}).

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions