Add Couch Stats Resource Tracker (CSRT) #5491

Open: wants to merge 10 commits into main

Conversation

@chewbranca (Contributor) commented Mar 25, 2025

Couch Stats Resource Tracker (CSRT)

Couch Stats Resource Tracker (CSRT) is a framework for tracking the metrics
induced in couch_stats at the process level to understand what requests and
operations are utilizing the underlying system resources. The metrics in
couch_stats are currently tracked at the node level, allowing you to see the
quantitative aspects of node resource consumption, but they lack the qualitative
information to understand what induced those workloads. CSRT rectifies this by
way of process local real time metrics collection exposed in a queryable
interface, and also by way of generating process lifetime reports documenting
the total quantity and time of work induced by a given request.

This PR takes a different approach from mango_stats and other facilities that
look to embed additional metrics that are not present in couch_stats; rather,
the approach here inverts that notion with the core idea that if it's worth
tracking the consumption of a specific resource, then that resource is worth
tracking by way of dedicated couch_stats metrics. So when the need for
additional data collection arises, we should add those as node level stats in
couch_stats and then track them at the process level by way of CSRT.

This is a rework of PR: #4812

This is a singular PR composed of a few core components:

  1. the underlying real time stats collection framework
  2. the mechanisms for tracking the resource operations
  3. the querying and logging facilities

The distinction is drawn between these components to highlight that the core
system here is the underlying process level real time stats collection framework
that allows tracking, querying, and logging of resource usage induced by
requests to and operations within CouchDB. The current set of requests and
operations tracked, along with the mechanisms for querying and logging, is a
solid start but by no means complete, and the idea is to build upon the core
framework of 1) to introspect resource usage such that a user of CouchDB can see
what requests and operations make up their system usage.

Understanding what requests utilize the given resources is incredibly important
for aggregate operations that consume far more resources than they manifest back
in the client response, as we currently lack visibility into these workload
discrepancies. For instance, if you have a _find query that falls back to the
all docs index and is a query for which no rows will be found, that singular
http request will induce a full database scan of all documents, painful enough
on its own but magnified greatly when induced in parallel. These types of
requests usually manifest at the client level as timeouts, and within the
database as heavy IO reader processes that can create a thundering herd against
database shards. Filtered changes feeds are similar, but even worse in that they
funnel the full database scan through the JavaScript engine pipeline. At least
with views the workload induced on the system is roughly similar to what's
returned to the client, which provides at least some mechanism for the user to
understand what's going on; e.g. if you query 70 million rows from a view in a
singular http request, that's something you'll be able to see and realize is
obviously problematic. CSRT resolves these issues by logging heavy duty requests
as reports, allowing for post facto analysis of what the database was doing over
a given time period, and also by providing a real time querying system for
finding out what the hot requests are right now.

1) the underlying real time stats collection framework

The previous PR details a few different approaches that failed to scale
effectively during my testing. Instead, the approach in this PR builds upon the
significant improvements made to ETS in Erlang around atomic updates and
decentralized counters, combined with an approach that performs no concurrent
writes to the same key, as all tracked processes directly update the central
stats table by way of ets:update_counter, which performs atomic and isolated
updates. This approach, combined with read_concurrency and write_concurrency,
allows for a highly concurrent data collection mechanism that still permits real
time querying. It's easy enough to track the process local stats and generate a
report at the end of the process lifetime, but given that workloads induced from
a request can potentially last for hours, it's critical to have a queryable real
time mechanism that allows for introspection of these long running tasks,
especially if we want to incorporate further information from long running
background jobs like indexing, replication, and compaction.
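
To make that concrete, here is a minimal sketch (not the PR's exact code) of the
table setup and the per-process atomic updates described above; the table name,
record fields, and helper names are assumptions based on this description:

%% Sketch only: a public named table keyed on #rctx.pid_ref, tuned for
%% concurrent access, where each tracked process bumps only its own row.
-record(rctx, {pid_ref, docs_read = 0, rows_read = 0, ioq_calls = 0}).

create_table() ->
    ets:new(csrt_server, [
        named_table,
        public,
        {keypos, #rctx.pid_ref},
        {read_concurrency, true},
        {write_concurrency, true},
        {decentralized_counters, true}
    ]).

%% ets:update_counter/4 is atomic and isolated, and inserts the default
%% #rctx{} row if this process has no entry yet. Because each process only
%% ever updates its own key, there are no concurrent writes to the same key.
%% Usage: bump(PidRef, #rctx.docs_read, 1).
bump(PidRef, FieldPos, N) ->
    ets:update_counter(csrt_server, PidRef, {FieldPos, N}, #rctx{pid_ref = PidRef}).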

2) the mechanisms for tracking the resource operations

CSRT hooks into couch_stats:increment_counter from within the caller process,
and in the event the metric counter being incremented is one tracked by CSRT, we
then update the ETS entry of the caller process for the given stat. This greatly
simplifies the approach to stats collection, avoids having a centralized set of
processes gathering and publishing stats, and avoids concurrent writes to the
same key given each process tracks itself.
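
Roughly, that hook amounts to something like the following sketch (the
metric-to-field mapping and the get_pid_ref/0 helper are illustrative
assumptions, reusing the #rctx{} record and bump/3 from the sketch above,
not the PR's exact API):

%% Called from within couch_stats:increment_counter in the caller's process;
%% only metrics CSRT cares about are mirrored into the caller's own ETS row.
maybe_track(Stat, Val) ->
    case maps:get(Stat, stats_to_fields(), undefined) of
        undefined ->
            ok;
        FieldPos ->
            _ = bump(get_pid_ref(), FieldPos, Val),
            ok
    end.

%% Illustrative mapping from couch_stats metric names to #rctx{} fields.
stats_to_fields() ->
    #{[couchdb, database_reads] => #rctx.docs_read}.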

As mentioned above, the key of this PR is the core framework for stats
collection and funneling of data between nodes, with a limited subset of
operations being tracked by way of CSRT. Currently we track coordinator
processes as the http request flows into chttpd, and then we track RPC workers
by way of rexi_server:init_p. As the RPC worker processes send messages back by
way of rexi, they also embed a snapshot of the current workload delta since the
last rexi message sent, allowing the coordinator process to accumulate and
reflect the full workload induced, which is then queryable. This mechanism
allows rexi:ping to keep deltas flowing, so even when RPC workers are not
sending data back to the client, they will funnel along their CSRT stats so we
can easily find the heavy processes. This PR intentionally keeps the focus on
the coordinator processes and rexi_server based RPC workers, but the
expectation is that this tracking system can be utilized for background job
workers and the other RPC subsystems like those in dreyfus and mem3.
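
As a rough illustration of that delta flow (function names are illustrative,
not the PR's API, and bump/3 is the helper from the earlier sketch): the worker
computes the difference between its current counters and the last snapshot it
shipped, piggybacks that delta on the rexi message, and the coordinator folds
the delta into its own row:

%% Worker side: delta of the countable fields (keyed by #rctx field position)
%% since the last snapshot sent along with a rexi message.
make_delta(LastSnapshot, Current) ->
    maps:map(fun(Field, N) -> N - maps:get(Field, LastSnapshot, 0) end, Current).

%% Coordinator side: fold the worker's delta into the coordinator's own
%% #rctx{} row so the full induced workload is queryable while it runs.
accumulate_delta(CoordPidRef, Delta) ->
    maps:foreach(
        fun(Field, N) when N > 0 -> bump(CoordPidRef, Field, N);
           (_Field, _N) -> ok
        end,
        Delta
    ).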

We specifically track RPC workers spawned by way of rexi_server:init_p, but
we're less specific about the underlying RPC operation called; rather, we
track whether the RPC process induces metric increments for one of the metrics
CSRT is interested in tracking. This goes back to the philosophy of CSRT, which
is to track the important things we already track in couch_stats, so rather
than specifically counting the number of documents opened in an _all_docs
request, we instead track the stat [couchdb, database_reads] for all RPC
workers spawned by rexi_server:init_p, and then we can find out what request
actually induced the workload by way of the generated log report.

3) the querying and logging facilities

There's a rudimentary http based querying system that allows for counting,
grouping, and sorting on different fields of the real time values of the running
processes. There's also a process lifetime logging facility that allows for
customizable filtering of what process lifetimes to log. Both of these are
capable but rough and could use help.

The CSRT logger utilizes ets:fun2ms to create powerful and performant
filtering functions to determine whether to log the process lifetime report of a
process once it has concluded its operations. These filter match specs are
compiled and saved as a persistent term to be readily available to all processes
in the system, which allows tracking processes to monitor the lifetime of
coordinator or worker processes and, upon their exit, load the precompiled match
spec and locally determine whether or not to generate a log. This distributed
approach avoids centralized process trackers, a major overload failure mode of
previous experiments, while utilizing the significant performance gains of
persistent_term. It even avoids incurring heavy copying of values into the
local caller process, as the compiled match specs returned out of persistent
term are actually just refs to an internal representation!
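
For example, here's roughly how a compiled matcher can be published through
persistent_term and cheaply consulted by any tracker process (the key and
function names here are illustrative, not the PR's exact API):

%% Compile once and publish; persistent_term:get/1 hands back a reference to
%% the compiled match spec rather than copying it onto the caller's heap.
register_matcher_sketch(Name, MatchSpec) ->
    CompiledSpec = ets:match_spec_compile(MatchSpec),
    persistent_term:put({csrt_matcher, Name}, CompiledSpec).

%% In a tracker, when the monitored process exits: decide locally whether the
%% finished #rctx{} warrants a process lifetime report.
should_log_sketch(Name, Rctx) ->
    CompiledSpec = persistent_term:get({csrt_matcher, Name}),
    ets:match_spec_run([Rctx], CompiledSpec) =/= [].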

Currently there are a handful of builtin matchers that are configurable by way
of the ini files: for example, filters on requests that perform more than X
docs read, or IOQ calls, or docs written. There's also a direct matcher on
dbname, and a more specific dbname IO matcher that matches when a request to the
provided dbname does more than X ops against any of the core IO metrics.
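
For instance, the dbname IO matcher can be configured like this (shown here via
remsh, mirroring the test setup quoted further down in this PR; the threshold
value is illustrative):

%% Set a per-dbname IO threshold and reload the compiled matchers.
ok = config:set("csrt_logger.dbnames_io", "foo/bar", "1000", false),
csrt_logger:reload_matchers().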

These are some baseline matchers that I thought would be useful, but it would be
great to get ideas from more folks on what they would find useful. These
matchers are easy to add as functions or dynamically by way of remsh, but I
haven't come up with a great way to declare the more complicated config chains
in ini files. I'm hoping other folks might have some ideas on this front; for
example, the following matcher is easy to write and dynamically register to
perform filtering, but I couldn't come up with a great approach for specifying
these types of complex queries in an ini file:

ets:fun2ms(fun(#rctx{dbname = <<"foo/bar">>, ioq_calls = IOQ, docs_read = Docs,
               rows_read = Rows, type = #worker{}} = R)
    when (IOQ > 1234) or (Docs > 203) or (Rows > 500) ->
        R
end).

The csrt_logger server will allow you to dynamically register match specs like
that to then generate process lifetime reports for processes that exit and whose
workloads matched those filter thresholds. This same matcher can also be run
directly against the ETS table to allow for sophisticated real time filtering by
way of match specs. Or you can even run the matcher against a list of #rctx{}
values, which is basically what csrt_logger:tracker does.
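
For instance, running a matcher like that directly against the live table for
real time filtering looks roughly like this (the csrt_server table name is an
assumption here):

%% In a module, ets:fun2ms/1 requires -include_lib("stdlib/include/ms_transform.hrl").
MatchSpec = ets:fun2ms(fun(#rctx{dbname = <<"foo/bar">>, ioq_calls = IOQ} = R)
    when IOQ > 1234 -> R
end),
HotRightNow = ets:select(csrt_server, MatchSpec).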

I'm quite excited about the dynamic logging and introspection capabilities
afforded by the match spec filtering mechanisms, but I am scratching my head a
bit coming up with a good approach for the combinatorics of different filter
pairings of the various fields. I've added a handful of builtin matchers in this
PR, and I think we can provide a useful set of "slow query" type basic filters
from the get go. Hopefully someone can come up with an expressive way of
chaining the matcher specs, but that might be challenging given that fun2ms is
doing parse transforms and whatnot. That said, it's entirely possible to write
some fairly verbose matchspec funs for the more complex matchers we want; I
don't think that's a problem for useful filters we expect to stick around.

TODO

Add some concluding thoughts, a few examples of output, and some requested
review discussion points.

end.

-spec tracker(PidRef :: pid_ref()) -> ok.
tracker({Pid, _Ref}=PidRef) ->

Nitpick.

It would be best to minimize the number of places where we assume a specific structure, since leaking implementation details and relying on them makes the code harder to update. How about introducing a helper in csrt_util or csrt_server?

tracker(PidRef) ->
   MonRef = csrt_server:monitor_pid_ref(PidRef),
   ...


I am thinking maybe PidRef is not the best name, since it implies a specific implementation. Maybe rename all instances of PidRef to ResourceId (this name makes sense since we have csrt_server:get_resource(PidRef)) and do not rely on the structure.

{noreply, State, 0}.

handle_info(restart_config_listener, State) ->
ok = config:listen_for_changes(?MODULE, nil),

You have a helper fun for that: subscribe_changes().

ok = subscribe_changes(),
{ok, #st{}}.

handle_call({register, Name, MSpec}, _From, #st{matchers=Matchers}=St) ->

I don't see where we are sending messages to this clause. If you intend to use it outside of this module, add the corresponding access function. I mean something like:

-module(csrt_logger).

-export([
     register_matcher/2,
 ]).

register_matcher(Name, MSpec) -> 
   gen_server:call(?MODULE, {register, Name, MSpec}, infinity). 


After thinking about this clause a little more, I don't think this approach would work, because we are not using the same source of truth. When we go via initialize_matchers() we put them into a persistent term; however, we do not update the #st.matchers used in this clause, which means they would go out of sync.

%% csrt_server:create_resource(Rctx).

-spec create_worker_context(From, MFA, Nonce) -> pid_ref() | false when
From :: pid_ref(), MFA :: mfa(), Nonce :: term().

Nonce :: nonce() since it is defined in couch_stats_resource_tracker.hrl and we include it in this module.

set_context_type(Type, PidRef) ->
update_element(PidRef, [{#rctx.type, Type}]).

-spec create_resource(Rctx :: rctx()) -> true.
@iilyak commented Apr 23, 2025

The spec doesn't account for catch. The easiest fix would be

-spec create_resource(Rctx :: rctx()) -> boolean().

create_resource(#rctx{} = Rctx) ->
    (catch ets:insert(?MODULE, Rctx)) == true.

-spec destroy_resource(PidRef :: maybe_pid_ref()) -> boolean().
destroy_resource(undefined) ->
false;
destroy_resource({_,_}=PidRef) ->

The spec doesn't account for catch.

get_resource(undefined) ->
undefined;
get_resource(PidRef) ->
catch case ets:lookup(?MODULE, PidRef) of

The spec doesn't account for catch.


The try/catch would be better here, since it would:

  1. handle the catch case to conform to the type spec
  2. be more efficient, since the BEAM wouldn't be calling get_stacktrace() (see here)

Always prefer try/catch if you anticipate that the exception will happen often. I think this is the case for the get_resource function.
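
A minimal sketch of that suggestion (assuming the table is csrt_server's ?MODULE table and that undefined is the desired fallback when the lookup cannot be performed):

get_resource(undefined) ->
    undefined;
get_resource(PidRef) ->
    try ets:lookup(?MODULE, PidRef) of
        [#rctx{} = Rctx] -> Rctx;
        [] -> undefined
    catch
        %% e.g. the table is missing because csrt_server is restarting
        error:badarg -> undefined
    end.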

is_rctx_field(Field) ->
maps:is_key(Field, ?KEYS_TO_FIELDS).

-spec get_rctx_field(Field :: rctx_field()) -> non_neg_integer().

This function might throw a badkey error, so technically the type spec is wrong. However, I do understand that we use macro definitions for field names, so the chance of a typo is quite low. If you decide to update the spec, here is what I suggest.

%% maybe define this type in the couch_stats_resource_tracker.hrl file
-type throw(_Reason) :: no_return().

-spec get_rctx_field(Field :: rctx_field()) -> non_neg_integer() 
    | throw({badkey, Key :: any()}).

case is_rctx_field(Field) of
true ->
Update = {get_rctx_field(Field), Count},
catch ets:update_counter(?MODULE, PidRef, Update, #rctx{pid_ref=PidRef});

Here we cannot guarantee that the return value is always correct, because ets:update_counter might fail. I did check all callers of update_counter (inc, docs_written, ioq_called, js_filtered) and in all instances we are not using the result. Maybe we should change the return type to the ok atom.

0
end.

-spec inc(PidRef :: maybe_pid_ref(), Field :: rctx_field()) -> non_neg_integer().

Consider changing the result type to ok.

inc(PidRef, Field) ->
inc(PidRef, Field, 1).

-spec inc(PidRef, Field, N) -> non_neg_integer() when

Consider changing the result type to ok.

update_element(undefined, _Update) ->
false;
update_element({_Pid,_Ref}=PidRef, Update) ->
%% TODO: should we take any action when the update fails?
@iilyak commented Apr 23, 2025

This is interesting. The failure would happen when csrt_server is restarted and not yet started. In this case the ets table would be missing, so we wouldn't be able to retrieve the information (context type and db name would be missing in rctx, most importantly context type) later on. This will cause the following problems:

%%[R || #rctx{} = R <- ets:match_object(?MODULE, #rctx{pid_ref={Pid, '_'}, _ = '_'})].
[R || R <- ets:match_object(?MODULE, #rctx{pid_ref = {Pid, '_'}})].

find_by_pidref(PidRef) ->

If PidRef were renamed, this would need to be renamed as well.

%%[R || R <- ets:match_object(?MODULE, #rctx{pid_ref=PidRef, _ = '_'})].
[R || R <- ets:match_object(?MODULE, #rctx{pid_ref = PidRef})].

find_workers_by_pidref(PidRef) ->

If PidRef were renamed, this would need to be renamed as well.

convert_type(undefined) ->
null.

-spec convert_pidref(PidRef) -> binary() | null when

If PidRef were renamed, this would need to be renamed as well.

false;
update_handler_fun(Mod, Func, PidRef) ->
Rctx = get_resource(PidRef),
%% TODO: #coordinator{} assumption needs to adapt for other types
@iilyak commented Apr 23, 2025

We would have to handle undefined here (assuming csrt_server:get_resource/1 is fixed to return undefined instead of {'EXIT', badarg} when the ets table is missing).

#coordinator{} = Coordinator0 = csrt_server:get_context_type(Rctx),
Coordinator = Coordinator0#coordinator{mod = Mod, func = Func},
csrt_server:set_context_type(Coordinator, PidRef),
ok.

The type spec of update_handler_fun declares the return type as boolean(); please change ok to true or update the spec.

-spec set_context_username(UserName, PidRef) -> boolean() when
UserName :: username(), PidRef :: maybe_pid_ref().
set_context_username(_, undefined) ->
ok;

The return type is defined as boolean().

%% Stat collection API
%%

-spec inc(Key :: rctx_field()) -> non_neg_integer().

Consider changing the return type to ok.



-spec maybe_inc(Stat :: atom(), Val :: non_neg_integer()) -> non_neg_integer().
maybe_inc(Stat, Val) ->

Consider changing the return type to ok


-spec maybe_inc(Stat :: atom(), Val :: non_neg_integer()) -> non_neg_integer().
maybe_inc(Stat, Val) ->
case maps:is_key(Stat, ?STATS_TO_KEYS) of

Here we are traversing the map twice. How about:

case maps:get(Stat, ?STATS_TO_KEYS, not_rctx_field) of
    not_rctx_field ->
        0;
    Key ->
        inc(Key, Val)
end

Key = fabric_conf_key(Func),
ok = config:set_boolean(?CSRT_INIT_P, Key, Enabled, Persist).

-spec fabric_conf_key(Key :: atom()) -> string().

The fabric_conf_key is going to be on the hot path. Maybe we should optimize it. Consider adding -compile({inline, [{fabric_conf_key, 1}]}).

-spec fabric_conf_key(Key :: atom()) -> string().
fabric_conf_key(Key) ->
%% Double underscore to separate Mod and Func
"fabric_rpc__" ++ atom_to_list(Key).

That would be a great place to optimize in the future. Maybe add a TODO. One strategy is to implement the set_fabric_conf_key and call it from set_fabric_init_p. The implementation would store the mapping of keys to "fabric_rpc__" in persistent term.

ok = config:set(
"csrt_logger.dbnames_io", "foo/bar", integer_to_list(?THRESHOLD_DBNAME_IO), false
),
csrt_logger:reload_matchers(),

👍

update_element(undefined, _Update) ->
false;
update_element({_Pid, _Ref} = PidRef, Update) ->
%% TODO: should we take any action when the update fails?

At least convert {'EXIT', badarg} to false to conform to the spec (or update the spec).

]).

%%
%% PidRef operations

If PidRef were renamed, all functions in this section would need to be renamed as well.

find_by_pid(Pid) ->
csrt_query:find_by_pid(Pid).

find_by_pidref(PidRef) ->

If PidRef were renamed, this would need to be renamed as well.


find_by_pid(Pid) ->
%%[R || #rctx{} = R <- ets:match_object(?MODULE, #rctx{pid_ref={Pid, '_'}, _ = '_'})].
[R || R <- ets:match_object(?MODULE, #rctx{pid_ref = {Pid, '_'}})].

This is a second place where we use the internal representation. One way to improve it would be:

%% couch_stats_resource_tracker.hrl
-define(resource_id_pid(Pid), {Pid, '_'}).

%% csrt_query.erl
[R || R <- ets:match_object(?MODULE, #rctx{pid_ref = ?resource_id_pid(Pid)})].


If I understand the code correctly, there is a mistake here. I couldn't find an ets table named csrt_query. Most likely this should be csrt_server instead of ?MODULE. If I am right, then you can just move this matcher implementation into csrt_server.erl and have a dispatch here.

%% csrt_query.erl
find_by_pid(Pid) ->
     csrt_server:find_by_pid(Pid).

%% csrt_server.erl
find_by_pid(Pid) ->
    [R || R <- ets:match_object(?MODULE, #rctx{pid_ref = {Pid, '_'}})].

select_by_type(coordinators) ->
ets:select(?MODULE, ets:fun2ms(fun(#rctx{type = #coordinator{}} = R) -> R end));
select_by_type(workers) ->
ets:select(?MODULE, ets:fun2ms(fun(#rctx{type = #rpc_worker{}} = R) -> R end));

Maybe csrt_server instead of ?MODULE.

select_by_type(workers) ->
ets:select(?MODULE, ets:fun2ms(fun(#rctx{type = #rpc_worker{}} = R) -> R end));
select_by_type(all) ->
ets:tab2list(?MODULE).

Maybe csrt_server instead of ?MODULE.


find_by_nonce(Nonce) ->
%%ets:match_object(?MODULE, ets:fun2ms(fun(#rctx{nonce = Nonce1} = R) when Nonce =:= Nonce1 -> R end)).
[R || R <- ets:match_object(?MODULE, #rctx{nonce = Nonce})].

Maybe csrt_server instead of ?MODULE.


Maybe add a helper to csrt_server.erl and call it from this module.

% csrt_server.erl
% the name could be match_context ???
match_resource(#rctx{} = Rctx) ->
    ets:match_object(?MODULE, Rctx).

% csrt_query.erl
find_by_nonce(Nonce) ->
    csrt_server:match_resource(#rctx{nonce = Nonce}).


find_by_pid(Pid) ->
%%[R || #rctx{} = R <- ets:match_object(?MODULE, #rctx{pid_ref={Pid, '_'}, _ = '_'})].
[R || R <- ets:match_object(?MODULE, #rctx{pid_ref = {Pid, '_'}})].

Maybe csrt_server instead of ?MODULE.


find_by_pidref(PidRef) ->
%%[R || R <- ets:match_object(?MODULE, #rctx{pid_ref=PidRef, _ = '_'})].
[R || R <- ets:match_object(?MODULE, #rctx{pid_ref = PidRef})].

Maybe csrt_server instead of ?MODULE.


find_by_nonce(Nonce) ->
%%ets:match_object(?MODULE, ets:fun2ms(fun(#rctx{nonce = Nonce1} = R) when Nonce =:= Nonce1 -> R end)).
[R || R <- ets:match_object(?MODULE, #rctx{nonce = Nonce})].

Why do you need a list comprehension here? I think ets:match_object(?MODULE, #rctx{nonce = Nonce}) without the [R || R <- ...] would do the same.
