You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
I realized that several users are confused by create_distributed_function and its purpose. I think the main confusion comes from this: create_distributed_table makes things faster, so let's try create_distributed_function to make functions faster. Also, they refer to this blog and probably skimming through it and conclude it makes things faster.
It'd be really nice if we can improve the documentation on this topic so that users do not try to distribute functions unnecessarily / with wrong expectations.
(below copy & paste from an internal e-mail)
What are the typical use cases?
There 3 two main use cases for distributed functions, and as far as I can tell, the function you try to distribute does not fall into any of the categories.
The primary use case for distributed functions with distribution_arg_name and colocate_with parameters:
It is mostly useful if you need to scale OLTP workloads that are written as stored procedures.
Say, you have the following stored procedure where all the commands are scoped to per tenant/distribution key:
CREATEFUNCTIONprocedure_to_pushdown (int dist_key, …)
BEGIN;
INSERT INTO dist_table (dist_key, …) .. VALUES ($1_;
UPDATE dist_table … WHERE dist_key = $1;
DELETEFROM collocated_dist_table WHEREWHERE dist_key = $1;
…
COMMIT;
end function;
If you do not create distributed functions, each statement will become 1 round-trip to the worker node from the coordinator.
Instead, if you convert this function to a distributed function by providing relevant parameters, the procedure call will be routed to the correct node (e.g., we can deduct it from the distribution parameter of the function).
With that, Citus needs only 1 round-trip to the worker node per procedure call. This might save a lot of round-trips, especially if the function has lots of commands in it.
And then, you get all the benefits described in this blog
Secondary use-case for distributed function:
If you have very simple functions that DO NOT return rows, but can only be used in the SELECT clause. Say you have this value_plus_one() function. If you use this function on a distributed query, it’ll use this function on the shards and you’ll get an error (with some logs enabled):
SELECT value_plus_one(key) FROM distributed_table;
….
NOTICE: issuing SELECTpublic.value_plus_one(a) AS value_plus_one FROMpublic.test_102012 test WHERE true
ERROR: function public.value_plus_one(integer) does not exist
HINT: No function matches the given name and argument types. You might need to add explicit type casts.
CONTEXT: while executing command on localhost:9705
As a convenience to be able to create the functions on the workers, you can use create_distributed_function without distribution arguments:
I realized that several users are confused by
create_distributed_function
and its purpose. I think the main confusion comes from this:create_distributed_table
makes things faster, so let's trycreate_distributed_function
to make functions faster. Also, they refer to this blog and probably skimming through it and conclude it makes things faster.It'd be really nice if we can improve the documentation on this topic so that users do not try to distribute functions unnecessarily / with wrong expectations.
(below copy & paste from an internal e-mail)
What are the typical use cases?
There 3 two main use cases for distributed functions, and as far as I can tell, the function you try to distribute does not fall into any of the categories.
It is mostly useful if you need to scale OLTP workloads that are written as stored procedures.
Say, you have the following stored procedure where all the commands are scoped to per tenant/distribution key:
If you do not create distributed functions, each statement will become 1 round-trip to the worker node from the coordinator.
Instead, if you convert this function to a distributed function by providing relevant parameters, the procedure call will be routed to the correct node (e.g., we can deduct it from the distribution parameter of the function).
SELECT create_distributed_function(' procedure_to_pushdown (int)', ‘dist_key’, collocate_with:‘distributed_table’);
With that, Citus needs only 1 round-trip to the worker node per procedure call. This might save a lot of round-trips, especially if the function has lots of commands in it.
And then, you get all the benefits described in this blog
If you have very simple functions that DO NOT return rows, but can only be used in the SELECT clause. Say you have this value_plus_one() function. If you use this function on a distributed query, it’ll use this function on the shards and you’ll get an error (with some logs enabled):
As a convenience to be able to create the functions on the workers, you can use
create_distributed_function
without distribution arguments:With that, the function will be available on the worker, and the error can be avoided:
In other words, this is mostly equivalent for running CREATE FUNCTION .. commands on the workers, no magic happens.
/* todo: fill up later */
The text was updated successfully, but these errors were encountered: