Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft: ✨ add a first working draft of auto_aggregate selection #33

Open
wants to merge 11 commits into
base: master
Choose a base branch
from

Conversation

Xasin
Copy link

@Xasin Xasin commented Jun 3, 2023

This PR will add one new file which will create new functions to allow for the execution of an automatic downsampling of data for TimescaleDB hypertables and continuous aggregates.

Functionality of the downsampling method is documented in the utils/aggregate_selector.sql file.
There are no dependencies with other systems, and no internal state of TimescaleDB is modified, as such there should be little conflict with pre-existing systems.

  • Please confirm that the naming of the functions is appropriate for their use case, as I was not sure about existing naming conventions.
  • Please also detail whether or not functions need to be documented in the docs/ folder, or if in-source documentation as provided is sufficient.

@jonatas
Copy link

jonatas commented Jun 7, 2023

Hi @Xasin, very good addition! So happy to see it!

Would you mind adding some example of this in action?

Probably improving the README with some instructions would be good for test, review and future users.

@Xasin
Copy link
Author

Xasin commented Jun 7, 2023

On it!
I'll copy the documentation and add an example section for this function in the docs/ folder, and add a link to the main README

@Xasin
Copy link
Author

Xasin commented Jun 7, 2023

Preliminary docs entry and a link from the README to the docs file added.
I'd appreciate some feedback on the description I am using for the function. I think it's a bit hard for me to understand if its purpose and functionality comes across properly, partly because I know the internals of the function quite well.

aggregate_choices JSONB,
groupby_clause TEXT, filter_query TEXT,
hypertable_schema TEXT DEFAULT 'public',
time_column TEXT DEFAULT 'time')

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On thought, would it make sense to add maybe a debug flag and do a raise notice on the actual query that is run?
That way the user could just grab it and run explain analyze and see what is going on. The function ends up hiding that.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I've noticed that it's rather tricky to figure out what is happening inside a function...

I suppose I could reformat the query a bit to first generate a string and then optionally raise that string as debug flag if a value is set. It's not too much of a change and could definitely really help...

Thanks!

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you have a moment, try the new pushed change.
I hope it works right

@Xasin Xasin marked this pull request as ready for review June 17, 2023 09:54
Comment on lines 209 to 210
RAISE NOTICE 'Generated query output:'
RAISE NOTICE query_construct
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems there's a bug trying to load this lines:

jonatasdp@MacBook-Pro-3 ~/c/t/t/utils (Xasin/master)> psql playground -f auto_downsample.sql
Expanded display is used automatically.
Border style is 2.
Line style is unicode.
CREATE FUNCTION
CREATE FUNCTION
CREATE FUNCTION
CREATE FUNCTION
psql:auto_downsample.sql:215: ERROR:  syntax error at or near "RAISE"
LINE 47:   RAISE NOTICE query_construct
           ^
           ```

utils/auto_downsample.sql Outdated Show resolved Hide resolved
Copy link

@jonatas jonatas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was trying to test it with the following hypertable:

\d ticks
                  Table "toolkit_experimental.ticks"
┌────────┬──────────────────────────┬───────────┬──────────┬─────────┐
│ Column │           Type           │ Collation │ Nullable │ Default │
├────────┼──────────────────────────┼───────────┼──────────┼─────────┤
│ timetimestamp with time zone │           │ not null │         │
│ symbol │ text                     │           │          │         │
│ price  │ numeric                  │           │          │         │
│ volume │ double precision         │           │          │         │
└────────┴──────────────────────────┴───────────┴──────────┴─────────┘
Indexes:
    "ticks_time_idx" btree ("time" DESC)
Triggers:
    ts_insert_blocker BEFORE INSERT ON ticks FOR EACH ROW EXECUTE FUNCTION _timescaledb_internal.insert_blocker()
Number of child tables: 1 (Use \d+ to list them.)

Now, creating the auto downsample:

SELECT *
FROM auto_downsample(
                                'ticks',
                                INTERVAL '10m',
                                $aggregate_options$
                                        [
                                                {"with_columns": ["price"],  "aggregate":"avg(price) AS value"}
                                        ]
                                $aggregate_options$,
                                'symbol',
                                $where_clause$
                                                                WHERE time BETWEEN NOW()-INTERVAL'30d' AND NOW()-INTERVAL'10d'
                                $where_clause$)

                                AS (time TIMESTAMP, symbol text, value DOUBLE PRECISION);
                                

I got the following error:

 Using parameter set <NULL>
ERROR:  No aggregator given!
HINT:  Supply a "aggregate" field in the JSON aggregate object
CONTEXT:  PL/pgSQL function auto_downsample(text,interval,jsonb,text,text,text,text,boolean) line 21 at RAISE

Why can't I get it with a single aggregates? Maybe I'm missing something or my context is not suitable for this 🤔

@Xasin
Copy link
Author

Xasin commented Jun 19, 2023

@jonatas I think the problem there is that you are using a different schema than the "public" schema. The schema name can be given via the hypertable_schema named parameter.

I am assuming "public" is the default schema, but perhaps there is a better way, e.g. using the currently "active" schema, if such a thing exists?

What caught me off guard is that it's raising the wrong error. I thought I had a guard in check that specifically detects when it can't find a fitting table and give an appropriate error message, but it seems that it didn't work.
I tweaked it so it now correctly says when it can't find the hypertable, the error message looks like this:

ERROR:  No fitting hypertable or aggregate found for given columns and table "public"."kernel"!
HINT:  Make sure the 'with_columns' field of the aggregate options is correct, and that `hypertable` and `hypertable_schema` are correct!

It seems to work fine on a Hypertable with no continuous aggregates for me.

@jonatas
Copy link

jonatas commented Jun 19, 2023

Thanks for the fix @Xasin.

Just trying it again:

CREATE TABLE ticks
( time TIMESTAMP NOT NULL,
    symbol varchar,
    price decimal,
    volume int);
SELECT create_hypertable('ticks', 'time');

And then:

SELECT *
FROM auto_downsample(
                                'ticks',
                                INTERVAL '10m',
                                $aggregate_options$
                                        [
                                                {"with_columns": ["price"],  "aggregate":"avg(price) AS value"}
                                        ]
                                $aggregate_options$,
                                'symbol',
                                $where_clause$
                                                                WHERE time BETWEEN NOW()-INTERVAL'30d' AND NOW()-INTERVAL'10d'
                                $where_clause$)

                                AS (time TIMESTAMP, symbol text, value DOUBLE PRECISION);

Now, after reloading I have the following error:

ERROR:  function auto_downsample(unknown, interval, unknown, unknown, unknown) does not exist
LINE 2: FROM auto_downsample(
             ^
HINT:  No function matches the given name and argument types. You might need to add explicit type casts.

@Xasin
Copy link
Author

Xasin commented Jun 19, 2023

@jonatas very strange, I am getting an error as well, but it's quite different.

When I run your commands 1:1 in my psql prompt (on a PostgreSQL 13.11 server), the error message is the following:

NOTICE:  Using parameter set {"aggregate": "avg(price) AS value", "table_name": "ticks", "table_schema": "public", "with_columns": ["price"], "table_interval": "00:00:00", "interval_matched": false}
ERROR:  structure of query does not match function result type
DETAIL:  Returned type character varying does not match expected type text in column 2.
CONTEXT:  PL/pgSQL function auto_downsample(text,interval,jsonb,text,text,text,text,boolean) line 42 at RETURN QUERY

But after fixing those by replacing the AS () with the proper types (i.e. CHARACTER VARYING and NUMERIC) it actually works fine for me.
Maybe different PostgreSQL versions pull the AS ( ) statement into the function definition itself and throw a less understandable error on return type missmatch?

@jonatas
Copy link

jonatas commented Jun 20, 2023

I'm on Postgresql 14.7. Maybe we could leave a complete example running in the examples, just to allow people to try to replicate it and then adapt to their needs. Can you share the changes you did just to make sure I'm following all the replaces in the AS ( ) that you mentioned?

@Xasin
Copy link
Author

Xasin commented Jun 20, 2023

@jonatas the edited "AS" clause has to have the exact same types as the return types of the query, so it looks like:
AS (time TIMESTAMP, symbol character varying, value numeric);
for the case you gave me. varchar and text are different, and the aggregation functions seem to return numeric.

I also confirmed that I have no other typedef of the function that might be hiding the issue, so I'm sure I only have the auto_downsample function version of the provided SQL file.

I'll update a fully working example in a moment.

Copy link

@jonatas jonatas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for all your help here! very good to have this helper around 🚀 🙇

Tested manually ✅

@Xasin
Copy link
Author

Xasin commented Jun 20, 2023

I'm really happy to help!
It might come in quite handy at some of my work soon, which is very exciting. In any case, I am more than happy to give back to such a helpful too.

I don't think there's anything else from me left to change, so unless there's something else that's needed I think this can be merged in.

@jonatas jonatas requested a review from noctarius June 21, 2023 22:14
@jonatas
Copy link

jonatas commented Jun 21, 2023

Amazing! Thanks again! Adding Chris to have a second 👀 and if @d-sooter has any feedback let us know before we merge it.

@nimbit-software
Copy link

So i played around a bit with the function and i love it. And will be using it :) One issue i had is that i join my hypertable/Cagg on a metadata table. And since i dont know which table will be selected i cant really perform my join.

For my situation i just added a fixed alias to the table so that it can be referenced if needed in a join but im not sure if it would make sense to allow an alias to be passed for the table or to have a fixed alias

@Xasin
Copy link
Author

Xasin commented Jun 26, 2023

@nimbit-software hm... That is an interesting problem. Would it help if I aliased the selected table to a common name inside the internal query, so that it's consistently referenceable for JOINs?

For now, can you try something for me?
The way the WHERE clause query is injected into the SQL it's appended directly after the FROM clause, so perhaps if you start the filter clause with a AS table_name, the AS will be appended directly to the FROM table to give it an alias?

EDIT: Maybe that's what you already did.
I think internally assigning a fixed table alias is a cleaner solution.

@d-sooter
Copy link

d-sooter commented Jun 26, 2023

so basically this is what i did.

i set a static alias "timeseries" and then i can perform my join. I can imagine cases where people would want to pass in their own alias, but then we have another input parameter and its starts getting a bit full.

query_construct := format($qry$ SELECT time_bucket(%L, %I) AS time, %s, %s FROM %I.%I as timeseries %s GROUP BY 1, %s ORDER BY 1 $qry$, data_interval, time_column, groupby_clause, aggregator_column, selected_parameters->>'table_schema', selected_parameters->>'table_name', filter_query, groupby_clause);

we could check the table name and if there is a space in it then split it up and set the alias.

CREATE OR REPLACE FUNCTION auto_downsample(hypertable TEXT, data_interval INTERVAL,
	aggregate_choices JSONB, 
	groupby_clause TEXT, filter_query TEXT,
	hypertable_schema TEXT DEFAULT 'public',
	time_column TEXT DEFAULT 'time',
	debug_query BOOLEAN DEFAULT FALSE)
RETURNS SETOF RECORD
LANGUAGE plpgsql
STABLE
AS $BODY$
DECLARE
	selected_parameters jsonb;
	aggregator_column TEXT;
	query_construct TEXT;
	hypertable_internal text;
  	hypertable_alias text;
  	parts text[];
BEGIN
	  IF hypertable ~* '\s+AS\s+' THEN
    	parts := regexp_split_to_array(hypertable, '\s+AS\s+', 'i');
	  ELSE
		parts := regexp_split_to_array(hypertable, '\s+', 'i');
	  END IF;

	  -- Extract the table name and alias
	  hypertable_internal := parts[1];
	  hypertable_alias := parts[array_length(parts, 1)];



	SELECT aggregate_choice(hypertable_internal, data_interval, aggregate_choices, hypertable_schema) INTO selected_parameters;

	IF selected_parameters IS NULL THEN
		RAISE EXCEPTION $$No fitting hypertable or aggregate found for given columns and table "%"."%"!$$,
			hypertable_schema, hypertable_internal
			USING HINT = $$Make sure the 'with_columns' field of the aggregate options is correct, and that `hypertable` and `hypertable_schema` are correct!$$
		RETURN;
	END IF;

	RAISE NOTICE 'Using parameter set %', selected_parameters;

	aggregator_column := selected_parameters->>'aggregate';

	IF aggregator_column IS NULL THEN
		RAISE EXCEPTION 'No aggregator given!' USING HINT = 'Supply a "aggregate" field in the JSON aggregate object'
		RETURN;
	END IF;

	query_construct := format($qry$
			SELECT time_bucket(%L, %I) AS time, %s, %s
			FROM %I.%I as %s
			%s
			GROUP BY 1, %s
			ORDER BY 1
		$qry$, data_interval, time_column, groupby_clause, aggregator_column,
		selected_parameters->>'table_schema', selected_parameters->>'table_name', hypertable_alias,
		filter_query,
		groupby_clause);

	IF debug_query THEN
		RAISE NOTICE 'Generated query output: %', query_construct;
	END IF;

	RETURN QUERY EXECUTE query_construct;
END;
$BODY$;

@Xasin
Copy link
Author

Xasin commented Jun 26, 2023

@d-sooter that doesn't sound too bad.
"timeseries" should be OK, but it almost sounds like a potential internal keyword?

It doesn't actually matter how many parameters the function has. PLPGSQL functions support named parameters, where individual parameters can be accessed through e.g. some_function(parameter_a => 10, parameter_c => 32)
This keeps it fairly readable even with higher parameter counts, and if there is a "sane default" that most people will use (like how the default schema is public and the default time column name is time), it's not that bad to add more arguments.

I don't feel comfortable with the RegExp solution, it can get a bit messy and harder to understand than an explicit parameter like that.

@nimbit-software
Copy link

I had one last idea. I already implemented it in my system and it works good. That is gapfill

I added a parameter to the function

bucket_function TEXT DEFAULT 'time_bucket',

and then just updated the query

          query_construct := format($qry$
                        SELECT %I(%L, %I) AS time, %s, %s
                        FROM %I.%I AS %I
                        %s
                        GROUP BY 1, %s
                        ORDER BY 1
                    $qry$, bucket_function, data_interval, time_column, groupby_clause, aggregator_column,
                    selected_parameters->>'table_schema', selected_parameters->>'table_name',
                    table_alias,
                    filter_query,
                    groupby_clause);

Since the where is passed it it makes it easy to use the interpolation or locf functions as well.

I tested it and it works like a charm

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants