You are an expert in SQL and Tinybird. Follow these instructions when working with .datasource and .pipe files:
<command_calling>
You have commands at your disposal to develop a tinybird project:
- tb build: to build the project locally and check it works.
- tb deployment create --wait --auto: to create a deployment and promote it automatically
- tb test run: to run existing tests
- tb --build endpoint url <pipe_name>: to get the url of an endpoint, token included.
- tb --build endpoint data <pipe_name>: to get the data of an endpoint. You can pass parameters to the endpoint like this: tb --build endpoint data <pipe_name> --param1 value1 --param2 value2
- tb --build token ls: to list all the tokens
There are other commands that you can use, but these are the most common ones. Run `tb -h` to see all the commands if needed.
When you need to work with resources or data in the Tinybird environment that you updated with the build command, always add the --build flag right after tb. Example: tb --build datasource ls
When you need to work with resources or data in cloud, always add the --cloud flag right after tb. Example: tb --cloud datasource ls
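For reference, a typical local iteration might look like this (the pipe name my_endpoint_pipe and the parameter event_name are illustrative):
<example_command_workflow>
tb build
tb --build endpoint url my_endpoint_pipe
tb --build endpoint data my_endpoint_pipe --event_name page_view
tb test run
</example_command_workflow>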
</command_calling>
<development_instructions>
- When asked to create a Tinybird data project, if the needed folders are not already created, use the following structure:
├── copies
├── datasources
├── endpoints
├── fixtures
├── materializations
├── sinks
└── tests
- The local development server will be available at http://localhost:80. Even if a response uses another base URL, always use http://localhost:80.
- After every change in your .datasource, .pipe or .ndjson files, run `tb build` to build the project locally.
- When you need to ingest data locally into a datasource, create a .ndjson file with the same name as the datasource containing the data you want, and run `tb build` so the data is ingested.
- The format of the generated api endpoint urls is: http://localhost:80/v0/pipe/<pipe_name>.json?token=<token>
- Before running tests, make sure the project has been built with `tb build` including your latest changes.
</development_instructions>
When asked to ingest, add, or append data, do the following depending on the environment you want to work with:
<ingest_data_instructions>
- When building locally, create a .ndjson file with the data you want to ingest and run `tb build` to ingest the data into the build environment.
- We call `cloud` the production environment.
- When appending data in cloud, use `tb --cloud datasource append <datasource_name> <file_name>`
- When a response says "there are rows in quarantine", run `tb --build|--cloud datasource data <datasource_name>_quarantine` to understand what the problem is.
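- For reference, a minimal local fixture for a hypothetical events datasource might look like this (file named events.ndjson to match the datasource, placed in the fixtures folder); run `tb build` afterwards to ingest it:
<example_ndjson_fixture>
{"user_id": "u_1", "event": "page_view", "timestamp": "2024-01-01 00:00:00"}
{"user_id": "u_2", "event": "click", "timestamp": "2024-01-01 00:05:00"}
</example_ndjson_fixture>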
</ingest_data_instructions>
Follow these instructions when creating or updating .datasource files:
<datasource_file_instructions>
- Content cannot be empty.
- The datasource names must be unique.
- No indentation is allowed for property names: DESCRIPTION, SCHEMA, ENGINE, ENGINE_PARTITION_KEY, ENGINE_SORTING_KEY, etc.
- Use MergeTree engine by default.
- Use AggregatingMergeTree engine when the datasource is the target of a materialized pipe.
- Always use JSON paths to define the schema. Example: `user_id` String `json:$.user_id`,
</datasource_file_instructions>
Follow these instructions when creating or updating .pipe files:
<pipe_file_instructions>
- The pipe names must be unique.
- Nodes do NOT use the same name as the Pipe they belong to. If the pipe name is "my_pipe", name the nodes differently, e.g. "my_pipe_node_1", "my_pipe_node_2", etc.
- Avoid more than one node per pipe unless it is really necessary or requested by the user.
- No indentation is allowed for property names: DESCRIPTION, NODE, SQL, TYPE, etc.
- Allowed TYPE values are: endpoint, copy, materialized, sink
- Always declare the TYPE in the output node.
</pipe_file_instructions>
<sql_instructions>
- The SQL query must be a valid ClickHouse SQL query that mixes ClickHouse syntax and Tinybird templating syntax (Tornado templating language under the hood).
- SQL queries that use parameters must start with a "%" character followed by a newline at the top of the query to be able to use the parameters. Examples:
<invalid_query_with_parameters_no_%_on_top>
SELECT * FROM events WHERE session_id={{String(my_param, "default_value")}}
</invalid_query_with_parameters_no_%_on_top>
<valid_query_with_parameters_with_%_on_top>
%
SELECT * FROM events WHERE session_id={{String(my_param, "default_value")}}
</valid_query_with_parameters_with_%_on_top>
- Parameter functions like {{String(my_param_name, default_value)}} can be one of the following: String, DateTime, Date, Float32, Float64, Int, Integer, UInt8, UInt16, UInt32, UInt64, UInt128, UInt256, Int8, Int16, Int32, Int64, Int128, Int256
- Parameter names must be different from column names. Always pass the param name and a default value to the function.
- ALWAYS use hardcoded default values for parameters.
- Code inside a {{template_expression}} follows the rules of the Tornado templating language, so no module can be imported. For example, you can't use now() as the default value of a DateTime parameter; you need an if/else block like this:
<invalid_condition_with_now>
AND timestamp BETWEEN {{DateTime(start_date, now() - interval 30 day)}} AND {{DateTime(end_date, now())}}
</invalid_condition_with_now>
<valid_condition_without_now>
{%if not defined(start_date)%}
timestamp BETWEEN now() - interval 30 day
{%else%}
timestamp BETWEEN {{DateTime(start_date)}}
{%end%}
{%if not defined(end_date)%}
AND now()
{%else%}
AND {{DateTime(end_date)}}
{%end%}
</valid_condition_without_now>
- Parameters must not be quoted.
- When you use the defined function with a parameter inside, do NOT add quotes around the parameter:
<invalid_defined_function_with_parameter>{% if defined('my_param') %}</invalid_defined_function_with_parameter>
<valid_defined_function_without_parameter>{% if defined(my_param) %}</valid_defined_function_without_parameter>
- Use datasource names as table names when doing SELECT statements.
- Do not use pipe names as table names.
- The available datasource names to use in the SQL are the ones present in the existing_resources section or the ones you will create.
- Use node names as table names only when nodes are present in the same file.
- Do not reference the current node name in the SQL.
- SQL queries only accept SELECT statements with conditions, aggregations, joins, etc.
- Do NOT use CREATE TABLE, INSERT INTO, CREATE DATABASE, etc.
- Use ONLY SELECT statements in the SQL section.
- INSERT INTO is not supported in SQL section.
- General functions supported are: ['BLAKE3', 'CAST', 'CHARACTER_LENGTH', 'CHAR_LENGTH', 'CRC32', 'CRC32IEEE', 'CRC64', 'DATABASE', 'DATE', 'DATE_DIFF', 'DATE_FORMAT', 'DATE_TRUNC', 'DAY', 'DAYOFMONTH', 'DAYOFWEEK', 'DAYOFYEAR', 'FORMAT_BYTES', 'FQDN', 'FROM_BASE64', 'FROM_DAYS', 'FROM_UNIXTIME', 'HOUR', 'INET6_ATON', 'INET6_NTOA', 'INET_ATON', 'INET_NTOA', 'IPv4CIDRToRange', 'IPv4NumToString', 'IPv4NumToStringClassC', 'IPv4StringToNum', 'IPv4StringToNumOrDefault', 'IPv4StringToNumOrNull', 'IPv4ToIPv6', 'IPv6CIDRToRange', 'IPv6NumToString', 'IPv6StringToNum', 'IPv6StringToNumOrDefault', 'IPv6StringToNumOrNull', 'JSONArrayLength', 'JSONExtract', 'JSONExtractArrayRaw', 'JSONExtractBool', 'JSONExtractFloat', 'JSONExtractInt', 'JSONExtractKeys', 'JSONExtractKeysAndValues', 'JSONExtractKeysAndValuesRaw', 'JSONExtractRaw', 'JSONExtractString', 'JSONExtractUInt', 'JSONHas', 'JSONKey', 'JSONLength', 'JSONRemoveDynamoDBAnnotations', 'JSONType', 'JSON_ARRAY_LENGTH', 'JSON_EXISTS', 'JSON_QUERY', 'JSON_VALUE', 'L1Distance', 'L1Norm', 'L1Normalize', 'L2Distance', 'L2Norm', 'L2Normalize', 'L2SquaredDistance', 'L2SquaredNorm', 'LAST_DAY', 'LinfDistance', 'LinfNorm', 'LinfNormalize', 'LpDistance', 'LpNorm', 'LpNormalize', 'MACNumToString', 'MACStringToNum', 'MACStringToOUI', 'MAP_FROM_ARRAYS', 'MD4', 'MD5', 'MILLISECOND', 'MINUTE', 'MONTH', 'OCTET_LENGTH', 'QUARTER', 'REGEXP_EXTRACT', 'REGEXP_MATCHES', 'REGEXP_REPLACE', 'SCHEMA', 'SECOND', 'SHA1', 'SHA224', 'SHA256', 'SHA384', 'SHA512', 'SHA512_256', 'SUBSTRING_INDEX', 'SVG', 'TIMESTAMP_DIFF', 'TO_BASE64', 'TO_DAYS', 'TO_UNIXTIME', 'ULIDStringToDateTime', 'URLHash', 'URLHierarchy', 'URLPathHierarchy', 'UTCTimestamp', 'UTC_timestamp', 'UUIDNumToString', 'UUIDStringToNum', 'UUIDToNum', 'UUIDv7ToDateTime', 'YEAR', 'YYYYMMDDToDate', 'YYYYMMDDToDate32', 'YYYYMMDDhhmmssToDateTime', 'YYYYMMDDhhmmssToDateTime64']
- Case-insensitive functions supported are: ['cast', 'character_length', 'char_length', 'crc32', 'crc32ieee', 'crc64', 'database', 'date', 'date_format', 'date_trunc', 'day', 'dayofmonth', 'dayofweek', 'dayofyear', 'format_bytes', 'fqdn', 'from_base64', 'from_days', 'from_unixtime', 'hour', 'inet6_aton', 'inet6_ntoa', 'inet_aton', 'inet_ntoa', 'json_array_length', 'last_day', 'millisecond', 'minute', 'month', 'octet_length', 'quarter', 'regexp_extract', 'regexp_matches', 'regexp_replace', 'schema', 'second', 'substring_index', 'to_base64', 'to_days', 'to_unixtime', 'utctimestamp', 'utc_timestamp', 'year']
- Aggregate functions supported are: ['BIT_AND', 'BIT_OR', 'BIT_XOR', 'COVAR_POP', 'COVAR_SAMP', 'STD', 'STDDEV_POP', 'STDDEV_SAMP', 'VAR_POP', 'VAR_SAMP', 'aggThrow', 'analysisOfVariance', 'anova', 'any', 'anyHeavy', 'anyLast', 'anyLast_respect_nulls', 'any_respect_nulls', 'any_value', 'any_value_respect_nulls', 'approx_top_count', 'approx_top_k', 'approx_top_sum', 'argMax', 'argMin', 'array_agg', 'array_concat_agg', 'avg', 'avgWeighted', 'boundingRatio', 'categoricalInformationValue', 'contingency', 'corr', 'corrMatrix', 'corrStable', 'count', 'covarPop', 'covarPopMatrix', 'covarPopStable', 'covarSamp', 'covarSampMatrix', 'covarSampStable', 'cramersV', 'cramersVBiasCorrected', 'deltaSum', 'deltaSumTimestamp', 'dense_rank', 'entropy', 'exponentialMovingAverage', 'exponentialTimeDecayedAvg', 'exponentialTimeDecayedCount', 'exponentialTimeDecayedMax', 'exponentialTimeDecayedSum', 'first_value', 'first_value_respect_nulls', 'flameGraph', 'groupArray', 'groupArrayInsertAt', 'groupArrayIntersect', 'groupArrayLast', 'groupArrayMovingAvg', 'groupArrayMovingSum', 'groupArraySample', 'groupArraySorted', 'groupBitAnd', 'groupBitOr', 'groupBitXor', 'groupBitmap', 'groupBitmapAnd', 'groupBitmapOr', 'groupBitmapXor', 'groupUniqArray', 'histogram', 'intervalLengthSum', 'kolmogorovSmirnovTest', 'kurtPop', 'kurtSamp', 'lagInFrame', 'largestTriangleThreeBuckets', 'last_value', 'last_value_respect_nulls', 'leadInFrame', 'lttb', 'mannWhitneyUTest', 'max', 'maxIntersections', 'maxIntersectionsPosition', 'maxMappedArrays', 'meanZTest', 'median', 'medianBFloat16', 'medianBFloat16Weighted', 'medianDD', 'medianDeterministic', 'medianExact', 'medianExactHigh', 'medianExactLow', 'medianExactWeighted', 'medianGK', 'medianInterpolatedWeighted', 'medianTDigest', 'medianTDigestWeighted', 'medianTiming', 'medianTimingWeighted', 'min', 'minMappedArrays', 'nonNegativeDerivative', 'nothing', 'nothingNull', 'nothingUInt64', 'nth_value', 'ntile', 'quantile', 'quantileBFloat16', 'quantileBFloat16Weighted', 'quantileDD', 'quantileDeterministic', 'quantileExact', 'quantileExactExclusive', 'quantileExactHigh', 'quantileExactInclusive', 'quantileExactLow', 'quantileExactWeighted', 'quantileGK', 'quantileInterpolatedWeighted', 'quantileTDigest', 'quantileTDigestWeighted', 'quantileTiming', 'quantileTimingWeighted', 'quantiles', 'quantilesBFloat16', 'quantilesBFloat16Weighted', 'quantilesDD', 'quantilesDeterministic', 'quantilesExact', 'quantilesExactExclusive', 'quantilesExactHigh', 'quantilesExactInclusive', 'quantilesExactLow', 'quantilesExactWeighted', 'quantilesGK', 'quantilesInterpolatedWeighted', 'quantilesTDigest', 'quantilesTDigestWeighted', 'quantilesTiming', 'quantilesTimingWeighted', 'rank', 'rankCorr', 'retention', 'row_number', 'sequenceCount', 'sequenceMatch', 'sequenceNextNode', 'simpleLinearRegression', 'singleValueOrNull', 'skewPop', 'skewSamp', 'sparkBar', 'sparkbar', 'stddevPop', 'stddevPopStable', 'stddevSamp', 'stddevSampStable', 'stochasticLinearRegression', 'stochasticLogisticRegression', 'studentTTest', 'sum', 'sumCount', 'sumKahan', 'sumMapFiltered', 'sumMapFilteredWithOverflow', 'sumMapWithOverflow', 'sumMappedArrays', 'sumWithOverflow', 'theilsU', 'topK', 'topKWeighted', 'uniq', 'uniqCombined', 'uniqCombined64', 'uniqExact', 'uniqHLL12', 'uniqTheta', 'uniqUpTo', 'varPop', 'varPopStable', 'varSamp', 'varSampStable', 'welchTTest', 'windowFunnel']
- Do not use any function that is not present in the lists of general functions, case-insensitive functions, and aggregate functions above.
- If a function is not in those lists, the SQL query will fail, so avoid using it at all costs.
- When aliasing a column, write the column name first and then the alias.
- General functions and aggregate functions are case-sensitive.
- Case-insensitive functions can be written in any case.
- Parameters are never quoted in any case.
</sql_instructions>
<datasource_content>
DESCRIPTION >
Some meaningful description of the datasource
SCHEMA >
`column_name_1` clickhouse_tinybird_compatible_data_type `json:$.column_name_1`,
`column_name_2` clickhouse_tinybird_compatible_data_type `json:$.column_name_2`,
...
`column_name_n` clickhouse_tinybird_compatible_data_type `json:$.column_name_n`
ENGINE "MergeTree"
ENGINE_PARTITION_KEY "partition_key"
ENGINE_SORTING_KEY "sorting_key_1, sorting_key_2, ..."
</datasource_content>
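For example, a hypothetical events datasource following this format (column names and types are illustrative, matching the fixture example in the ingest instructions above):
<example_datasource_content>
DESCRIPTION >
Raw application events ingested as they happen
SCHEMA >
`user_id` String `json:$.user_id`,
`event` String `json:$.event`,
`timestamp` DateTime `json:$.timestamp`
ENGINE "MergeTree"
ENGINE_PARTITION_KEY "toYYYYMM(timestamp)"
ENGINE_SORTING_KEY "timestamp, user_id"
</example_datasource_content>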
<pipe_content>
DESCRIPTION >
Some meaningful description of the pipe
NODE node_1
SQL >
[SQL query using ClickHouse syntax and Tinybird templating syntax; it must start with SELECT, or with % on the first line followed by the SELECT when it uses parameters]
TYPE endpoint
</pipe_content>
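For example, a hypothetical endpoint pipe (say, events_per_day.pipe) over the events datasource above, using a parameter with a hardcoded default and a node name different from the pipe name:
<example_pipe_content>
DESCRIPTION >
Event counts per day, filterable by event name
NODE events_per_day_node
SQL >
%
SELECT toStartOfDay(timestamp) day, count() total_events
FROM events
WHERE event = {{String(event_name, 'page_view')}}
GROUP BY day
TYPE endpoint
</example_pipe_content>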
<copy_pipe_instructions>
- Do not create copy pipes by default, unless the user asks for it.
- In a .pipe file you can define how to export the result of a Pipe to a Data Source, optionally with a schedule.
- Do not include COPY_SCHEDULE in the .pipe file if it is not requested by the user.
- COPY_SCHEDULE is a cron expression that defines the schedule of the copy pipe.
- COPY_SCHEDULE is optional and if not provided, the copy pipe will be executed only once.
- TARGET_DATASOURCE is the name of the Data Source to export the result to.
- TYPE COPY is the type of the pipe and it is mandatory for copy pipes.
- If the copy pipe uses parameters, you must include the % character and a newline at the top of the query to be able to use the parameters.
- The content of the .pipe file must follow this format:
DESCRIPTION Copy Pipe that exports hourly sales to the sales_hour_copy Data Source every hour
NODE daily_sales
SQL >
%
SELECT toStartOfDay(starting_date) day, country, sum(sales) as total_sales
FROM teams
WHERE
day BETWEEN toStartOfDay(now()) - interval 1 day AND toStartOfDay(now())
and country = {{ String(country_filter, 'US') }}
GROUP BY day, country
TYPE COPY
TARGET_DATASOURCE sales_hour_copy
COPY_SCHEDULE 0 * * * *
</copy_pipe_instructions>
<materialized_pipe_instructions>
- Do not create materialized pipes by default, unless the user asks for it.
- In a .pipe file you can define how each row ingested into the earliest Data Source in the Pipe query is materialized into a target Data Source. Materialization happens at ingest time.
- DATASOURCE: Required when TYPE is MATERIALIZED. Sets the destination Data Source for materialized nodes.
- TYPE MATERIALIZED is the type of the pipe and it is mandatory for materialized pipes.
- The content of the .pipe file must follow the format shown in the materialized_pipe_content section below.
- Use State modifier for the aggregated columns in the pipe.
- Keep the SQL query simple and avoid using complex queries with joins, subqueries, etc.
</materialized_pipe_instructions>
<materialized_pipe_content>
DESCRIPTION Materialized Pipe to aggregate sales per hour in the sales_by_hour Data Source
NODE daily_sales
SQL >
SELECT toStartOfDay(starting_date) day, country, sumState(sales) as total_sales
FROM teams
GROUP BY day, country
TYPE MATERIALIZED
DATASOURCE sales_by_hour
</materialized_pipe_content>
<target_datasource_instructions>
- The target datasource of a materialized pipe must have an AggregatingMergeTree engine.
- Use AggregateFunction for the aggregated columns in the pipe.
- Pipes using a materialized data source must use the Merge modifier in the SQL query for the aggregated columns. Example: sumMerge(total_sales)
- Put all dimensions in the ENGINE_SORTING_KEY, sort first by the lowest granularity and then by the highest.
- If the ENGINE_SORTING_KEY has more than 4 columns, add an ENGINE_PRIMARY_KEY with the first 3 columns of the ENGINE_SORTING_KEY.
</target_datasource_instructions>
<target_datasource_content>
SCHEMA >
`total_sales` AggregateFunction(sum, Float64),
`column_name_2` AggregateFunction(avg, Float64),
`dimension_1` String,
`dimension_2` String,
...
`date` DateTime
ENGINE "AggregatingMergeTree"
ENGINE_PARTITION_KEY "toYYYYMM(date)"
ENGINE_SORTING_KEY "date, dimension_1, dimension_2, ..."
</target_datasource_content>
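For example, a hypothetical endpoint pipe reading from the sales_by_hour Data Source produced by the materialized pipe above would apply the Merge modifier to the aggregated column:
<example_endpoint_over_materialized_content>
NODE sales_summary_node
SQL >
SELECT day, country, sumMerge(total_sales) total_sales
FROM sales_by_hour
GROUP BY day, country
TYPE endpoint
</example_endpoint_over_materialized_content>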
<test_file_instructions>
Follow these instructions when creating or updating .yaml files for tests:
- The test file name must match the name of the pipe it is testing.
- Every scenario name must be unique inside the test file.
- When looking for the available parameters, you will find them in the pipes in the following format: {{String(my_param_name, default_value)}}.
- If there are no parameters, you can omit parameters and generate a single test.
- The format of the parameters is the following: param1=value1&param2=value2&param3=value3
- If some parameters are provided by the user and you need to use them, preserve them exactly as provided, including case.
- Test as many scenarios as possible.
- The format of the test file is the following:
<test_file_format>
- name: kpis_single_day
description: Test hourly granularity for a single day
parameters: date_from=2024-01-01&date_to=2024-01-01
expected_result: |
{"date":"2024-01-01 00:00:00","visits":0,"pageviews":0,"bounce_rate":null,"avg_session_sec":0}
{"date":"2024-01-01 01:00:00","visits":0,"pageviews":0,"bounce_rate":null,"avg_session_sec":0}
- name: kpis_date_range
description: Test daily granularity for a date range
parameters: date_from=2024-01-01&date_to=2024-01-31
expected_result: |
{"date":"2024-01-01","visits":0,"pageviews":0,"bounce_rate":null,"avg_session_sec":0}
{"date":"2024-01-02","visits":0,"pageviews":0,"bounce_rate":null,"avg_session_sec":0}
- name: kpis_default_range
description: Test default behavior without date parameters (last 7 days)
parameters: ''
expected_result: |
{"date":"2025-01-10","visits":0,"pageviews":0,"bounce_rate":null,"avg_session_sec":0}
{"date":"2025-01-11","visits":0,"pageviews":0,"bounce_rate":null,"avg_session_sec":0}
- name: kpis_fixed_time
description: Test with fixed timestamp for consistent testing
parameters: fixed_time=2024-01-15T12:00:00
expected_result: ''
</test_file_format>
</test_file_instructions>