Commit

Merge branch 'main' into list_environments
lafirm authored Feb 9, 2025
2 parents ec4e758 + 76a8c24 commit 4576b7d
Showing 81 changed files with 1,406 additions and 521 deletions.
18 changes: 11 additions & 7 deletions .circleci/manage-test-db.sh
@@ -97,12 +97,16 @@ bigquery_init() {
echo "$BIGQUERY_KEYFILE_CONTENTS" > $BIGQUERY_KEYFILE
}

bigquery_up() {
echo "BigQuery doesnt support creating databases"
}

bigquery_down() {
echo "BigQuery doesnt support dropping databases"
# Clickhouse cloud
clickhouse-cloud_init() {
# note: the ping endpoint doesn't seem to need any API keys
until curl https://$CLICKHOUSE_CLOUD_HOST:8443/ping
do
echo "Pinging Clickhouse Cloud service to ensure it's not in idle mode..."
sleep 5
done
echo "Clickhouse Cloud instance $CLICKHOUSE_CLOUD_HOST is up and running"
}

INIT_FUNC="${ENGINE}_init"
@@ -118,10 +122,10 @@ fi
echo "Initializing $ENGINE"
$INIT_FUNC

if [ "$DIRECTION" == "up" ]; then
if [ "$DIRECTION" == "up" ] && function_exists $UP_FUNC; then
echo "Creating database $DB_NAME"
$UP_FUNC $DB_NAME
elif [ "$DIRECTION" == "down" ]; then
elif [ "$DIRECTION" == "down" ] && function_exists $DOWN_FUNC; then
echo "Dropping database $DB_NAME"
$DOWN_FUNC $DB_NAME
fi
2 changes: 1 addition & 1 deletion Makefile
@@ -211,7 +211,7 @@ redshift-test: guard-REDSHIFT_HOST guard-REDSHIFT_USER guard-REDSHIFT_PASSWORD g
pytest -n auto -x -m "redshift" --retries 3 --junitxml=test-results/junit-redshift.xml

clickhouse-cloud-test: guard-CLICKHOUSE_CLOUD_HOST guard-CLICKHOUSE_CLOUD_USERNAME guard-CLICKHOUSE_CLOUD_PASSWORD engine-clickhouse-install
pytest -n auto -x -m "clickhouse_cloud" --retries 3 --junitxml=test-results/junit-clickhouse-cloud.xml
pytest -n 1 -m "clickhouse_cloud" --retries 3 --junitxml=test-results/junit-clickhouse-cloud.xml

athena-test: guard-AWS_ACCESS_KEY_ID guard-AWS_SECRET_ACCESS_KEY guard-ATHENA_S3_WAREHOUSE_LOCATION engine-athena-install
pytest -n auto -x -m "athena" --retries 3 --retry-delay 10 --junitxml=test-results/junit-athena.xml
40 changes: 36 additions & 4 deletions docs/cloud/features/observability/prod_environment.md
@@ -2,19 +2,42 @@

A data transformation system's most important component is the production environment, which provides the data your business runs on.

Tobiko Cloud makes it easy to understand your production environment, embedding three observability features directly on your project's homepage:
When you first log in to Tobiko Cloud, you'll see the production environment page. This page shows you at a glance if your data systems are working properly.

It helps data teams quickly check their work without having to dig through complicated logs - just look at the visual dashboard, and you'll know if everything is running smoothly.

![tcloud prod env](./prod_environment/tcloud_prod_environment.png)

## When you might use this

**After a production update**

The dashboard helps you check if your recent updates to production are working correctly. It uses a simple color system to show you what's happening: green means everything is good, and red shows where there might be problems.

If you see red in your current run, plan or freshness, it means there's a problem that needs your attention. Don't worry about red marks from the past (in the historical and previous runs/plans) - these are old issues that have already been fixed.

Best part? You can check all of this in about 5-10 seconds.

**Quick cost check**

The homepage also displays cost metrics for your production environment, a feature exclusive to production (not available in development environments). This allows you to quickly understand and monitor your team's model execution costs without diving into detailed reports.

## Observing production

Tobiko Cloud makes it easy to understand your production environment, embedding four observability features directly on your project's homepage:

1. [Model Freshness chart](./model_freshness.md)
2. Runs and plans chart
3. Recent activity table
4. Warehouse costs overview

![tcloud prod env](./prod_environment/tcloud_prod_environment_labelled.png)

!!! Note

Model freshness has its own feature page - learn more [here](./model_freshness.md)!

## Runs and Plans Chart
### Runs and Plans Chart

SQLMesh performs two primary actions: running the project's models on a cadence and applying plans to update the project's content/behavior.

@@ -31,24 +54,33 @@ Each day displays zero or more vertical bars representing `run` duration. If no
The chart's `y-axis` represents `run` duration. The height of each `run`'s bar corresponds to its duration, allowing you to quickly assess execution times.

For example, consider the leftmost entry in the figure above:

- The label at the top of the chart shows that it represents November 26
- The entry consists of a single green bar, which tells us that one successful `run` occurred
- The bottom of the bar begins at 0 seconds on the `y-axis`, and the top of the bar ends at 20 seconds, telling us the `run` took 20 seconds to execute

In contrast, consider the rightmost entry in the figure above:

- The label at the top of the chart shows that it represents December 9
- The entry contains two green bars, which tells us that two successful `run`s occurred
- The lower bar begins at 0 seconds on the `y-axis` and reaches up to 13 seconds, telling us the `run` took 13 seconds to execute
- The upper bar begins at 13 seconds on the `y-axis` and reaches up to 22 seconds, telling us that the `run` took 22 - 13 = 9 seconds to execute

Learn more about a `run` or `plan` by hovering over its bar, which displays a link to its page, its start and end times, and its duration.

## Recent Activity Table
### Recent Activity Table

The recent activity table provides comprehensive information about recent project activities, displaying both `run`s and `plan`s in chronological order. This provides a more granular view than the runs and plans chart.

For each activity entry, you can view its completion status, estimated cost of execution (BigQuery and Snowflake engines only), total duration from start to finish, start and completion times, and a unique identification hash for reference purposes.

![tcloud recent activity](./prod_environment/recent_activity.png)

The table provides the ability to filter which rows are displayed by typing into the text box in the top right. This helps you locate specific information within the activity log, making it easier to find and analyze particular events or patterns in your system's operational history.

### Warehouse Costs Overview

Managing data warehouse costs can be complex. Tobiko Cloud simplifies this by monitoring costs directly. For BigQuery and Snowflake projects, it tracks cost estimates per model and calculates savings from avoided model reruns.

The costs and savings summary information and chart display the costs to run and host all the models in your production environment over the last 30 days. This provides a great way to quickly see increases and decreases in daily running costs. To learn more, [check out the cost savings docs](../costs_savings.md).

![tcloud recent activity](./prod_environment/costs.png)
(4 changed files could not be displayed.)
10 changes: 6 additions & 4 deletions docs/cloud/tcloud_getting_started.md
@@ -1,5 +1,7 @@
# Tobiko Cloud: Getting Started

<div style="position: relative; padding-bottom: 56.25%; height: 0;"><iframe src="https://www.loom.com/embed/bfd7da9166324b0e987c3824f824d929?sid=5d14308f-1e2c-4883-8ac0-29173a698f71" frameborder="0" webkitallowfullscreen mozallowfullscreen allowfullscreen style="position: absolute; top: 0; left: 0; width: 100%; height: 100%;"></iframe></div>

Tobiko Cloud is a data platform that extends SQLMesh to make it easy to manage data at scale without the waste.

We're here to make it easy to get started and feel confident that everything is working as expected. After you've completed the steps below, you'll have achieved the following:
@@ -225,10 +227,10 @@ Now we're ready to connect your data warehouse to Tobiko Cloud:
skip_pr_backfill: false
enable_deploy_command: true
auto_categorize_changes:
external: full
python: full
sql: full
seed: full
external: full
python: full
sql: full
seed: full
# preview data for forward only models
plan:
24 changes: 24 additions & 0 deletions docs/concepts/macros/jinja_macros.md
@@ -50,6 +50,30 @@ JINJA_STATEMENT_BEGIN;
JINJA_END;
```

## SQLMesh predefined variables

SQLMesh provides multiple [predefined macro variables](./macro_variables.md) you may reference in jinja code.

Some predefined variables provide information about the SQLMesh project itself, like the [`runtime_stage`](./macro_variables.md#runtime-variables) and [`this_model`](./macro_variables.md#runtime-variables) variables.

Other predefined variables are [temporal](./macro_variables.md#temporal-variables), like `start_ds` and `execution_date`. They are used to build incremental model queries and are only available in incremental model kinds.

Access predefined macro variables by passing their unquoted name in curly braces. For example, this demonstrates how to access the `start_ds` and `end_ds` variables:

```sql linenums="1"
JINJA_QUERY_BEGIN;

SELECT *
FROM table
WHERE time_column BETWEEN '{{ start_ds }}' and '{{ end_ds }}';

JINJA_END;
```

Because the two macro variables return string values, we must surround the curly braces with single quotes `'`. Other macro variables, such as `start_epoch`, return numeric values and do not require the single quotes.

The `gateway` variable uses a slightly different syntax than other predefined variables because it is a function call. Instead of the bare name `{{ gateway }}`, it must include parentheses: `{{ gateway() }}`.

## User-defined variables

SQLMesh supports two kinds of user-defined macro variables: global and local.
2 changes: 2 additions & 0 deletions docs/concepts/macros/macro_variables.md
@@ -126,6 +126,8 @@ SQLMesh provides two other predefined variables used to modify model behavior ba
* 'loading' - The project is being loaded into SQLMesh's runtime context.
* 'creating' - The model tables are being created.
* 'evaluating' - The model query logic is being evaluated.
* 'promoting' - The model is being promoted in the target environment (virtual layer update).
* 'auditing' - The audit is being run.
* 'testing' - The model query logic is being evaluated in the context of a unit test.
* @gateway - A string value containing the name of the current [gateway](../../guides/connections.md).
* @this_model - A string value containing the name of the physical table the model view selects from. Typically used to create [generic audits](../audits.md#generic-audits). In the case of [on_virtual_update statements](../models/sql_models.md#optional-on-virtual-update-statements) it contains the qualified view name instead.
6 changes: 5 additions & 1 deletion docs/faq/faq.md
@@ -128,7 +128,11 @@

SQLMesh’s `plan` command is the primary tool for understanding the effects of changes you make to your project. If your project files have changed or are different from the state of an environment, you execute `sqlmesh plan [environment name]` to synchronize the environment's state with your project files. `sqlmesh plan` will generate a summary of the actions needed to implement the changes, automatically run unit tests, and prompt you to `apply` the plan and implement the changes.

If your project files have not changed, you execute `sqlmesh run` to run your project's models and audits. You can execute `sqlmesh run` yourself or with the native [Airflow integration](../integrations/airflow.md). If running it yourself, a sensible approach is to use Linux’s `cron` tool to execute `sqlmesh run` on a cadence at least as frequent as your briefest SQLMesh model `cron` parameter. For example, if your most frequent model’s `cron` is hour, your `cron` tool should execute `sqlmesh run` at least every hour.
If your project files have not changed, you execute `sqlmesh run` to run your project's models and audits.

`sqlmesh run` does not use models, macros, or audits from your local project files. Everything it executes is based on the model, macro, and audit versions currently promoted in the target environment. Those versions are stored in the metadata SQLMesh captures about the state of your environment.

A sensible approach is to use Linux’s `cron` tool to execute `sqlmesh run` on a cadence at least as frequent as your briefest SQLMesh model `cron` parameter. For example, if your most frequent model’s `cron` is hourly, your `cron` tool should execute `sqlmesh run` at least every hour.

??? question "What are start date and end date for?"
SQLMesh uses the ["intervals" approach](https://tobikodata.com/data_load_patterns_101.html) to determine the date ranges that should be included in an incremental by time model query. It divides time into disjoint intervals and tracks which intervals have ever been processed.
20 changes: 20 additions & 0 deletions docs/integrations/engines/clickhouse.md
@@ -394,6 +394,26 @@ If a model has many records in each partition, you may see additional performanc
## Local/Built-in Scheduler
**Engine Adapter Type**: `clickhouse`

## Airflow Scheduler
**Engine Name:** `clickhouse`

To share a common implementation across the local scheduler and Airflow, SQLMesh implements its own ClickHouse hook and operator.

By default, the connection ID is set to `sqlmesh_clickhouse_default`, but can be overridden using the `engine_operator_args` parameter to the `SQLMeshAirflow` instance as in the example below:
```python linenums="1"
from sqlmesh.schedulers.airflow import NO_DEFAULT_CATALOG
from sqlmesh.schedulers.airflow.integration import SQLMeshAirflow

sqlmesh_airflow = SQLMeshAirflow(
    "clickhouse",
    default_catalog=NO_DEFAULT_CATALOG,
    engine_operator_args={
        "sqlmesh_clickhouse_conn_id": "<Connection ID>"
    },
)
```

Note: `NO_DEFAULT_CATALOG` is required for ClickHouse since ClickHouse doesn't support catalogs.

### Connection options

| Option | Description | Type | Required |
10 changes: 6 additions & 4 deletions docs/integrations/engines/databricks.md
@@ -14,9 +14,9 @@ SQLMesh connects to Databricks with the [Databricks SQL Connector](https://docs.

The SQL Connector is bundled with SQLMesh and automatically installed when you include the `databricks` extra in the command `pip install "sqlmesh[databricks]"`.

The SQL Connector has all the functionality needed for SQLMesh to execute SQL models on Databricks and Python models locally (the default SQLMesh approach).
The SQL Connector has all the functionality needed for SQLMesh to execute SQL models on Databricks and Python models that do not return PySpark DataFrames.

The SQL Connector does not support Databricks Serverless Compute. If you require Serverless Compute then you must use the Databricks Connect library.
If you have Python models returning PySpark DataFrames, check out the [Databricks Connect](#databricks-connect-1) section.
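
To make the distinction concrete, here is a minimal sketch of a SQLMesh Python model that returns a PySpark DataFrame and therefore needs Databricks Connect rather than the SQL Connector alone. The model name, source table, and column schema are hypothetical, and it assumes the Spark session is available via `context.spark` once a Spark-capable connection is configured.

```python
import typing as t
from datetime import datetime

from sqlmesh import ExecutionContext, model


# Hypothetical model name, source table, and schema, for illustration only.
@model(
    "analytics.device_events",
    columns={"event_id": "int", "event_ts": "timestamp"},
)
def execute(
    context: ExecutionContext,
    start: datetime,
    end: datetime,
    execution_time: datetime,
    **kwargs: t.Any,
):
    # Building and returning a PySpark DataFrame from the Spark session on the
    # execution context is what requires Databricks Connect; models returning
    # Pandas DataFrames work with the SQL Connector alone.
    return (
        context.spark.table("raw.device_events")
        .where(f"event_ts >= '{start}' AND event_ts < '{end}'")
        .select("event_id", "event_ts")
    )
```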

### Databricks Connect

@@ -229,7 +229,9 @@ If you want Databricks to process PySpark DataFrames in SQLMesh Python models, t

SQLMesh **DOES NOT** include/bundle the Databricks Connect library. You must [install the version of Databricks Connect](https://docs.databricks.com/en/dev-tools/databricks-connect/python/install.html) that matches the Databricks Runtime used in your Databricks cluster.

SQLMesh's Databricks Connect implementation supports Databricks Runtime 13.0 or higher. If SQLMesh detects that you have Databricks Connect installed, then it will use it for all Python models (both Pandas and PySpark DataFrames).
If SQLMesh detects that you have Databricks Connect installed, then it will automatically configure the connection and use it for all Python models that return a Pandas or PySpark DataFrame.

To have databricks-connect installed but ignored by SQLMesh, set `disable_databricks_connect` to `true` in the connection configuration.

Databricks Connect can execute SQL and DataFrame operations on different clusters by setting the SQLMesh `databricks_connect_*` connection options. For example, these options could configure SQLMesh to run SQL on a [Databricks SQL Warehouse](https://docs.databricks.com/sql/admin/create-sql-warehouse.html) while still routing DataFrame operations to a normal Databricks Cluster.
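
As a hedged sketch of that split, the Python `config.py` below points the SQL Connector at a SQL Warehouse while routing DataFrame operations to an all-purpose cluster through Databricks Connect. It assumes the `DatabricksConnectionConfig` class accepts the option names listed in the table below; the hostname, HTTP path, token, and cluster ID are placeholders.

```python
from sqlmesh.core.config import (
    Config,
    DatabricksConnectionConfig,
    GatewayConfig,
    ModelDefaultsConfig,
)

config = Config(
    model_defaults=ModelDefaultsConfig(dialect="databricks"),
    gateways={
        "databricks": GatewayConfig(
            connection=DatabricksConnectionConfig(
                # SQL models run through the SQL Connector against a SQL Warehouse.
                server_hostname="<workspace>.cloud.databricks.com",
                http_path="/sql/1.0/warehouses/<warehouse_id>",
                access_token="<personal_access_token>",
                # DataFrame operations are sent to a normal cluster via Databricks Connect.
                databricks_connect_cluster_id="<cluster_id>",
                # Or set disable_databricks_connect=True to ignore an installed
                # databricks-connect and use the SQL Connector for everything.
            )
        )
    },
)
```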

@@ -259,7 +261,7 @@ The only relevant SQLMesh configuration parameter is the optional `catalog` para
| `databricks_connect_server_hostname` | Databricks Connect Only: Databricks Connect server hostname. Uses `server_hostname` if not set. | string | N |
| `databricks_connect_access_token` | Databricks Connect Only: Databricks Connect access token. Uses `access_token` if not set. | string | N |
| `databricks_connect_cluster_id` | Databricks Connect Only: Databricks Connect cluster ID. Uses `http_path` if not set. Cannot be a Databricks SQL Warehouse. | string | N |
| `databricks_connect_use_serverless` | Databricks Connect Only: Use a serverless cluster for Databricks Connect. If using serverless then SQL connector is disabled since Serverless is not supported for SQL Connector | bool | N |
| `databricks_connect_use_serverless` | Databricks Connect Only: Use a serverless cluster for Databricks Connect instead of `databricks_connect_cluster_id`. | bool | N |
| `force_databricks_connect` | When running locally, force the use of Databricks Connect for all model operations (so don't use SQL Connector for SQL models) | bool | N |
| `disable_databricks_connect` | When running locally, disable the use of Databricks Connect for all model operations (so use SQL Connector for all models) | bool | N |
| `disable_spark_session` | Do not use SparkSession if it is available (like when running in a notebook). | bool | N |
(The remaining changed files in this commit are not shown.)