Use DBT or SQLMesh to manage SQL queries? #624

nacnudus · 2024-02-14T14:43:01Z

Currently we manage SQL queries with Makefiles.

When one step in a makefile fails, subsequent steps fail (by default)
Dependencies between queries must be declared in the makefile as well as the query
No column-level lineage, so it is hard to tell the effect of changing one column of one query
No local testing - can only be tested in BigQuery itself.

Alternatives

DBT: what most people use

Ugly jinja templating
No column-level lineage

SQLMesh: a new product

Implements everything as a view into a physical table in a different dataset.

Specifics of SQLMesh

Is pricing affected by views being defined as SELECT *? No. BigQuery is smart enough to only use the data that is actually used.
Permissions? SQLMesh doesn't manage them. In general, SQLMesh assumes that there is only one BigQuery project, whereas we have separate ones for prod, staging and dev. However, we could choose a gateway when invoking a sqlmesh command, to target the project that we need.

Speed

SQLMesh is slow if it stores its state in BigQuery, because BigQuery writes are slow. SQLMesh can be configured to use a different, faster database for its state, e.g. a tiny postgres instance, but that would require a lot more configuration, so we can put up with the slowness until it becomes annoying enough.

Datasets and permissions

We have some existing datasets. In the simple case:

publishing-api
private
public

SQLMesh can be configured to know the schemas of tables in the publishing-api dataset.

When we define a "model" (SQL query) to derive data from the publishing-api dataset and to put the results into the private and public datasets, SQLMesh will automatically create two more datasets, with oddly named tables. For example, if we create a table called public.current_editions, we will end up with:

public
- current_editions: a view defined as SELECT * FROM sqlmesh__public.public__current_editions_123456789
sqlmesh__public
- public__current_editions__123456789: a physical table
- public__current_editions__123456789__temp: another physical table (empty?)

If we are disciplined about only using SQLMesh for the production 'environment', then it will only create one dataset per dataset that we use, and they will have predictable names, so we could use terraform to authorise all the views in the public dataset to access data in the sqlmesh__public dataset.

If we are ill-disciplined, and use SQLMesh to create other 'environments', then it will create other datasets such as public__dev and public__staging, which will lack permissions to access the data in the dataset sqlmesh__public.

The text was updated successfully, but these errors were encountered:

nacnudus · 2025-02-06T23:09:14Z

GCP DataFlow is now standard in the team.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use DBT or SQLMesh to manage SQL queries? #624

Use DBT or SQLMesh to manage SQL queries? #624

nacnudus commented Feb 14, 2024 •

edited

Loading

nacnudus commented Feb 6, 2025

Use DBT or SQLMesh to manage SQL queries? #624

Use DBT or SQLMesh to manage SQL queries? #624

Comments

nacnudus commented Feb 14, 2024 • edited Loading

Alternatives

DBT: what most people use

SQLMesh: a new product

Specifics of SQLMesh

Speed

Datasets and permissions

nacnudus commented Feb 6, 2025

nacnudus commented Feb 14, 2024 •

edited

Loading