Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use DBT or SQLMesh to manage SQL queries? #624

Open
nacnudus opened this issue Feb 14, 2024 · 1 comment
Open

Use DBT or SQLMesh to manage SQL queries? #624

nacnudus opened this issue Feb 14, 2024 · 1 comment

Comments

@nacnudus
Copy link
Contributor

nacnudus commented Feb 14, 2024

Currently we manage SQL queries with Makefiles.

  • When one step in a makefile fails, subsequent steps fail (by default)
  • Dependencies between queries must be declared in the makefile as well as the query
  • No column-level lineage, so it is hard to tell the effect of changing one column of one query
  • No local testing - can only be tested in BigQuery itself.

Alternatives

DBT: what most people use

  • Ugly jinja templating
  • No column-level lineage

SQLMesh: a new product

  • Implements everything as a view into a physical table in a different dataset.

Specifics of SQLMesh

  • Is pricing affected by views being defined as SELECT *? No. BigQuery is smart enough to only use the data that is actually used.
  • Permissions? SQLMesh doesn't manage them. In general, SQLMesh assumes that there is only one BigQuery project, whereas we have separate ones for prod, staging and dev. However, we could choose a gateway when invoking a sqlmesh command, to target the project that we need.

Speed

SQLMesh is slow if it stores its state in BigQuery, because BigQuery writes are slow. SQLMesh can be configured to use a different, faster database for its state, e.g. a tiny postgres instance, but that would require a lot more configuration, so we can put up with the slowness until it becomes annoying enough.

Datasets and permissions

We have some existing datasets. In the simple case:

  • publishing-api
  • private
  • public

SQLMesh can be configured to know the schemas of tables in the publishing-api dataset.

When we define a "model" (SQL query) to derive data from the publishing-api dataset and to put the results into the private and public datasets, SQLMesh will automatically create two more datasets, with oddly named tables. For example, if we create a table called public.current_editions, we will end up with:

  • public
    • current_editions: a view defined as SELECT * FROM sqlmesh__public.public__current_editions_123456789
  • sqlmesh__public
    • public__current_editions__123456789: a physical table
    • public__current_editions__123456789__temp: another physical table (empty?)

If we are disciplined about only using SQLMesh for the production 'environment', then it will only create one dataset per dataset that we use, and they will have predictable names, so we could use terraform to authorise all the views in the public dataset to access data in the sqlmesh__public dataset.

If we are ill-disciplined, and use SQLMesh to create other 'environments', then it will create other datasets such as public__dev and public__staging, which will lack permissions to access the data in the dataset sqlmesh__public.

@nacnudus
Copy link
Contributor Author

nacnudus commented Feb 6, 2025

GCP DataFlow is now standard in the team.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant