Table materialization (V2) leads to unavailability of data #1102

@Jeremynadal33

Description

Describe the bug

TL;DR: stop using `create or replace` followed by `insert` when the table already exists (with valid data).

Two problems since materialization V2 for "table":

  • The table materialization runs two steps (`create or replace` + `insert`), which opens a window where the data is unavailable.
  • Constraints can now be applied beforehand (thanks a lot!), but if new data arriving afterwards violates those constraints, the table has already been recreated, so all the existing data is removed.
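As I understand it, the two steps amount to something like the following pair of independent commands (a sketch; the catalog/schema/table names are just the ones from the repro below):

```sql
-- Step 1: recreate the table; the previous data is gone from this point on
create or replace table my_catalog.my_schema.toto (
  name string not null  -- contract constraints applied up front
);

-- Step 2: load the new data in a separate command; if this insert
-- violates a constraint, the table is left empty
insert into my_catalog.my_schema.toto
select 'toto' as name;
```

Between step 1 and step 2, and permanently if step 2 fails, readers of the table see no rows.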

The underlying problem is that Delta does not handle transactions across multiple commands... One way to overcome the second problem would be to add a clause that puts the data back using RESTORE, but that still leaves the first problem.
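Such a recovery clause could look roughly like this (hypothetical sketch using Delta's RESTORE; the version number is illustrative and would have to be looked up from the table history):

```sql
-- Inspect the table history to find the version just before the
-- create or replace wiped the data
describe history my_catalog.my_schema.toto;

-- Roll back to the last version that still contained the valid data
restore table my_catalog.my_schema.toto to version as of 2;
```

This restores the old rows after a failed run, but it does not remove the unavailability window between the `create or replace` and the failed `insert`.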

Then I thought of using the insert_overwrite incremental strategy, which replaces all data in a single transaction and therefore would not remove existing valid data in a constraint-violation scenario. But the strategy isn't complete yet (not possible to add columns, if I read correctly) and it is unavailable on SQL warehouses. I don't understand why, because this would be a valid case for completely overwriting the table; maybe it should only have been a warning in the logs?

I think using insert_overwrite (without a partition predicate) in the table materialization when the relation already exists in the database would be an option, but it may lead to complications (what if a column type changes?). Happy to discuss!
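Concretely, when the relation already exists the materialization could replace the two-step sequence with a single atomic statement (a sketch, again reusing the repro's table name):

```sql
-- Single transaction: atomically replaces all rows, so a constraint
-- violation rolls back and leaves the previous data intact, and
-- readers never observe an empty table
insert overwrite my_catalog.my_schema.toto
select 'toto' as name;
```

The open question is what to do when the new select's schema differs from the existing table's (added or retyped columns), which is where the complications above come in.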

Steps To Reproduce

Create a model `toto.sql`:

{{
  config(
    materialized = 'table',
    )
}}

{% if load_relation(this) %} {# meaning it exists in the database #}
  select null as name
  union all
  select 'toto' as name
{% else %} {# meaning it does not exist in the database #}
  select 'toto' as name
{% endif %}

And its corresponding `toto.yml`:

models:
  - name: toto
    config:
      contract:
        enforced: true
    columns:
      - name: name
        data_type: string
        constraints:
          - type: not_null

Then run:
dbt run -s toto --> succeeds; select * from my_catalog.my_schema.toto returns 1 row
dbt run -s toto --> fails; select * from my_catalog.my_schema.toto returns 0 rows

Expected behavior

I want my valid data not to be removed, and zero downtime for my table.

System information

The output of dbt --version:

Core:
  - installed: 1.10.2
  - latest:    1.10.5 - Update available!

  Your version of dbt-core is out of date!
  You can find instructions for upgrading here:
  https://docs.getdbt.com/docs/installation

Plugins:
  - databricks: 1.10.4 - Up to date!
  - spark:      1.9.2  - Update available!

  At least one plugin is out of date with dbt-core.
  You can find instructions for upgrading here:
  https://docs.getdbt.com/docs/installation

The operating system you're using:

Python 3.13.5

Additional context

Thanks everyone for all the good work! 🙏 I would be glad to contribute once we decide how to handle this problem. Don't hesitate to reach out; we can even schedule a call if you have time for this.

Labels: bug