
Python model - more customization options #1104

@scdmitry

Description


Describe the feature

The current Python model macro is quite restrictive:

Would it be possible to offer more customization (e.g. including `overwriteSchema` as an option in `py_get_writer_options()`)?
Would it also be possible to allow more customization in the way refs are treated?
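For illustration, here is a hedged sketch of how a more permissive writer-options hook could merge user config into defaults. The option names (`mergeSchema`, `overwriteSchema`) are real Delta Lake writer options, but the `build_writer_options` helper itself is hypothetical, not dbt-databricks API:

```python
# Hypothetical sketch: overlaying user-supplied Delta writer options onto
# defaults, the way a more flexible py_get_writer_options() might behave.

DEFAULT_WRITER_OPTIONS = {
    "mergeSchema": "false",
}

def build_writer_options(user_options=None):
    """Return default writer options overlaid with user-config overrides."""
    options = dict(DEFAULT_WRITER_OPTIONS)
    options.update(user_options or {})
    return options

# e.g. a model config could request schema overwrites:
opts = build_writer_options({"overwriteSchema": "true"})
```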

Describe alternatives you've considered

The alternative I went for, for now, is overriding this macro. I don't mind doing that, but I'm wondering whether there could be a better way?

Additional context

Here is the macro I'm using (yes, I essentially just cut everything out and changed `spark.table` to `str`):

{% macro databricks__py_write_table(compiled_code, target_relation) %}

{{ compiled_code }}

dbt = dbtObj(str)
df = model(dbt, spark)

{% endmacro %}
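The key change is passing `str` instead of `spark.table` into `dbtObj`, so that `dbt.ref(...)` yields the relation name as a plain string (usable inside an f-string SQL query) rather than a loaded DataFrame. A minimal stand-in illustrates the difference; the `FakeDbtObj` class and the hard-coded `catalog.schema` resolution below are stubs, not the real generated dbt class:

```python
# Illustrative stub of the dbt object that dbt's generated Python code builds.
# In real generated code the resolved relation names are baked in; here we fake
# one resolved ref to show how the load function changes what dbt.ref returns.

class FakeDbtObj:
    def __init__(self, load_df_function):
        self._load = load_df_function

    def ref(self, name):
        # Pretend the compiler resolved `name` to this fully qualified relation.
        relation_name = f"catalog.schema.{name}"
        return self._load(relation_name)

def fake_spark_table(relation_name):
    # Stand-in for spark.table: would return a DataFrame in real Spark.
    return f"<DataFrame for {relation_name}>"

# Default-style behavior: refs come back as (mock) DataFrames.
dbt_default = FakeDbtObj(fake_spark_table)
# With `str` as the load function, refs come back as plain names for SQL.
dbt_str = FakeDbtObj(str)
```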

And here is my model:

version: 2

models:
  - name: test_model    
    config:
      test_var: "{{ var('test_var') }}"
      materialized: "table"
      submission_method: "serverless_cluster"
      partition_by:
        - test_var
    columns:
      - name: test_var        
        data_type: int
      - name: some_data        
        data_type: int
from datetime import datetime, timedelta


def model(dbt, session):
    test_var = dbt.config.get("test_var", None)
    partition_by = dbt.config.get('partition_by', [])
    
    query = f"""
    select
        '{test_var}' as test_var,
        123 as some_data
    from {dbt.ref('other_sql_model')}
    where test_var = '{test_var}'
    """

    df = session.sql(query)
    target_table = f"{dbt.this.database}.{dbt.this.schema}.{dbt.this.identifier}"
    print(f"Target table: {target_table}")
    (
        df.write
        .mode("overwrite")
        .format("delta")
        .option("partitionOverwriteMode", "dynamic")
        .partitionBy(*partition_by)
        .saveAsTable(target_table)
    )
    return session.table(target_table)
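Since the model only touches `dbt.config`, `dbt.ref`, and `dbt.this`, its query-building logic can be exercised without a Spark session. A small sketch, assuming hypothetical stub objects that mimic just enough of dbt's Python-model interface:

```python
from types import SimpleNamespace

# Hypothetical stubs: mimic dbt.config.get and dbt.ref well enough to check
# the SQL string the model above would assemble. The catalog.schema prefix
# is an assumed resolution, not real dbt output.
config = {"test_var": "42", "partition_by": ["test_var"]}
dbt_stub = SimpleNamespace(
    config=SimpleNamespace(get=lambda key, default=None: config.get(key, default)),
    ref=lambda name: f"catalog.schema.{name}",  # refs resolve to plain names
)

test_var = dbt_stub.config.get("test_var", None)
query = f"""
select
    '{test_var}' as test_var,
    123 as some_data
from {dbt_stub.ref('other_sql_model')}
where test_var = '{test_var}'
"""
```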

Who will this benefit?

I think this customization (and associated documentation) would benefit teams that are considering migrating to dbt but have a lot of legacy Python pipelines. Wrapping those pipelines with dbt via such a simplified macro would allow a "lift and shift" kind of move, followed by gradual evolution of the pipelines to become more dbt-friendly.

Are you interested in contributing this feature?

I am happy to contribute, but I'd like to hear some critique and arguments against this first. Overall, my primary interest is to enable the use of dbt also in cases where the logic is objectively too complex to be migrated to SQL in any reasonable time.

Metadata

Labels

    enhancement (New feature or request)