Describe the feature
The current Python model macro is quite restrictive:
- The model is forced to do a full overwrite: https://github.com/databricks/dbt-databricks/blob/main/dbt/include/databricks/macros/adapters/python.sql#L42
- The macro hardcodes the interface used to read refs.
Would it be possible to offer more customization, e.g. including overwriteSchema as an option in py_get_writer_options()?
Would it also be possible to allow more customization of the way refs are treated?
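For illustration, here is a rough sketch of what a config-driven write could look like if the writer options were opened up. The config keys overwrite_schema and partition_overwrite_mode are hypothetical names I'm using for the example, not existing adapter options:

def write_with_configurable_options(df, dbt):
    # Hypothetical sketch only: "overwrite_schema" and
    # "partition_overwrite_mode" are illustrative config keys, not part
    # of the current dbt-databricks adapter.
    target_table = f"{dbt.this.database}.{dbt.this.schema}.{dbt.this.identifier}"
    writer = df.write.mode("overwrite").format("delta")
    if dbt.config.get("overwrite_schema", False):
        # Delta option that allows replacing the table schema on overwrite
        writer = writer.option("overwriteSchema", "true")
    partition_overwrite_mode = dbt.config.get("partition_overwrite_mode", None)
    if partition_overwrite_mode:
        # e.g. "dynamic" replaces only the partitions present in df
        writer = writer.option("partitionOverwriteMode", partition_overwrite_mode)
    partition_by = dbt.config.get("partition_by", [])
    if partition_by:
        writer = writer.partitionBy(*partition_by)
    writer.saveAsTable(target_table)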
Describe alternatives you've considered
The alternative I went with for now is overriding this macro (dbt's dispatch picks up a databricks__py_write_table macro defined in the project's own macros/ directory in preference to the adapter's version). I don't mind this approach, but I'm wondering whether there could be a better way?
Additional context
Here is the macro I'm using (yes, I essentially just cut everything out and changed spark.table to str):
{% macro databricks__py_write_table(compiled_code, target_relation) %}
{{ compiled_code }}
dbt = dbtObj(str)
df = model(dbt, spark)
{% endmacro %}
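To make the effect of that one-line change concrete, here is a small standalone sketch; DbtObj is a stand-in for the dbtObj class that dbt generates into the compiled code, and the name resolution is faked:

class DbtObj:
    def __init__(self, load_df_function):
        # dbt's generated dbtObj is constructed the same way: it wraps
        # whatever load function the write-table macro passes in.
        self.load = load_df_function

    def ref(self, model_name):
        # Stand-in for dbt's real resolution of a ref to a relation name
        resolved_name = f"catalog.schema.{model_name}"
        return self.load(resolved_name)

# Stock macro: dbtObj(spark.table) -> dbt.ref() eagerly loads a DataFrame.
# Trimmed macro: dbtObj(str) -> dbt.ref() returns the table name as a
# string, ready to interpolate into handwritten SQL.
dbt = DbtObj(str)
print(dbt.ref("other_sql_model"))  # catalog.schema.other_sql_model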
And here is my model, starting with the YAML config:
version: 2

models:
  - name: test_model
    config:
      test_var: "{{ var('test_var') }}"
      materialized: "table"
      submission_method: "serverless_cluster"
      partition_by:
        - test_var
    columns:
      - name: test_var
        data_type: int
      - name: some_data
        data_type: int
...and the Python code:

from datetime import datetime, timedelta


def model(dbt, session):
    test_var = dbt.config.get("test_var", None)
    partition_by = dbt.config.get("partition_by", [])
    # With the trimmed macro, dbt.ref() returns the table name as a string,
    # so it can be interpolated directly into the query
    query = f"""
        select
            '{test_var}' as test_var,
            123 as some_data
        from {dbt.ref('other_sql_model')}
        where test_var = '{test_var}'
    """
    df = session.sql(query)
    target_table = f"{dbt.this.database}.{dbt.this.schema}.{dbt.this.identifier}"
    print(f"Target table: {target_table}")
    (
        df.write
        .mode("overwrite")
        .format("delta")
        # Only replace the partitions present in df, not the whole table
        .option("partitionOverwriteMode", "dynamic")
        .partitionBy(*partition_by)
        .saveAsTable(target_table)
    )
    # dbt expects the model function to return a DataFrame
    return session.table(target_table)
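As a side note on the write settings: with partitionOverwriteMode set to "dynamic", the overwrite only replaces the partitions present in the incoming DataFrame, while the default static mode would truncate the whole table first. A minimal sketch of the difference, assuming a Spark session with Delta Lake available (the table name partition_demo is made up):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# First load creates the table with two partitions (test_var = 1 and 2).
day1 = spark.createDataFrame([(1, 100), (2, 200)], ["test_var", "some_data"])
(
    day1.write
    .mode("overwrite")
    .format("delta")
    .partitionBy("test_var")
    .saveAsTable("partition_demo")
)

# Re-run for test_var = 1 only: with dynamic mode, partition 2 survives.
rerun = spark.createDataFrame([(1, 101)], ["test_var", "some_data"])
(
    rerun.write
    .mode("overwrite")
    .format("delta")
    .option("partitionOverwriteMode", "dynamic")
    .partitionBy("test_var")
    .saveAsTable("partition_demo")
)

spark.table("partition_demo").show()  # rows for test_var 1 (updated) and 2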
Who will this benefit?
I think this customization (and the associated documentation) would benefit teams that are considering migrating to dbt but have a lot of legacy Python pipelines. Wrapping those pipelines with dbt via such a simplified macro would allow a "lift and shift" kind of move, followed by a gradual evolution of the pipelines to become more dbt-friendly.
Are you interested in contributing this feature?
I am happy to contribute, but I'd first like to hear some critique and arguments against this. Overall, my primary interest is to enable the use of dbt even in cases where the logic is objectively too complex to be migrated to SQL in any reasonable time.