add option to partition large models #939

rrbarbosa wants to merge 1 commit into elementary-data:master from
Conversation
👋 @rrbarbosa
📝 Walkthrough

This change introduces partitioning support for dbt artifact tables by adding configuration options to enable partitioning of the run results and invocations tables by creation timestamp in BigQuery, with integration tests to verify the feature works correctly.
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~12 minutes
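Per the walkthrough, the feature is toggled through new configuration vars. A minimal sketch of what enabling it might look like in `dbt_project.yml`, assuming the var name `partition_run_results` from the review comments; the exact scoping/location is an assumption, not confirmed by the PR:

```yaml
# dbt_project.yml (sketch) - var name taken from the review comments;
# whether it must be nested under a package-specific scope is an assumption.
vars:
  partition_run_results: true
```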
🚥 Pre-merge checks | ✅ 2 passed | ❌ 1 failed

❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
🧹 Nitpick comments (3)
integration_tests/tests/test_dbt_artifacts/test_artifacts.py (2)
177-179: Consider using read_table for consistency with other tests.

While TEST_MODEL is a hardcoded constant (so the static analysis SQL injection warning is a false positive), using read_table would be more consistent with the pattern used in test_dbt_invocations_partitioned and other tests in this file.

♻️ Suggested refactor for consistency

```diff
-    results = dbt_project.run_query(
-        """select * from {{ ref("dbt_run_results") }} where name='%s'""" % TEST_MODEL
-    )
-    assert len(results) >= 1
+    dbt_project.read_table(
+        "dbt_run_results", where=f"name = '{TEST_MODEL}'", raise_if_empty=True
+    )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@integration_tests/tests/test_dbt_artifacts/test_artifacts.py` around lines 177 - 179, Replace the raw SQL call to dbt_project.run_query that embeds TEST_MODEL with the consistent helper method read_table used elsewhere (e.g., in test_dbt_invocations_partitioned): call dbt_project.read_table or the test file's read_table helper to query dbt_run_results filtered by TEST_MODEL instead of using string interpolation; update the line using dbt_project.run_query(...) to use read_table with the same filter so the test follows the established pattern and avoids the apparent SQL-injection style interpolation.
170-191: Consider verifying partitioning was actually applied.

The tests verify that data is readable after enabling partitioning, which is a good smoke test. For more confidence, you could add an assertion that the table is actually partitioned by querying BigQuery's INFORMATION_SCHEMA.PARTITIONS or TABLE_OPTIONS.

Example verification query

```python
# After the run, verify the table is partitioned:
partition_info = dbt_project.run_query(
    """
    SELECT option_value
    FROM `{{ ref("dbt_run_results").database }}`.`{{ ref("dbt_run_results").schema }}`.INFORMATION_SCHEMA.TABLE_OPTIONS
    WHERE table_name = 'dbt_run_results'
      AND option_name = 'partition_expiration_days'
    """
)
# Or check PARTITIONS table for partition existence
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@integration_tests/tests/test_dbt_artifacts/test_artifacts.py` around lines 170 - 191, Add an explicit assertion that the BigQuery tables are actually partitioned after enabling partition_run_results in the tests test_run_results_partitioned and test_dbt_invocations_partitioned: after calling dbt_project.dbt_runner.run (and after dbt_project.read_table where appropriate), run a query via dbt_project.run_query against BigQuery's INFORMATION_SCHEMA.TABLE_OPTIONS or the PARTITIONS view for the dbt_run_results table (use the referenced table name via {{ ref("dbt_run_results") }} or TEST_MODEL) and assert the expected partition option or presence of partitions (e.g., option_name like 'partition_expiration_days' or non-empty PARTITION rows) to ensure partitioning was applied.

macros/edr/system/system_utils/get_config_var.sql (1)
85-98: Partitioning is silently ignored on non-BigQuery adapters.

When partition_run_results is enabled on adapters other than BigQuery, run_results_partition_by remains none, so partitioning won't actually occur. This could confuse users who expect the feature to work. Consider either:

- Adding default partition specs for other adapters that support partitioning (Snowflake, Databricks)
- Logging a warning when partition_run_results=true but run_results_partition_by is none
- Documenting that this feature currently only works on BigQuery
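The warning option above could be sketched as a Jinja guard. This is a sketch only: the variable names follow the review comment, and where exactly such a guard would live in get_config_var.sql is an assumption; `exceptions.warn` is dbt's built-in warning hook:

```jinja
{# Sketch: warn when partitioning is requested but no adapter-specific
   partition spec was resolved. Variable names follow the review comment;
   the placement inside get_config_var.sql is an assumption. #}
{% if partition_run_results and run_results_partition_by is none %}
  {% do exceptions.warn(
    "partition_run_results is enabled, but no partition spec exists for this adapter; partitioning will be skipped."
  ) %}
{% endif %}
```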
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@macros/edr/system/system_utils/get_config_var.sql` around lines 85 - 98, The current macro bigquery__get_default_config only sets run_results_partition_by for BigQuery, which means partition_run_results can be true but ignored for other adapters; update the default handling so that when partition_run_results is true and run_results_partition_by is none you either (a) set sensible defaults for other partitioning-capable adapters (e.g., Snowflake/Databricks) by adding adapter-specific branches that populate run_results_partition_by, or (b) emit a clear warning/log when partition_run_results=true but run_results_partition_by remains none to inform users; locate the logic around default__get_default_config, bigquery__get_default_config and the keys 'partition_run_results'/'run_results_partition_by' to apply the chosen fix.
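The partition verification suggested above can be made concrete with a small helper that inspects rows returned from BigQuery's INFORMATION_SCHEMA.PARTITIONS view. Everything here (the function name and the assumed row shape) is illustrative and not part of the PR:

```python
# Hypothetical helper (names are illustrative, not from the PR): given rows
# returned from BigQuery's INFORMATION_SCHEMA.PARTITIONS for one table,
# assert that at least one real day partition exists.
def assert_day_partitioned(partition_rows):
    partition_ids = {row["partition_id"] for row in partition_rows}
    # BigQuery reports pseudo-partitions like "__NULL__" and
    # "__UNPARTITIONED__"; ignore those when checking.
    real_partitions = {p for p in partition_ids if p and not p.startswith("__")}
    assert real_partitions, "no day partitions found - partitioning not applied"
    return sorted(real_partitions)
```

In a test, the rows would come from a query like the one sketched in the comment above, e.g. `assert_day_partitioned(dbt_project.run_query(...))`.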
ℹ️ Review info

Configuration used: defaults | Review profile: CHILL | Plan: Pro
Disabled knowledge base sources: Linear integration
📒 Files selected for processing (4)

- integration_tests/tests/test_dbt_artifacts/test_artifacts.py
- macros/edr/system/system_utils/get_config_var.sql
- models/edr/dbt_artifacts/dbt_invocations.sql
- models/edr/dbt_artifacts/dbt_run_results.sql
about the bot comments:
Hi @rrbarbosa - thanks for your contribution! Also, not a must, but is it possible to verify in the test that the table is really partitioned?
The dbt_invocations and dbt_run_results tables grow without any size check. Our org has several dbt projects, so processing them for downstream use cases becomes very costly.

This PR adds support for partitioning the data by day, reducing processing costs dramatically.

I've only added support for BigQuery, as that's the adapter we use.
Similar to what was requested here: elementary-data/elementary#1715
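Partitioning by day reduces cost because BigQuery prunes partitions when a query filters on the partition column. A hedged illustration; the project/dataset path and the timestamp column name are assumptions, not taken from the PR:

```sql
-- Illustration only: table path and 'created_at' column are assumptions.
-- With daily partitioning, this scans roughly one day of data
-- rather than the whole table.
SELECT *
FROM `my_project.elementary.dbt_run_results`
WHERE created_at >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
```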
Summary by CodeRabbit

- New Features
- Tests