
add option to partition large models #939

Open

rrbarbosa wants to merge 1 commit into elementary-data:master from rrbarbosa:feat/partition_run_results

Conversation


@rrbarbosa rrbarbosa commented Feb 26, 2026

The dbt_invocations and dbt_run_results tables grow unchecked. Our org has several dbt projects, so processing them for downstream use cases becomes very costly.

This PR adds support for partitioning the data by day, dramatically reducing processing costs.

I've only added support for BigQuery, as that's the adapter we use.

Similar to what was requested here: elementary-data/elementary#1715

Summary by CodeRabbit

  • New Features

    • Added support for partitioning run results tables with configurable partition strategies.
    • Introduced configuration options for enabling partitioned run results and specifying partition criteria.
    • Added BigQuery-specific partitioning defaults using timestamp granularity.
  • Tests

    • Added integration tests for BigQuery partitioned run results functionality.

@github-actions
Contributor

👋 @rrbarbosa
Thank you for raising your pull request.
Please make sure to add tests and document all user-facing changes.
You can do this by editing the docs files in the elementary repository.


coderabbitai bot commented Feb 26, 2026

📝 Walkthrough

This change introduces partitioning support for dbt artifact tables by adding configuration options to enable partitioning of run results and invocations tables by creation timestamp in BigQuery, with integration tests to verify the feature works correctly.
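Based on this description, a user would presumably enable the feature via dbt vars in dbt_project.yml. This is a usage sketch: the var names match the config keys added in get_config_var.sql, while the exact shape of the override is an assumption based on dbt's standard BigQuery partition spec.

```yaml
# dbt_project.yml (usage sketch; var names from this PR's config keys,
# override shape assumed from dbt's BigQuery partition_by spec)
vars:
  partition_run_results: true
  # optional: override the BigQuery default partition spec
  run_results_partition_by:
    field: created_at
    data_type: timestamp
    granularity: day
```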

Changes

  • Integration Tests (integration_tests/tests/test_dbt_artifacts/test_artifacts.py): added two tests for BigQuery targets. test_run_results_partitioned verifies that partitioned run results data is accessible; test_dbt_invocations_partitioned validates that the dbt_invocations table is readable under partitioned conditions.
  • Configuration (macros/edr/system/system_utils/get_config_var.sql): added new config keys partition_run_results (default: false) and run_results_partition_by (default: none). Updated the BigQuery defaults to include a partition spec with field created_at, data type timestamp, and granularity day.
  • dbt Models (models/edr/dbt_artifacts/dbt_run_results.sql, models/edr/dbt_artifacts/dbt_invocations.sql): added conditional partition_by configuration to both models. When partition_run_results is enabled, the run_results_partition_by partition spec is applied; otherwise it defaults to none.
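The conditional wiring described above presumably reduces to something like the following sketch inside each model. The macro namespace and inline-if expression are assumptions; only the key names, defaults, and BigQuery partition spec come from the change summary.

```sql
-- Sketch of the conditional partition_by config described above.
-- `elementary.get_config_var` and the inline-if shape are assumptions;
-- the key names and the BigQuery default come from the change summary.
{{
  config(
    partition_by=elementary.get_config_var("run_results_partition_by")
    if elementary.get_config_var("partition_run_results")
    else none
  )
}}

-- BigQuery default for run_results_partition_by, per the summary:
--   {"field": "created_at", "data_type": "timestamp", "granularity": "day"}
```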

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Poem

🐰 ✨ Partitions fair, by days arranged,
The run results table's been rearranged!
With config flags and conditional care,
BigQuery rows now organize with flair.
Tests hop along to verify all's right,
Our artifacts shine in their new delight! 🌟

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: docstring coverage is 0.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.
✅ Passed checks (2 passed)
  • Description Check ✅ Passed: check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed: the title 'add option to partition large models' directly and clearly describes the main objective of the changeset: adding partitioning support to the dbt_invocations and dbt_run_results models to reduce processing costs.




@coderabbitai coderabbitai bot left a comment


🧹 Nitpick comments (3)
integration_tests/tests/test_dbt_artifacts/test_artifacts.py (2)

177-179: Consider using read_table for consistency with other tests.

While TEST_MODEL is a hardcoded constant (so the static analysis SQL injection warning is a false positive), using read_table would be more consistent with the pattern used in test_dbt_invocations_partitioned and other tests in this file.

♻️ Suggested refactor for consistency
-    results = dbt_project.run_query(
-        """select * from {{ ref("dbt_run_results") }} where name='%s'""" % TEST_MODEL
-    )
-    assert len(results) >= 1
+    dbt_project.read_table(
+        "dbt_run_results", where=f"name = '{TEST_MODEL}'", raise_if_empty=True
+    )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@integration_tests/tests/test_dbt_artifacts/test_artifacts.py` around lines
177 - 179, Replace the raw SQL call to dbt_project.run_query that embeds
TEST_MODEL with the consistent helper method read_table used elsewhere (e.g., in
test_dbt_invocations_partitioned): call dbt_project.read_table or the test
file's read_table helper to query dbt_run_results filtered by TEST_MODEL instead
of using string interpolation; update the line using dbt_project.run_query(...)
to use read_table with the same filter so the test follows the established
pattern and avoids the apparent SQL-injection style interpolation.

170-191: Consider verifying partitioning was actually applied.

The tests verify that data is readable after enabling partitioning, which is a good smoke test. For more confidence, you could add an assertion that the table is actually partitioned by querying BigQuery's INFORMATION_SCHEMA.PARTITIONS or TABLE_OPTIONS.

Example verification query
# After the run, verify the table is partitioned:
partition_info = dbt_project.run_query(
    """
    SELECT option_value 
    FROM `{{ ref("dbt_run_results").database }}`.`{{ ref("dbt_run_results").schema }}`.INFORMATION_SCHEMA.TABLE_OPTIONS
    WHERE table_name = 'dbt_run_results' AND option_name = 'partition_expiration_days'
    """
)
# Or check PARTITIONS table for partition existence
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@integration_tests/tests/test_dbt_artifacts/test_artifacts.py` around lines
170 - 191, Add an explicit assertion that the BigQuery tables are actually
partitioned after enabling partition_run_results in the tests
test_run_results_partitioned and test_dbt_invocations_partitioned: after calling
dbt_project.dbt_runner.run (and after dbt_project.read_table where appropriate),
run a query via dbt_project.run_query against BigQuery's
INFORMATION_SCHEMA.TABLE_OPTIONS or the PARTITIONS view for the dbt_run_results
table (use the referenced table name via {{ ref("dbt_run_results") }} or
TEST_MODEL) and assert the expected partition option or presence of partitions
(e.g., option_name like 'partition_expiration_days' or non-empty PARTITION rows)
to ensure partitioning was applied.
macros/edr/system/system_utils/get_config_var.sql (1)

85-98: Partitioning is silently ignored on non-BigQuery adapters.

When partition_run_results is enabled on adapters other than BigQuery, run_results_partition_by remains none, so partitioning won't actually occur. This could confuse users who expect the feature to work.

Consider either:

  1. Adding default partition specs for other adapters that support partitioning (Snowflake, Databricks)
  2. Logging a warning when partition_run_results=true but run_results_partition_by is none
  3. Documenting that this feature currently only works on BigQuery
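Option 2 could be sketched as a small Jinja guard using dbt's built-in `exceptions.warn` and `target.type`. The helper name and placement are illustrative, not part of this PR:

```sql
{% macro warn_if_partitioning_ignored() %}
  {# Illustrative helper: surface a warning when partitioning is requested
     but no partition spec was resolved for the current adapter. #}
  {% if elementary.get_config_var("partition_run_results")
        and not elementary.get_config_var("run_results_partition_by") %}
    {% do exceptions.warn(
        "partition_run_results is enabled but no run_results_partition_by "
        ~ "default exists for adapter '" ~ target.type
        ~ "'; run results tables will not be partitioned."
    ) %}
  {% endif %}
{% endmacro %}
```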
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@macros/edr/system/system_utils/get_config_var.sql` around lines 85 - 98, The
current macro bigquery__get_default_config only sets run_results_partition_by
for BigQuery, which means partition_run_results can be true but ignored for
other adapters; update the default handling so that when partition_run_results
is true and run_results_partition_by is none you either (a) set sensible
defaults for other partitioning-capable adapters (e.g., Snowflake/Databricks) by
adding adapter-specific branches that populate run_results_partition_by, or (b)
emit a clear warning/log when partition_run_results=true but
run_results_partition_by remains none to inform users; locate the logic around
default__get_default_config, bigquery__get_default_config and the keys
'partition_run_results'/'run_results_partition_by' to apply the chosen fix.

ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Disabled knowledge base sources:

  • Linear integration is disabled

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between f0307f3 and 8ca28e1.

📒 Files selected for processing (4)
  • integration_tests/tests/test_dbt_artifacts/test_artifacts.py
  • macros/edr/system/system_utils/get_config_var.sql
  • models/edr/dbt_artifacts/dbt_invocations.sql
  • models/edr/dbt_artifacts/dbt_run_results.sql

@rrbarbosa (Author)

About the bot comments:

  • Docstrings? Don't seem applicable here.
  • The tests seem consistent with what's already in the repo.
  • Adding a verification query as part of the test seems like a bad idea to me; I've checked this manually while using the provided test project. There's no way for me to test other adapters.
  • On the setting being silently ignored on other adapters: fair point. But I could not find the docs for these settings anywhere, and there's no way for me to implement/test other adapters.

@haritamar (Collaborator)

Hi @rrbarbosa - thanks for your contribution!
I'm wondering - should we just always set partition fields for BigQuery? Is there a reason users won't want this / to customize it?
What are you setting for these fields? I'm assuming by created_at or some other timestamp?

Also, not a must, but is it possible to verify in the test that the table is really partitioned?

