feat: add weekly cleanup workflow for stale CI schemas (#943)
Conversation
Add a dbt macro (elementary.drop_stale_ci_schemas) with per-adapter dispatch that lists all schemas matching CI prefixes, parses the embedded YYMMDD_HHMMSS timestamp, and drops schemas older than a configurable threshold (default 24h). Add a GitHub Actions workflow that runs weekly (Sunday 03:00 UTC) against each cloud warehouse target. Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
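The schema-age logic described above can be sketched in plain Python (a hedged sketch only; the real implementation is a dbt Jinja macro, and the prefix list and function name here are illustrative):

```python
import re
from datetime import datetime, timedelta, timezone

# Matches e.g. "dbt_250114_031500_my_test" and captures "250114_031500"
TS_RE = re.compile(r"^(?:dbt_|py_)(\d{6}_\d{6})")

def is_stale(schema_name: str, now: datetime, max_age_hours: int = 24) -> bool:
    """Return True if the schema embeds a YYMMDD_HHMMSS timestamp older than the threshold."""
    m = TS_RE.match(schema_name)
    if not m:
        return False  # no CI prefix/timestamp -> never drop
    try:
        ts = datetime.strptime(m.group(1), "%y%m%d_%H%M%S").replace(tzinfo=timezone.utc)
    except ValueError:
        return False  # malformed timestamp -> skip rather than abort
    return now - ts > timedelta(hours=max_age_hours)

now = datetime(2025, 1, 15, 12, 0, tzinfo=timezone.utc)
print(is_stale("dbt_250114_031500_ci", now))  # older than 24h -> True
print(is_stale("dbt_250115_060000_ci", now))  # 6h old -> False
```

Non-matching names (no CI prefix, or a garbled timestamp) are treated as keep, which is the safe default for a destructive job.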
📝 Walkthrough

Adds a weekly/manual GitHub Actions workflow to drop stale CI schemas and introduces dbt macros and tests to list schemas, detect schema timestamps, check existence across adapters (including ClickHouse), and perform cleanup by age and prefix.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant GH as GitHub Actions
    participant CI as dbt runner (in workflow)
    participant PKG as local elementary package
    participant DB as Warehouse (Snowflake/BigQuery/Redshift/Databricks/Athena)
    GH->>CI: checkout repo & setup Python/dbt
    CI->>CI: install adapter packages & generate profiles
    CI->>PKG: symlink local package into test project
    CI->>CI: dbt deps install
    CI->>DB: dbt run-operation drop_stale_ci_schemas(prefixes, max_age_hours) [matrix warehouse]
    DB-->>CI: execute drops / return results
    CI-->>GH: action completes with operation output
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: ✅ Passed (3 passed)
Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
Creates test schemas with old/recent timestamps, runs the cleanup macro, and verifies old schemas are dropped while recent ones are kept. Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
Actionable comments posted: 2
🧹 Nitpick comments (4)
macros/utils/ci_cleanup/drop_stale_ci_schemas.sql (2)
77-80: Consider quoting the schema name in the SQL statement.

While `schema_name` originates from `adapter.list_schemas()` and is filtered by prefix/timestamp patterns, using unquoted identifiers in raw SQL is risky if a schema name ever contains special characters.

🛡️ Suggested fix with proper quoting

```diff
 {% macro clickhouse__drop_ci_schema(database, schema_name) %}
-    {% do run_query("DROP DATABASE IF EXISTS " ~ schema_name) %}
+    {% do run_query("DROP DATABASE IF EXISTS `" ~ schema_name ~ "`") %}
     {% do adapter.commit() %}
 {% endmacro %}
```

ClickHouse uses backticks for identifier quoting. Alternatively, check if the ClickHouse adapter provides an identifier quoting utility.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@macros/utils/ci_cleanup/drop_stale_ci_schemas.sql` around lines 77 - 80, The DROP DATABASE statement in macro clickhouse__drop_ci_schema uses an unquoted schema_name; update the run_query call to quote the identifier (e.g., wrap schema_name in backticks) or use the adapter's identifier-quoting helper if available, so run_query("DROP DATABASE IF EXISTS " ~ quoted_schema_name) is executed and then call adapter.commit() as before; reference the macro clickhouse__drop_ci_schema, the schema_name variable, the run_query invocation, and adapter.commit() when making the change.
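As a plain-Python illustration of the quoting concern above (a hypothetical helper, not part of the package; rather than guess at ClickHouse's escaping rules for backticks inside identifiers, it simply rejects such names):

```python
def quote_clickhouse_identifier(name: str) -> str:
    """Wrap an identifier in backticks so that names with special characters
    stay a single identifier. Refuse embedded backticks instead of guessing
    the engine's escaping rules."""
    if "`" in name:
        raise ValueError(f"refusing to quote identifier containing a backtick: {name!r}")
    return f"`{name}`"

# Building the DROP statement from the review comment:
schema = "dbt_250114_031500_ci"
sql = "DROP DATABASE IF EXISTS " + quote_clickhouse_identifier(schema)
print(sql)  # DROP DATABASE IF EXISTS `dbt_250114_031500_ci`
```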
48-51: No error handling for individual schema drop failures.

If any `drop_ci_schema` call fails (e.g., due to permissions, locks, or concurrent operations), the entire cleanup aborts. For a maintenance job, you may want to continue cleaning other schemas and report failures at the end.

This is noted in the PR description as a known limitation. Consider whether continuing on errors would be preferable for this use case.
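Jinja has no native try/except, but the continue-and-report pattern this comment suggests looks like the following in plain Python (names such as `drop_fn` are illustrative, not from the package):

```python
def drop_stale_schemas(schemas, drop_fn, log=print):
    """Attempt every drop; collect failures instead of aborting the run
    on the first error, then report a summary at the end."""
    dropped, failed = 0, []
    for name in schemas:
        try:
            drop_fn(name)
            dropped += 1
        except Exception as exc:  # keep going; report at the end
            failed.append((name, str(exc)))
            log(f"  failed to drop {name}: {exc}")
    log(f"cleanup done: {dropped} dropped, {len(failed)} failed")
    return dropped, failed

def fake_drop(name):
    # simulate one schema that cannot be dropped
    if name == "dbt_locked":
        raise RuntimeError("schema is locked")

dropped, failed = drop_stale_schemas(["dbt_a", "dbt_locked", "dbt_b"], fake_drop)
```

The run completes with `dropped == 2` and one recorded failure instead of aborting on `dbt_locked`.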
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@macros/utils/ci_cleanup/drop_stale_ci_schemas.sql` around lines 48 - 51, The loop currently calls elementary.drop_ci_schema directly so any failure aborts the whole run; change it to attempt each drop inside a guarded block (wrap the call to drop_ci_schema in a try/catch-equivalent Jinja construct), only increment ns.dropped on success, and on failure increment a new ns.failed (or ns.errors) counter and call log(...) with the schema_name and the captured error details so the job continues cleaning other schemas; at the end of the macro emit a summary using ns.dropped and ns.failed to report which drops succeeded and which failed.

.github/workflows/cleanup-stale-schemas.yml (2)
42-46: Consider pinning dbt-core version for reproducibility.

Installing `dbt-core` without a version constraint means the workflow may behave differently as new versions are released. For a maintenance workflow, this is lower risk than for CI tests, but pinning ensures predictable behavior.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.github/workflows/cleanup-stale-schemas.yml around lines 42 - 46, The workflow step named "Install dbt" currently installs "dbt-core" without a version, causing non-reproducible installs; update the run command that installs "dbt-core" (the pip install line referencing "dbt-core" and the matrix.warehouse-type expansion) to pin a specific dbt-core version (e.g., "dbt-core==<desired_version>") and ensure any dynamically chosen adapter package strings remain compatible with that pinned version.
66-71: Workflow uses only `["dbt_"]` prefix while macro defaults to `['dbt_', 'py_']`.

The macro's default prefixes include both `dbt_` and `py_`, but the workflow explicitly passes only `["dbt_"]`. This may be intentional, but if `py_` prefixed schemas also need cleanup, consider including it:

```diff
-          --args '{prefixes: ["dbt_"], max_age_hours: ${{ inputs.max-age-hours || '24' }}}'
+          --args '{prefixes: ["dbt_", "py_"], max_age_hours: ${{ inputs.max-age-hours || '24' }}}'
```

Or document why only `dbt_` schemas are cleaned in this workflow.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In @.github/workflows/cleanup-stale-schemas.yml around lines 66 - 71, The workflow step calling the macro elementary.drop_stale_ci_schemas passes --args with prefixes: ["dbt_"] while the macro defaults to ['dbt_', 'py_']; update the workflow run arguments to include both prefixes (e.g., prefixes: ["dbt_", "py_"]) so py_ schemas are also dropped, or alternatively add a brief comment in the workflow explaining why only "dbt_" is passed if that was intentional; target the run invocation that builds the --args payload for elementary.drop_stale_ci_schemas.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@macros/utils/ci_cleanup/drop_stale_ci_schemas.sql`:
- Around line 45-46: The current call to
modules.datetime.datetime.strptime(ts_str, '%y%m%d_%H%M%S') can raise ValueError
for malformed dates; before calling strptime (inside the ns_valid.ok branch
where schema_ts is set), validate the numeric ranges of the parsed components
from ts_str (year/day/month/hour/minute/second substrings) — e.g., month 01-12,
day 01-31, hour 00-23, minute/second 00-59 — and only call
modules.datetime.datetime.strptime to create schema_ts when those checks pass;
if validation fails, skip that schema (do not call strptime) so malformed ts_str
values don’t abort the cleanup run.
- Line 23: The code sets now = modules.datetime.datetime.utcnow(), which uses a
deprecated naive UTC; change this to a timezone-aware UTC datetime by calling
modules.datetime.datetime.now(modules.datetime.timezone.utc) (or equivalent)
when assigning now in the macro, and remove any .replace(tzinfo=None) usage so
timezone info is preserved; also audit places that compare now to other
datetimes (e.g., functions referencing now) and update those comparisons to use
timezone-aware datetimes so no naive vs aware comparisons occur.
---
Nitpick comments:
In @.github/workflows/cleanup-stale-schemas.yml:
- Around line 42-46: The workflow step named "Install dbt" currently installs
"dbt-core" without a version, causing non-reproducible installs; update the run
command that installs "dbt-core" (the pip install line referencing "dbt-core"
and the matrix.warehouse-type expansion) to pin a specific dbt-core version
(e.g., "dbt-core==<desired_version>") and ensure any dynamically chosen adapter
package strings remain compatible with that pinned version.
- Around line 66-71: The workflow step calling the macro
elementary.drop_stale_ci_schemas passes --args with prefixes: ["dbt_"] while the
macro defaults to ['dbt_', 'py_']; update the workflow run arguments to include
both prefixes (e.g., prefixes: ["dbt_", "py_"]) so py_ schemas are also dropped,
or alternatively add a brief comment in the workflow explaining why only "dbt_"
is passed if that was intentional; target the run invocation that builds the
--args payload for elementary.drop_stale_ci_schemas.
In `@macros/utils/ci_cleanup/drop_stale_ci_schemas.sql`:
- Around line 77-80: The DROP DATABASE statement in macro
clickhouse__drop_ci_schema uses an unquoted schema_name; update the run_query
call to quote the identifier (e.g., wrap schema_name in backticks) or use the
adapter's identifier-quoting helper if available, so run_query("DROP DATABASE IF
EXISTS " ~ quoted_schema_name) is executed and then call adapter.commit() as
before; reference the macro clickhouse__drop_ci_schema, the schema_name
variable, the run_query invocation, and adapter.commit() when making the change.
- Around line 48-51: The loop currently calls elementary.drop_ci_schema directly
so any failure aborts the whole run; change it to attempt each drop inside a
guarded block (wrap the call to drop_ci_schema in a try/catch-equivalent Jinja
construct), only increment ns.dropped on success, and on failure increment a new
ns.failed (or ns.errors) counter and call log(...) with the schema_name and the
captured error details so the job continues cleaning other schemas; at the end
of the macro emit a summary using ns.dropped and ns.failed to report which drops
succeeded and which failed.
ℹ️ Review info
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Disabled knowledge base sources:
- Linear integration is disabled
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (2)
.github/workflows/cleanup-stale-schemas.yml
macros/utils/ci_cleanup/drop_stale_ci_schemas.sql
```jinja
{% set database = elementary.target_database() %}
{% set all_schemas = adapter.list_schemas(database) %}
{% set now = modules.datetime.datetime.utcnow() %}
```
🧩 Analysis chain
🌐 Web query:
Python 3.12 datetime.utcnow() deprecation replacement
💡 Result:
In Python 3.12, datetime.datetime.utcnow() is deprecated; use a timezone-aware UTC datetime instead. (docs.python.org)
Replacement:

```python
from datetime import datetime, UTC
now_utc = datetime.now(UTC)
```

If you need compatibility with older Python versions:

```python
from datetime import datetime, timezone
now_utc = datetime.now(timezone.utc)
```

Related: utcfromtimestamp() replacement:

```python
from datetime import datetime, UTC
dt_utc = datetime.fromtimestamp(ts, tz=UTC)
```

Citations:
- https://docs.python.org/3/whatsnew/3.12.html
- https://docs.python.org/3.11/library/datetime.html
datetime.utcnow() is deprecated in Python 3.12+.
Use datetime.now(timezone.utc) instead. The suggested fix should avoid stripping the timezone information with .replace(tzinfo=None), as this defeats the purpose of using timezone-aware datetimes. If the code elsewhere compares with naive datetimes, those should be updated to be timezone-aware as well.
♻️ Suggested fix

```diff
-    {% set now = modules.datetime.datetime.utcnow() %}
+    {% set now = modules.datetime.datetime.now(modules.datetime.timezone.utc) %}
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@macros/utils/ci_cleanup/drop_stale_ci_schemas.sql` at line 23, The code sets
now = modules.datetime.datetime.utcnow(), which uses a deprecated naive UTC;
change this to a timezone-aware UTC datetime by calling
modules.datetime.datetime.now(modules.datetime.timezone.utc) (or equivalent)
when assigning now in the macro, and remove any .replace(tzinfo=None) usage so
timezone info is preserved; also audit places that compare now to other
datetimes (e.g., functions referencing now) and update those comparisons to use
timezone-aware datetimes so no naive vs aware comparisons occur.
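The naive/aware mismatch behind this finding is easy to demonstrate; mixing the two in arithmetic raises at runtime, which is why the prompt asks to update every comparison site, not just the assignment of `now`:

```python
from datetime import datetime, timezone

aware = datetime.now(timezone.utc)   # recommended replacement for utcnow()
naive = datetime.now()               # naive, like the value utcnow() returns

try:
    _ = aware - naive                # arithmetic across naive/aware is an error
    mixed_ok = True
except TypeError as exc:
    mixed_ok = False
    print("cannot mix naive and aware datetimes:", exc)

# Parse a CI timestamp as aware so it can be compared with `aware`:
schema_ts = datetime.strptime("250114_031500", "%y%m%d_%H%M%S").replace(tzinfo=timezone.utc)
age_hours = (aware - schema_ts).total_seconds() / 3600
```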
```jinja
{% if ns_valid.ok %}
    {% set schema_ts = modules.datetime.datetime.strptime(ts_str, '%y%m%d_%H%M%S') %}
```
strptime can raise ValueError for invalid date values.
The digit validation ensures characters are numeric but doesn't validate actual date ranges. For example, 260232_112345 (day 32) would pass digit validation but cause strptime to raise ValueError, aborting the entire cleanup run.
Consider wrapping in a try/except to skip malformed schemas gracefully:
🛡️ Suggested defensive handling

```diff
 {% if ns_valid.ok %}
-    {% set schema_ts = modules.datetime.datetime.strptime(ts_str, '%y%m%d_%H%M%S') %}
-    {% set age_seconds = (now - schema_ts).total_seconds() %}
-    {% if age_seconds > max_age_seconds %}
+    {% set schema_ts = none %}
+    {% try %}
+        {% set schema_ts = modules.datetime.datetime.strptime(ts_str, '%y%m%d_%H%M%S') %}
+    {% except %}
+        {{ log("  skip " ~ schema_name ~ " (invalid timestamp: " ~ ts_str ~ ")", info=true) }}
+    {% endtry %}
+    {% if schema_ts is not none %}
+        {% set age_seconds = (now - schema_ts).total_seconds() %}
+        {% if age_seconds > max_age_seconds %}
```

Note: Jinja2 doesn't have native try/except, so the `{% try %}` block above is not valid as written. You may need to validate month (01-12) and day (01-31) ranges explicitly, or accept the current behavior given CI schemas are auto-generated with valid timestamps.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@macros/utils/ci_cleanup/drop_stale_ci_schemas.sql` around lines 45 - 46, The
current call to modules.datetime.datetime.strptime(ts_str, '%y%m%d_%H%M%S') can
raise ValueError for malformed dates; before calling strptime (inside the
ns_valid.ok branch where schema_ts is set), validate the numeric ranges of the
parsed components from ts_str (year/day/month/hour/minute/second substrings) —
e.g., month 01-12, day 01-31, hour 00-23, minute/second 00-59 — and only call
modules.datetime.datetime.strptime to create schema_ts when those checks pass;
if validation fails, skip that schema (do not call strptime) so malformed ts_str
values don’t abort the cleanup run.
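In plain Python the graceful-skip version is a one-liner with try/except (a sketch; inside the Jinja macro an explicit range check is needed instead, since Jinja offers no exception handling):

```python
from datetime import datetime

def parse_ci_timestamp(ts_str):
    """Parse a YYMMDD_HHMMSS string, returning None for malformed values
    instead of letting ValueError abort the whole cleanup run."""
    try:
        return datetime.strptime(ts_str, "%y%m%d_%H%M%S")
    except ValueError:
        return None

print(parse_ci_timestamp("260115_112345"))  # -> 2026-01-15 11:23:45
print(parse_ci_timestamp("260232_112345"))  # day 32 -> None
```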
There was a problem hiding this comment.
🧹 Nitpick comments (1)
integration_tests/dbt_project/macros/test_drop_stale_ci_schemas.sql (1)
39-43: Consider guarding the recent schema drop.

Line 43 unconditionally drops `recent_schema`, but if the cleanup logic unexpectedly dropped it (e.g., due to a bug or race condition), this could fail on adapters that don't support IF EXISTS semantics in their `drop_ci_schema` implementation.

However, looking at the main macro in `macros/utils/ci_cleanup/drop_stale_ci_schemas.sql`, the default implementation uses `dbt.drop_schema`, which typically handles non-existent schemas gracefully. This is likely fine.

🔧 Optional: Add existence check for consistency

```diff
     {% if old_exists_after %}
         {% do elementary.drop_ci_schema(database, old_schema) %}
     {% endif %}
-    {% do elementary.drop_ci_schema(database, recent_schema) %}
+    {% if recent_exists_after %}
+        {% do elementary.drop_ci_schema(database, recent_schema) %}
+    {% endif %}
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@integration_tests/dbt_project/macros/test_drop_stale_ci_schemas.sql` around lines 39 - 43, The test unconditionally calls elementary.drop_ci_schema(database, recent_schema) which can fail if the schema was already removed; wrap that call in the same existence guard used above (e.g., mirror the old_exists_after conditional or compute a recent_exists_after before calling) so you only call drop_ci_schema(database, recent_schema) when the recent schema actually exists; reference the existing symbols recent_schema, old_exists_after, old_schema and the macro/function drop_ci_schema to locate where to add the conditional guard.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@integration_tests/dbt_project/macros/test_drop_stale_ci_schemas.sql`:
- Around line 39-43: The test unconditionally calls
elementary.drop_ci_schema(database, recent_schema) which can fail if the schema
was already removed; wrap that call in the same existence guard used above
(e.g., mirror the old_exists_after conditional or compute a recent_exists_after
before calling) so you only call drop_ci_schema(database, recent_schema) when
the recent schema actually exists; reference the existing symbols recent_schema,
old_exists_after, old_schema and the macro/function drop_ci_schema to locate
where to add the conditional guard.
ℹ️ Review info
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Disabled knowledge base sources:
- Linear integration is disabled
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (2)
integration_tests/dbt_project/macros/test_drop_stale_ci_schemas.sql
integration_tests/tests/test_ci_cleanup.py
Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@integration_tests/dbt_project/macros/test_drop_stale_ci_schemas.sql`:
- Around line 70-73: The macro clickhouse__edr_create_schema currently
concatenates schema_name into SQL and ignores the database param; instead
construct a dbt Relation via api.Relation.create(database=database,
schema=schema_name) (or at minimum use adapter.quote_identifier on both database
and schema_name) and use that quoted identifier in the run_query call, then call
adapter.commit(); update clickhouse__edr_create_schema to build the relation
with api.Relation.create(database=..., schema=...) and use the adapter’s quoting
helpers (adapter.quote_identifier or relation rendering) for the CREATE DATABASE
IF NOT EXISTS statement before committing.
ℹ️ Review info
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Disabled knowledge base sources:
- Linear integration is disabled
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (1)
integration_tests/dbt_project/macros/test_drop_stale_ci_schemas.sql
```jinja
{% macro clickhouse__edr_create_schema(database, schema_name) %}
    {% do run_query("CREATE DATABASE IF NOT EXISTS " ~ schema_name) %}
    {% do adapter.commit() %}
{% endmacro %}
```
🧩 Analysis chain
🏁 Script executed:

```shell
# First, locate the file and read the specific lines
find . -name "test_drop_stale_ci_schemas.sql" -type f | head -1
```

Repository: elementary-data/dbt-data-reliability
Length of output: 149
🏁 Script executed:

```shell
# Check the actual content around lines 70-73
if [ -f "integration_tests/dbt_project/macros/test_drop_stale_ci_schemas.sql" ]; then
  cat -n "integration_tests/dbt_project/macros/test_drop_stale_ci_schemas.sql" | sed -n '65,80p'
fi
```

Repository: elementary-data/dbt-data-reliability
Length of output: 498
🏁 Script executed:

```shell
# Search for other ClickHouse macro implementations to understand patterns
rg "clickhouse__" --type sql -A 3 | head -100
```

Repository: elementary-data/dbt-data-reliability
Length of output: 7839
🏁 Script executed:

```shell
# Check for adapter.quote usage in ClickHouse macros
rg "adapter\.quote" --type sql -B 2 -A 2 | head -80
```

Repository: elementary-data/dbt-data-reliability
Length of output: 707
🏁 Script executed:

```shell
# Search for CREATE DATABASE usage in the codebase
rg "CREATE DATABASE" --type sql -B 2 -A 2
```

Repository: elementary-data/dbt-data-reliability
Length of output: 597
🏁 Script executed:

```shell
# Find the default implementation of edr_create_schema
rg "macro edr_create_schema" --type sql -A 5 | grep -v "clickhouse__" | head -20
```

Repository: elementary-data/dbt-data-reliability
Length of output: 807
🏁 Script executed:

```shell
# Check the database parameter usage in similar default macros
rg "edr_drop_schema|edr_create_schema" --type sql -B 1 -A 4 | grep -E "(macro|database)" | head -20
```

Repository: elementary-data/dbt-data-reliability
Length of output: 2536
Quote the identifier and align with the interface contract.
The schema_name parameter is directly concatenated into the SQL query without quoting. While this works in the test context with programmatically generated names, it's fragile if names contain special characters. More importantly, the database parameter is accepted but unused, inconsistent with the default implementation which properly uses it via api.Relation.create().
♻️ Proposed fix

```diff
 {% macro clickhouse__edr_create_schema(database, schema_name) %}
-    {% do run_query("CREATE DATABASE IF NOT EXISTS " ~ schema_name) %}
+    {% do run_query("CREATE DATABASE IF NOT EXISTS " ~ adapter.quote(schema_name)) %}
     {% do adapter.commit() %}
 {% endmacro %}
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@integration_tests/dbt_project/macros/test_drop_stale_ci_schemas.sql` around
lines 70 - 73, The macro clickhouse__edr_create_schema currently concatenates
schema_name into SQL and ignores the database param; instead construct a dbt
Relation via api.Relation.create(database=database, schema=schema_name) (or at
minimum use adapter.quote_identifier on both database and schema_name) and use
that quoted identifier in the run_query call, then call adapter.commit(); update
clickhouse__edr_create_schema to build the relation with
api.Relation.create(database=..., schema=...) and use the adapter’s quoting
helpers (adapter.quote_identifier or relation rendering) for the CREATE DATABASE
IF NOT EXISTS statement before committing.
…eanup-stale-ci-schemas
…ntary.schema_exists in test ClickHouse adapter doesn't expose adapter.list_schemas() or adapter.check_schema_exists() on its RuntimeDatabaseWrapper. Add a dispatched list_ci_schemas macro (SHOW DATABASES for ClickHouse, adapter.list_schemas for others) and use elementary.schema_exists() in the test helper. Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
ClickHouse adapter doesn't support adapter.check_schema_exists() on RuntimeDatabaseWrapper. Add a ClickHouse-specific dispatch that uses SHOW DATABASES LIKE query instead. Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
Actionable comments posted: 3
♻️ Duplicate comments (2)
macros/utils/ci_cleanup/drop_stale_ci_schemas.sql (1)
45-47: ⚠️ Potential issue | 🟠 Major

Guard timestamp parsing against invalid calendar values.
Line 46 can still raise on values like invalid day/month and abort the whole cleanup operation. This was already flagged in an earlier review and is still unresolved.
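The exception-free range check that a Jinja macro would need can be expressed in Python like this (a sketch; note it only filters out-of-range components, so calendar-impossible dates such as Feb 30 still pass and strptime remains the final arbiter):

```python
def ts_components_in_range(ts_str):
    """Range-check YYMMDD_HHMMSS components without try/except,
    mirroring what a Jinja macro (no exception handling) can do."""
    if len(ts_str) != 13 or ts_str[6] != "_":
        return False
    if not (ts_str[:6].isdigit() and ts_str[7:].isdigit()):
        return False
    mm, dd = int(ts_str[2:4]), int(ts_str[4:6])
    hh, mi, ss = int(ts_str[7:9]), int(ts_str[9:11]), int(ts_str[11:13])
    return (1 <= mm <= 12 and 1 <= dd <= 31
            and 0 <= hh <= 23 and 0 <= mi <= 59 and 0 <= ss <= 59)

print(ts_components_in_range("260115_112345"))  # True
print(ts_components_in_range("260232_112345"))  # day 32 -> False
```

This catches day-32 or month-13 style values, which are the realistic failure mode for auto-generated CI schema names.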
♻️ Proposed fix
♻️ Proposed fix

```diff
 {% if ns_valid.ok %}
-    {% set schema_ts = modules.datetime.datetime.strptime(ts_str, '%y%m%d_%H%M%S') %}
-    {% set age_seconds = (now - schema_ts).total_seconds() %}
-    {% if age_seconds > max_age_seconds %}
+    {% set yy = ts_str[0:2] | int %}
+    {% set mm = ts_str[2:4] | int %}
+    {% set dd = ts_str[4:6] | int %}
+    {% set hh = ts_str[7:9] | int %}
+    {% set mi = ts_str[9:11] | int %}
+    {% set ss = ts_str[11:13] | int %}
+    {% if 1 <= mm <= 12 and 1 <= dd <= 31 and 0 <= hh <= 23 and 0 <= mi <= 59 and 0 <= ss <= 59 %}
+        {% set schema_ts = modules.datetime.datetime.strptime(ts_str, '%y%m%d_%H%M%S') %}
+        {% set age_seconds = (now - schema_ts).total_seconds() %}
+        {% if age_seconds > max_age_seconds %}
+            {{ log("  DROP " ~ schema_name ~ " (age: " ~ (age_seconds / 3600) | round(1) ~ " h)", info=true) }}
+            {% do elementary.drop_ci_schema(database, schema_name) %}
+            {% set ns.dropped = ns.dropped + 1 %}
+        {% else %}
+            {{ log("  keep " ~ schema_name ~ " (age: " ~ (age_seconds / 3600) | round(1) ~ " h)", info=true) }}
+        {% endif %}
     {% else %}
-        {{ log("  keep " ~ schema_name ~ " (age: " ~ (age_seconds / 3600) | round(1) ~ " h)", info=true) }}
+        {{ log("  skip " ~ schema_name ~ " (invalid timestamp: " ~ ts_str ~ ")", info=true) }}
     {% endif %}
 {% endif %}
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@macros/utils/ci_cleanup/drop_stale_ci_schemas.sql` around lines 45 - 47, Guard the timestamp parse by wrapping the modules.datetime.datetime.strptime(ts_str, '%y%m%d_%H%M%S') call (the schema_ts computation) in a try/except so invalid calendar values (bad day/month, etc.) do not raise and abort the cleanup; on exception, log or mark the schema as invalid and continue (i.e., skip computing age_seconds and avoid using schema_ts) so the rest of the loop proceeds safely when ns_valid.ok is true.

integration_tests/dbt_project/macros/test_drop_stale_ci_schemas.sql (1)
70-73: ⚠️ Potential issue | 🟠 Major

Quote `schema_name` in the ClickHouse CREATE DATABASE statement.

This is the same unresolved issue raised earlier: direct concatenation is fragile, and the `database` arg remains unused in this implementation path.

♻️ Proposed fix

```diff
 {% macro clickhouse__edr_create_schema(database, schema_name) %}
-    {% do run_query("CREATE DATABASE IF NOT EXISTS " ~ schema_name) %}
+    {% do run_query("CREATE DATABASE IF NOT EXISTS " ~ adapter.quote(schema_name)) %}
     {% do adapter.commit() %}
 {% endmacro %}
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@integration_tests/dbt_project/macros/test_drop_stale_ci_schemas.sql` around lines 70 - 73, The macro clickhouse__edr_create_schema currently concatenates schema_name directly and ignores the database arg; change it to build a fully qualified name using the database argument when present and pass that through the SQL identifier-quoting helper instead of raw concatenation (e.g., compute full_name = database ? database ~ "." ~ schema_name : schema_name, then call run_query("CREATE DATABASE IF NOT EXISTS " ~ adapter.quote(full_name)) ), and remove the direct concatenation so schema_name is properly quoted and database is used.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@integration_tests/dbt_project/macros/test_drop_stale_ci_schemas.sql`:
- Line 11: Replace the deprecated call modules.datetime.datetime.utcnow() used
to set the now variable with a timezone-aware call: use
modules.datetime.datetime.now(modules.datetime.timezone.utc) so the now variable
is UTC-aware; update the expression where now is set (the line containing
modules.datetime.datetime.utcnow()) to call .now(...) with
modules.datetime.timezone.utc to avoid the deprecation and runtime warnings.
In `@macros/utils/ci_cleanup/drop_stale_ci_schemas.sql`:
- Around line 99-101: The macro clickhouse__drop_ci_schema currently injects
schema_name directly into the SQL passed to run_query; change it to quote/escape
the identifier before building the DROP DATABASE statement (e.g., wrap the
schema name in backticks or use the adapter's identifier-quoting helper) and
ensure any existing quote characters in schema_name are escaped to avoid SQL
injection or syntax errors; update the run_query call that builds "DROP DATABASE
IF EXISTS ..." to use the quoted/escaped schema identifier instead of raw
schema_name.
- Around line 16-25: The macro drop_stale_ci_schemas can receive max_age_hours
as a string which breaks the seconds calculation; coerce max_age_hours to an
integer before computing max_age_seconds so the math is valid (e.g., use Jinja's
int filter or an explicit int cast) and update the line that sets
max_age_seconds to multiply the integer value by 3600; keep the change inside
the drop_stale_ci_schemas macro and reference the max_age_hours and
max_age_seconds symbols.
---
Duplicate comments:
In `@integration_tests/dbt_project/macros/test_drop_stale_ci_schemas.sql`:
- Around line 70-73: The macro clickhouse__edr_create_schema currently
concatenates schema_name directly and ignores the database arg; change it to
build a fully qualified name using the database argument when present and pass
that through the SQL identifier-quoting helper instead of raw concatenation
(e.g., compute full_name = database ? database ~ "." ~ schema_name :
schema_name, then call run_query("CREATE DATABASE IF NOT EXISTS " ~
adapter.quote(full_name)) ), and remove the direct concatenation so schema_name
is properly quoted and database is used.
In `@macros/utils/ci_cleanup/drop_stale_ci_schemas.sql`:
- Around line 45-47: Guard the timestamp parse by wrapping the
modules.datetime.datetime.strptime(ts_str, '%y%m%d_%H%M%S') call (the schema_ts
computation) in a try/except so invalid calendar values (bad day/month, etc.) do
not raise and abort the cleanup; on exception, log or mark the schema as invalid
and continue (i.e., skip computing age_seconds and avoid using schema_ts) so the
rest of the loop proceeds safely when ns_valid.ok is true.
ℹ️ Review info
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Disabled knowledge base sources:
- Linear integration is disabled
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (2)
integration_tests/dbt_project/macros/test_drop_stale_ci_schemas.sql
macros/utils/ci_cleanup/drop_stale_ci_schemas.sql
```jinja
{% macro test_drop_stale_ci_schemas() %}
    {% set database = elementary.target_database() %}
    {% set now = modules.datetime.datetime.utcnow() %}
```
🧩 Analysis chain
🏁 Script executed:

```shell
find . -name "test_drop_stale_ci_schemas.sql" -type f
```

Repository: elementary-data/dbt-data-reliability
Length of output: 149
🌐 Web query:
Python 3.12 datetime.utcnow() deprecation removal
💡 Result:

- `datetime.datetime.utcnow()` (and `utcfromtimestamp()`) were deprecated in Python 3.12 and now emit a `DeprecationWarning` saying they're "scheduled for removal in a future version." The official replacement is timezone-aware UTC datetimes, e.g. `datetime.datetime.now(tz=datetime.UTC)`. [1]
- As of the Python 3.15.0a0 docs, `utcnow()` is still present (still marked "Deprecated since version 3.12"), which indicates it has not been removed yet, and no specific removal version is stated in the docs. [2]
- The deprecation notice is still listed in Python 3.14 deprecations as well. [3]

Recommended migration:

```python
from datetime import datetime, UTC
dt = datetime.now(UTC)  # instead of datetime.utcnow()
```

Sources: [1] [2] [3]
🏁 Script executed:

```shell
cat -n ./integration_tests/dbt_project/macros/test_drop_stale_ci_schemas.sql | head -30
```

Repository: elementary-data/dbt-data-reliability
Length of output: 1691
🏁 Script executed:

```shell
# Check if there are other datetime usages in the codebase to understand patterns
rg "modules\.datetime" integration_tests/dbt_project/macros/ -A 2 -B 2
# Also check dbt version or requirements
fd requirements -type f -name "*.txt" -o -name "*.yml" -o -name "setup.py" | head -5
```

Repository: elementary-data/dbt-data-reliability
Length of output: 818
🌐 Web query:
dbt Jinja modules.datetime.timezone Python version compatibility
💡 Result:
modules.datetime.timezone is part of Python’s standard library datetime module, and it’s available in all modern Python 3 versions that dbt supports—so it’s not a dbt/Python-version incompatibility issue in practice.
- In dbt's Jinja context, the `modules` object exposes Python stdlib modules, and `modules.datetime` is commonly used for date/time logic in macros (e.g., `modules.datetime.datetime.now()`, `modules.datetime.timedelta(...)`). That same object should also expose `modules.datetime.timezone` (e.g., `modules.datetime.timezone.utc`). [3]
- Current dbt-core releases (e.g., dbt-core 1.11.6, released Feb 17, 2026) require Python >= 3.10 and declare support for Python 3.10–3.13, so `datetime.timezone` will be present. [1]
- dbt-core 1.11 explicitly dropped Python 3.9, reinforcing that you're in "modern Python" territory where `datetime.timezone` exists. [2]
If you’re seeing an error like “timezone is undefined” or “has no attribute timezone”, it’s more likely due to how the Jinja context is being executed (e.g., a non-dbt Jinja renderer, older tooling that doesn’t provide modules, or a linter/templater mismatch), not Python’s datetime module.
Sources: [1] [2] [3]
Use timezone-aware UTC instead of utcnow() in the test macro.
The datetime.utcnow() method is deprecated in Python 3.12+ and will be removed in a future version. Replace it with datetime.now(timezone.utc) to maintain forward compatibility and avoid runtime warnings.
♻️ Proposed fix
- {% set now = modules.datetime.datetime.utcnow() %}
+ {% set now = modules.datetime.datetime.now(modules.datetime.timezone.utc) %}
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| {% set now = modules.datetime.datetime.utcnow() %} | |
| {% set now = modules.datetime.datetime.now(modules.datetime.timezone.utc) %} |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@integration_tests/dbt_project/macros/test_drop_stale_ci_schemas.sql` at line
11, Replace the deprecated call modules.datetime.datetime.utcnow() used to set
the now variable with a timezone-aware call: use
modules.datetime.datetime.now(modules.datetime.timezone.utc) so the now variable
is UTC-aware; update the expression where now is set (the line containing
modules.datetime.datetime.utcnow()) to call .now(...) with
modules.datetime.timezone.utc to avoid the deprecation and runtime warnings.
| {% macro clickhouse__drop_ci_schema(database, schema_name) %} | ||
| {% do run_query("DROP DATABASE IF EXISTS " ~ schema_name) %} | ||
| {% do adapter.commit() %} |
Quote the ClickHouse database identifier before executing DROP DATABASE.
Line 100 concatenates schema_name directly into SQL, which is brittle for non-simple identifiers.
♻️ Proposed fix
{% macro clickhouse__drop_ci_schema(database, schema_name) %}
- {% do run_query("DROP DATABASE IF EXISTS " ~ schema_name) %}
+ {% do run_query("DROP DATABASE IF EXISTS " ~ adapter.quote(schema_name)) %}
{% do adapter.commit() %}
{% endmacro %}
📝 Committable suggestion
| {% macro clickhouse__drop_ci_schema(database, schema_name) %} | |
| {% do run_query("DROP DATABASE IF EXISTS " ~ schema_name) %} | |
| {% do adapter.commit() %} | |
| {% macro clickhouse__drop_ci_schema(database, schema_name) %} | |
| {% do run_query("DROP DATABASE IF EXISTS " ~ adapter.quote(schema_name)) %} | |
| {% do adapter.commit() %} | |
| {% endmacro %} |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@macros/utils/ci_cleanup/drop_stale_ci_schemas.sql` around lines 99 - 101, The
macro clickhouse__drop_ci_schema currently injects schema_name directly into the
SQL passed to run_query; change it to quote/escape the identifier before
building the DROP DATABASE statement (e.g., wrap the schema name in backticks or
use the adapter's identifier-quoting helper) and ensure any existing quote
characters in schema_name are escaped to avoid SQL injection or syntax errors;
update the run_query call that builds "DROP DATABASE IF EXISTS ..." to use the
quoted/escaped schema identifier instead of raw schema_name.
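A minimal Python sketch of the identifier quoting this fix asks for. The helper name is hypothetical, and it assumes MySQL-style backtick quoting (which ClickHouse also uses), where a literal backtick inside a quoted identifier is escaped by doubling it:

```python
def quote_backtick_identifier(name: str) -> str:
    # Wrap the identifier in backticks; escape embedded backticks by doubling.
    return "`" + name.replace("`", "``") + "`"

sql = "DROP DATABASE IF EXISTS " + quote_backtick_identifier("dbt_250101_000000_branch_ab12")
# → DROP DATABASE IF EXISTS `dbt_250101_000000_branch_ab12`
```

Within dbt itself, `adapter.quote(...)` (as in the suggested fix above) is preferable because it delegates to the adapter's quoting policy instead of hard-coding a quote character.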
| from dbt_project import DbtProject | ||
| # Dremio does not support DROP SCHEMA reliably; skip it. |
I see that Dremio has "DROP FOLDER", should we perhaps use that?
Good point! I looked into it — Dremio's DROP FOLDER is specifically for Arctic/Nessie catalog folders, not general schemas/spaces.
Looking at the existing codebase, clear_env.sql doesn't have a Dremio-specific dispatch — it falls through to the default dbt.drop_schema, which the dbt-dremio adapter handles internally. So the cleanup macro's default dbt.drop_schema path should already work for Dremio.
The skip was a conservative choice since I wasn't sure if Dremio could create arbitrary schemas via dbt.create_schema (which the test needs). I can try removing the skip and let CI tell us if it works — would you like me to do that?
Done — removed the Dremio skip in fe9ff60. CI will tell us if it works. Also fixed two other issues in the same push:
- Double-encoding bug (test macro was returning tojson(results) but log_macro_results already calls tojson)
- ClickHouse schema_exists now uses exact-match (system.databases WHERE name = ...) instead of SHOW DATABASES LIKE which treats _ as wildcard
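The underscore pitfall is easy to reproduce outside SQL: in a LIKE pattern, `_` matches any single character and `%` any run of characters. A small Python helper (hypothetical, not from the PR) that mirrors LIKE semantics shows the false positive:

```python
import re

def sql_like_to_regex(pattern: str) -> str:
    # SQL LIKE semantics: '%' matches any run of characters,
    # '_' matches exactly one character; everything else is literal.
    out = []
    for ch in pattern:
        if ch == "%":
            out.append(".*")
        elif ch == "_":
            out.append(".")
        else:
            out.append(re.escape(ch))
    return "^" + "".join(out) + "$"

rx = sql_like_to_regex("dbt_ci")
assert re.match(rx, "dbt_ci")        # intended hit
assert re.match(rx, "dbtXci")        # unintended hit: '_' matched 'X'
assert not re.match(rx, "dbt_ci_x")  # pattern is fixed-length, so no match
```

An exact equality check (`WHERE name = ...`) sidesteps the wildcard semantics entirely, which is why the macro was changed.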
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@macros/utils/cross_db_utils/schema_exists.sql`:
- Around line 33-34: Replace the LIKE-based SHOW DATABASES check with an
exact-match query against ClickHouse's system.databases: instead of calling
run_query("SHOW DATABASES LIKE '" ~ schema ~ "'") use run_query with a SELECT
from system.databases where name = schema (e.g., "SELECT name FROM
system.databases WHERE name = '" ~ schema ~ "'") so the return(result | length >
0) remains valid; update the occurrence that constructs the SQL string around
the schema variable to ensure proper quoting/escaping and keep the rest of the
logic (the run_query call and the return(result | length > 0)) unchanged.
ℹ️ Review info
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Disabled knowledge base sources:
- Linear integration is disabled
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (1)
macros/utils/cross_db_utils/schema_exists.sql
…tch for ClickHouse schema_exists Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
- Replace deprecated utcnow() with now(timezone.utc) in both macros
- Coerce max_age_hours to int before computation
- Guard strptime with date component range validation
- Quote ClickHouse identifiers with backticks in CREATE/DROP DATABASE
- Add comment explaining why workflow only uses dbt_ prefix
Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
…dbt Jinja context Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
adapter.list_schemas() and adapter.check_schema_exists() are not available on RuntimeDatabaseWrapper in run-operation context. Replace with:
- information_schema.schemata for default list/exists
- system.databases for ClickHouse
- New ci_schema_exists dispatched macro for test verification
Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
…mentations BigQuery requires project-qualified INFORMATION_SCHEMA.SCHEMATA queries (`project`.INFORMATION_SCHEMA.SCHEMATA) unlike standard databases. Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
…h standard schema ops Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
…CI schemas Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
| {% macro drop_stale_ci_schemas(prefixes=None, max_age_hours=24) %} | ||
| {% if prefixes is none %} | ||
| {% set prefixes = ['dbt_', 'py_'] %} |
let's make prefixes a mandatory input, and fail if it is not specified
Done — prefixes now defaults to none and raises a CompilationError with a clear message if not provided or empty.
| {% if schema_lower.startswith(prefix.lower()) %} | ||
| {% set remainder = schema_lower[prefix | length :] %} | ||
| {# Timestamp format: YYMMDD_HHMMSS (13 chars) followed by _ #} | ||
| {% if remainder | length >= 14 and remainder[6:7] == '_' and remainder[13:14] == '_' %} |
- could we use a regex with groups to match the branch name parts? it's a bit cleaner no?
`modules.re` is available in dbt.
- let's make a `parse_ci_branch_name` macro and use it here to simplify the main macro.
Done — replaced the manual character-by-character validation with modules.re.match('^(\\d{6})_(\\d{6})_.+', remainder) and extracted it into a parse_ci_schema_name macro. The main loop is now much cleaner — just calls parse_ci_schema_name(schema_name, prefixes) which returns a datetime or none.
Named it parse_ci_schema_name (instead of parse_ci_branch_name) since it takes a schema name as input and returns the parsed timestamp. Let me know if you'd prefer a different name.
| @@ -0,0 +1,156 @@ | |||
| {# | |||
| drop_stale_ci_schemas – clean up timestamped CI schemas. | |||
Could this macro be in the integration tests project?
Done — moved the entire macro to integration_tests/dbt_project/macros/drop_stale_ci_schemas.sql. The old macros/utils/ci_cleanup/ directory is deleted. Workflow now calls drop_stale_ci_schemas without the elementary. prefix.
| {# ── Per-adapter schema listing ─────────────────────────────────────── #} | ||
| {% macro list_ci_schemas(database) %} |
Maybe this can be in its own file? (also in the integration tests project)
and also ci_schema_exists.
Done — list_ci_schemas and ci_schema_exists (with all adapter dispatches) are now in their own file: integration_tests/dbt_project/macros/ci_schema_helpers.sql. All dispatch calls use 'elementary_tests' as the search package.
| {# adapter.list_schemas() is not available in run-operation context | ||
| (RuntimeDatabaseWrapper does not expose it). Use information_schema instead. #} | ||
| {% macro default__list_ci_schemas(database) %} | ||
| {% set results = run_query("SELECT schema_name FROM information_schema.schemata") %} |
As a safety, can we just list schemas in the CI db? (the one in the profiles)
Done — added WHERE lower(catalog_name) = lower('<database>') filter to the default list_ci_schemas implementation. BigQuery was already scoped via project-qualified INFORMATION_SCHEMA. ClickHouse lists all databases (no catalog concept).
| {% endmacro %} | ||
| {% macro default__ci_schema_exists(database, schema_name) %} | ||
| {% set result = run_query("SELECT schema_name FROM information_schema.schemata WHERE lower(schema_name) = lower('" ~ schema_name ~ "')") %} |
likewise, only consider schemas in the CI database.
Done — added WHERE lower(catalog_name) = lower('<database>') AND ... filter to the default ci_schema_exists implementation as well.
Thanks for the thorough review! Addressing all comments now:
Note: moving the macro out of the main package means the
… mandatory prefixes Addresses review comments:
- Move drop_stale_ci_schemas and helpers from main package to integration_tests
- Split list_ci_schemas and ci_schema_exists into ci_schema_helpers.sql
- Make prefixes a mandatory parameter (error if not provided)
- Scope information_schema queries to CI database via catalog_name filter
- Replace manual char validation with modules.re regex in parse_ci_schema_name
- Update dispatch package from 'elementary' to 'elementary_tests'
- Update workflow to call macros without elementary. prefix
Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
| {# ── CI schema name parser ─────────────────────────────────────────── #} | ||
| {% macro parse_ci_schema_name(schema_name, prefixes) %} |
Maybe it should be parse_timestamp_from_ci_schema_name
Done — renamed to parse_timestamp_from_ci_schema_name.
| {% if match %} | ||
| {% set ts_str = match.group(1) ~ '_' ~ match.group(2) %} | ||
| {# Validate date component ranges before calling strptime #} | ||
| {% set mm = ts_str[2:4] | int %} |
as we're anyway parsing all the different parts, maybe:
- The groups should be each individual component (e.g. `set mm = match.group(...)`).
- For readability, can we use named groups?
- Since we already parse all the parts, we can just initialize the datetime object directly, e.g. datetime(2000 + yy, mm, ....)
Done — now uses named groups ((?P<yy>\\d{2})(?P<mm>\\d{2}) etc.), extracts each component via match.group('yy'), and constructs the datetime directly with datetime(2000 + yy, mm, dd, HH, MI, SS) instead of strptime.
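As a rough Python analogue of the macro described here (the real implementation is dbt Jinja; the function name and exact pattern below are reconstructed from this thread, not copied from the code):

```python
import re
from datetime import datetime

# Named groups for each component of the YYMMDD_HHMMSS timestamp,
# followed by '_' and the rest of the schema name (branch, hash, ...).
TS_RE = re.compile(
    r"^(?P<yy>\d{2})(?P<mm>\d{2})(?P<dd>\d{2})_"
    r"(?P<HH>\d{2})(?P<MI>\d{2})(?P<SS>\d{2})_.+"
)

def parse_timestamp_from_ci_schema_name(schema_name, prefixes):
    # Returns the embedded timestamp, or None if the name does not look
    # like a CI schema (wrong prefix, wrong pattern, or invalid date).
    for prefix in prefixes:
        if schema_name.lower().startswith(prefix.lower()):
            match = TS_RE.match(schema_name[len(prefix):])
            if match:
                g = {k: int(v) for k, v in match.groupdict().items()}
                try:
                    return datetime(2000 + g["yy"], g["mm"], g["dd"],
                                    g["HH"], g["MI"], g["SS"])
                except ValueError:  # e.g. month 13: constructor rejects it
                    return None
    return None

parse_timestamp_from_ci_schema_name("dbt_200101_000000_citest_00000000", ["dbt_", "py_"])
# → datetime(2020, 1, 1, 0, 0)
```

Constructing the datetime directly means range validation (month 13, day 32, ...) comes for free from the constructor, replacing the separate guard that strptime needed.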
| {% endmacro %} | ||
| {# ── Per-adapter schema drop ─────────────────────────────────────────── #} |
Maybe drop_ci_schema should also be in the schema helpers?
Done — yes, moved drop_ci_schema to schema_utils/drop_ci_schema.sql alongside the other helpers.
| #} | ||
| {# ── Per-adapter schema listing (scoped to CI database) ────────────── #} | ||
I think it may be good to create a folder called schema_utils with each one of these macros in a separate file.
Done — created schema_utils/ folder with each macro in its own file:
- schema_utils/list_ci_schemas.sql
- schema_utils/ci_schema_exists.sql
- schema_utils/drop_ci_schema.sql
- schema_utils/parse_timestamp_from_ci_schema_name.sql
…e parse macro Addresses review comments:
- Create schema_utils/ folder with each helper in its own file: list_ci_schemas, ci_schema_exists, drop_ci_schema, parse_timestamp_from_ci_schema_name
- Rename parse_ci_schema_name to parse_timestamp_from_ci_schema_name
- Use named regex groups: (?P<yy>), (?P<mm>), etc.
- Construct datetime directly via datetime(2000+yy, mm, dd, HH, MI, SS) instead of strptime
- Move drop_ci_schema from main macro file to schema_utils/
- Delete old ci_schema_helpers.sql (replaced by individual files)
Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
…I-specific) Addresses review comment:
- schema_utils/ contains generic helpers: list_schemas_sql, schema_exists_sql, drop_schema_sql (renamed from ci_ prefixed versions)
- ci_schemas_cleanup/ contains CI-specific macros: drop_stale_ci_schemas, parse_timestamp_from_ci_schema_name, test_drop_stale_ci_schemas
Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
…nternals
dbt-databricks defines list_schemas_sql and check_schema_exists_sql as
internal helper macros. Our identically-named macros were overriding them,
causing dbt run to send [] (Python list) as SQL instead of SHOW SCHEMAS.
Renamed: list_schemas_sql -> edr_list_schemas
schema_exists_sql -> edr_schema_exists
drop_schema_sql -> edr_drop_schema
Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
🧹 Nitpick comments (4)
integration_tests/dbt_project/macros/schema_utils/schema_exists.sql (3)
13-16: Schema name should be escaped to prevent SQL injection. Both
`database` and `schema_name` are directly concatenated into SQL. The `schema_name` parameter originates from `edr_list_schemas` results, which could theoretically contain quotes in edge cases.
♻️ Suggested improvement
{% macro default__edr_schema_exists(database, schema_name) %}
-   {% set result = run_query("SELECT schema_name FROM information_schema.schemata WHERE lower(catalog_name) = lower('" ~ database ~ "') AND lower(schema_name) = lower('" ~ schema_name ~ "')") %}
+   {% set safe_db = database | replace("'", "''") %}
+   {% set safe_schema = schema_name | replace("'", "''") %}
+   {% set result = run_query("SELECT schema_name FROM information_schema.schemata WHERE lower(catalog_name) = lower('" ~ safe_db ~ "') AND lower(schema_name) = lower('" ~ safe_schema ~ "')") %}
    {% do return(result | length > 0) %}
{% endmacro %}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@integration_tests/dbt_project/macros/schema_utils/schema_exists.sql` around lines 13 - 16, The macro default__edr_schema_exists concatenates user-controlled database and schema_name into SQL, risking injection; update it to safely quote/escape both identifiers before embedding in run_query (use the dbt adapter quoting helpers such as adapter.quote or adapter.quote_identifier or otherwise escape single quotes via replace) so the SQL string passed to run_query only contains properly quoted identifiers; locate the macro default__edr_schema_exists and replace direct concatenation of database and schema_name with adapter-based quoting (or explicit escaping) when building the SELECT against information_schema.schemata.
23-26: Same escaping improvement applies to ClickHouse variant.
♻️ Suggested improvement
{% macro clickhouse__edr_schema_exists(database, schema_name) %}
-   {% set result = run_query("SELECT 1 FROM system.databases WHERE name = '" ~ schema_name ~ "' LIMIT 1") %}
+   {% set safe_schema = schema_name | replace("'", "''") %}
+   {% set result = run_query("SELECT 1 FROM system.databases WHERE name = '" ~ safe_schema ~ "' LIMIT 1") %}
    {% do return(result | length > 0) %}
{% endmacro %}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@integration_tests/dbt_project/macros/schema_utils/schema_exists.sql` around lines 23 - 26, The ClickHouse macro clickhouse__edr_schema_exists builds a query by concatenating schema_name into a single-quoted string (run_query("SELECT 1 FROM system.databases WHERE name = '" ~ schema_name ~ "' LIMIT 1")), which can break for names with quotes or special chars; update the macro to safely quote/escape the schema identifier before injecting it into run_query (e.g., use the adapter quoting helper such as adapter.quote_identifier(schema_name) or an equivalent safe-escaping function) and keep the rest of the logic (run_query and the length check) unchanged.
18-21: Same escaping improvement applies to BigQuery variant.
♻️ Suggested improvement
{% macro bigquery__edr_schema_exists(database, schema_name) %}
-   {% set result = run_query("SELECT schema_name FROM `" ~ database ~ "`.INFORMATION_SCHEMA.SCHEMATA WHERE lower(schema_name) = lower('" ~ schema_name ~ "')") %}
+   {% set safe_schema = schema_name | replace("'", "''") %}
+   {% set result = run_query("SELECT schema_name FROM `" ~ database ~ "`.INFORMATION_SCHEMA.SCHEMATA WHERE lower(schema_name) = lower('" ~ safe_schema ~ "')") %}
    {% do return(result | length > 0) %}
{% endmacro %}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@integration_tests/dbt_project/macros/schema_utils/schema_exists.sql` around lines 18 - 21, The BigQuery macro bigquery__edr_schema_exists builds a SQL string with unescaped database and schema_name which can break queries or allow injection; fix it by escaping inputs before constructing the query: in bigquery__edr_schema_exists apply safe escaping to schema_name (e.g., schema_name | replace("'", "''")) and to database (e.g., database | replace("`", "``") or otherwise ensure it's a trusted identifier) when interpolating into the run_query call so the generated SELECT uses a safely quoted string literal for schema_name and a safely quoted identifier for database.integration_tests/dbt_project/macros/schema_utils/list_schemas.sql (1)
13-20: Consider using parameterized queries or proper quoting for the database name. The
`database` parameter is directly concatenated into SQL (line 14). While in this context the value comes from `elementary.target_database()` (controlled configuration), this pattern is fragile if a database name ever contains a single quote. For defense-in-depth, consider using dbt's quoting utilities:
♻️ Suggested improvement
{% macro default__edr_list_schemas(database) %}
-   {% set results = run_query("SELECT schema_name FROM information_schema.schemata WHERE lower(catalog_name) = lower('" ~ database ~ "')") %}
+   {% set safe_db = database | replace("'", "''") %}
+   {% set results = run_query("SELECT schema_name FROM information_schema.schemata WHERE lower(catalog_name) = lower('" ~ safe_db ~ "')") %}
    {% set schemas = [] %}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@integration_tests/dbt_project/macros/schema_utils/list_schemas.sql` around lines 13 - 20, The macro default__edr_list_schemas currently concatenates the database string into SQL via run_query, which can break if the database name contains quotes; update the run_query call to use dbt's quoting helper (e.g., replace the raw concatenation with adapter.quote(database) or adapter.quote_identifier(database)) so the database identifier is properly quoted/escaped before being inserted into the SQL string; keep the rest of the macro (results loop, schemas list, return) intact and reference default__edr_list_schemas and run_query when making the change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@integration_tests/dbt_project/macros/schema_utils/list_schemas.sql`:
- Around line 13-20: The macro default__edr_list_schemas currently concatenates
the database string into SQL via run_query, which can break if the database name
contains quotes; update the run_query call to use dbt's quoting helper (e.g.,
replace the raw concatenation with adapter.quote(database) or
adapter.quote_identifier(database)) so the database identifier is properly
quoted/escaped before being inserted into the SQL string; keep the rest of the
macro (results loop, schemas list, return) intact and reference
default__edr_list_schemas and run_query when making the change.
In `@integration_tests/dbt_project/macros/schema_utils/schema_exists.sql`:
- Around line 13-16: The macro default__edr_schema_exists concatenates
user-controlled database and schema_name into SQL, risking injection; update it
to safely quote/escape both identifiers before embedding in run_query (use the
dbt adapter quoting helpers such as adapter.quote or adapter.quote_identifier or
otherwise escape single quotes via replace) so the SQL string passed to
run_query only contains properly quoted identifiers; locate the macro
default__edr_schema_exists and replace direct concatenation of database and
schema_name with adapter-based quoting (or explicit escaping) when building the
SELECT against information_schema.schemata.
- Around line 23-26: The ClickHouse macro clickhouse__edr_schema_exists builds a
query by concatenating schema_name into a single-quoted string
(run_query("SELECT 1 FROM system.databases WHERE name = '" ~ schema_name ~ "'
LIMIT 1")), which can break for names with quotes or special chars; update the
macro to safely quote/escape the schema identifier before injecting it into
run_query (e.g., use the adapter quoting helper such as
adapter.quote_identifier(schema_name) or an equivalent safe-escaping function)
and keep the rest of the logic (run_query and the length check) unchanged.
- Around line 18-21: The BigQuery macro bigquery__edr_schema_exists builds a SQL
string with unescaped database and schema_name which can break queries or allow
injection; fix it by escaping inputs before constructing the query: in
bigquery__edr_schema_exists apply safe escaping to schema_name (e.g.,
schema_name | replace("'", "''")) and to database (e.g., database | replace("`",
"``") or otherwise ensure it's a trusted identifier) when interpolating into the
run_query call so the generated SELECT uses a safely quoted string literal for
schema_name and a safely quoted identifier for database.
ℹ️ Review info
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Disabled knowledge base sources:
- Linear integration is disabled
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (9)
.github/workflows/cleanup-stale-schemas.ymlintegration_tests/dbt_project/macros/ci_schemas_cleanup/drop_stale_ci_schemas.sqlintegration_tests/dbt_project/macros/ci_schemas_cleanup/parse_timestamp_from_ci_schema_name.sqlintegration_tests/dbt_project/macros/ci_schemas_cleanup/test_drop_stale_ci_schemas.sqlintegration_tests/dbt_project/macros/schema_utils/drop_schema.sqlintegration_tests/dbt_project/macros/schema_utils/list_schemas.sqlintegration_tests/dbt_project/macros/schema_utils/schema_exists.sqlintegration_tests/tests/test_ci_cleanup.pymacros/utils/cross_db_utils/schema_exists.sql
🚧 Files skipped from review as they are similar to previous changes (3)
- macros/utils/cross_db_utils/schema_exists.sql
- .github/workflows/cleanup-stale-schemas.yml
- integration_tests/tests/test_ci_cleanup.py
Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@integration_tests/dbt_project/macros/schema_utils/schema_exists.sql`:
- Around line 20-23: The SQL currently double-backticks the identifier instead
of using BigQuery's required backslash escape; in the macro
bigquery__edr_schema_exists update the escaping call used in the run_query
construction so backticks in identifiers are replaced with "\`"
(backslash-escaped) rather than "``" and apply the same escaping approach to the
database and schema_name usage within the run_query string (refer to the
replace(...) expression inside the run_query and mirror the fix used in
list_schemas.sql for consistency).
ℹ️ Review info
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Disabled knowledge base sources:
- Linear integration is disabled
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (3)
integration_tests/dbt_project/macros/ci_schemas_cleanup/drop_stale_ci_schemas.sqlintegration_tests/dbt_project/macros/schema_utils/list_schemas.sqlintegration_tests/dbt_project/macros/schema_utils/schema_exists.sql
🚧 Files skipped from review as they are similar to previous changes (2)
- integration_tests/dbt_project/macros/schema_utils/list_schemas.sql
- integration_tests/dbt_project/macros/ci_schemas_cleanup/drop_stale_ci_schemas.sql
…bit suggestion) Co-Authored-By: Itamar Hartstein <haritamar@gmail.com>
feat: add weekly cleanup workflow for stale CI schemas
Summary
Adds dbt macros and a weekly GitHub Actions workflow to automatically clean up timestamped CI schemas older than a configurable threshold (default: 24 hours). All cleanup macros live in the integration test project (not the production package).
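The core of the cleanup reduces to a timestamp-age comparison. A hypothetical Python sketch of that decision (the actual macro implements it in dbt Jinja with naive UTC datetimes; the function name here is illustrative):

```python
from datetime import datetime, timedelta

def is_stale(schema_ts: datetime, now: datetime, max_age_hours: int) -> bool:
    # A schema is stale when its embedded timestamp is older than the
    # threshold; both datetimes are naive UTC, so subtraction is safe.
    return now - schema_ts > timedelta(hours=int(max_age_hours))

now = datetime(2026, 3, 1, 12, 0, 0)
assert is_stale(datetime(2020, 1, 1), now, 24)          # years old: drop
assert not is_stale(now - timedelta(hours=2), now, 24)  # fresh CI schema: keep
```

Coercing `max_age_hours` to int mirrors the macro, since run-operation arguments arrive as strings.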
Macro organisation
- integration_tests/dbt_project/macros/schema_utils/: edr_list_schemas, edr_schema_exists
- integration_tests/dbt_project/macros/ci_schemas_cleanup/: drop_stale_ci_schemas, parse_timestamp_from_ci_schema_name, test_drop_stale_ci_schemas (+ edr_create_schema test helper)
- integration_tests/dbt_project/macros/clear_env.sql (pre-existing): edr_drop_schema — reused by the cleanup macro

drop_stale_ci_schemas(prefixes, max_age_hours=24):
- prefixes is mandatory — the macro raises a compiler error if omitted or empty.
- Uses edr_list_schemas to enumerate schemas via SQL queries (adapter-specific: information_schema.schemata filtered by catalog_name for most adapters, project-qualified INFORMATION_SCHEMA.SCHEMATA for BigQuery, SHOW DATABASES for ClickHouse). Direct SQL is used because adapter.list_schemas() is not available in run-operation context.
- Parses the YYMMDD_HHMMSS timestamp from schema names matching <prefix><YYMMDD_HHMMSS>_<branch>_<hash> using regex with named groups ((?P<yy>\d{2})...), then constructs a datetime directly via datetime(2000 + yy, mm, dd, HH, MI, SS) (no strptime call).
- Drops schemas older than max_age_hours via the pre-existing edr_drop_schema from clear_env.sql (default uses dbt.drop_schema; ClickHouse override uses DROP DATABASE IF EXISTS).
- A companion daily cleanup workflow lives in the elementary repo (companion PR: feat: add daily cleanup workflow for stale CI schemas elementary#2128).
integration_tests/tests/test_ci_cleanup.py):dbt_200101_000000_citest_00000000= year 2020) and a recent schema, runs cleanup with an 8760-hour threshold (1 year), asserts the old schema was dropped and the recent one was kept.edr_schema_existsfor verification.DremioRelationpath model is incompatible with standardapi.Relation.create).Workflow (
.github/workflows/cleanup-stale-schemas.yml):workflow_dispatchwith optionalmax-age-hoursinput).generate_profiles.py, installs deps, and invokes the macro withdbt run-operationfrom the integration test project working directory.Updates since last revision
Databricks naming conflict
During CI investigation, discovered that the original macro names (
list_schemas_sql,schema_exists_sql,drop_schema_sql) were shadowing dbt-databricks internal macros with identical names. This caused Databricks tests to fail withPARSE_SYNTAX_ERRORwhere a Python[]list was sent as SQL instead of a SQL query string.Fix: Renamed all macros to use the
edr_prefix (edr_list_schemas,edr_schema_exists,edr_drop_schema) to avoid conflicts. Theedr_drop_schemamacro already existed inclear_env.sqlfor test environment cleanup, so the duplicatedrop_schema.sqlfile was removed and the cleanup logic now reuses the existing implementation.SQL escaping improvements
Applied CodeRabbit suggestions for defense-in-depth SQL escaping:
- database | replace("'", "''"), schema_name | replace("'", "''") for string literals
- database | replace("`", "``") for backtick-quoted identifiers
While database/schema names come from trusted dbt configuration, these changes harden the macros against edge cases.
Review & Testing Checklist for Human
- Reuse of edr_drop_schema from clear_env.sql: This macro was originally written for test environment cleanup. Verify it behaves correctly for the weekly cleanup workflow. Note that the default implementation uses dbt.drop_schema(), which on some adapters (e.g. Databricks) does not use DROP SCHEMA IF EXISTS. If a schema is concurrently dropped between listing and dropping, the macro will error. Consider whether a single failure + retry is acceptable.
- Backtick escaping: The Jinja template uses database | replace to escape backticks. Verify Jinja doesn't double-escape the backslash (CI passed, but worth a sanity check).
- elementary_tests namespace: The schema_utils/ helpers use adapter.dispatch('...', 'elementary_tests'). Verify the companion elementary PR (feat: add daily cleanup workflow for stale CI schemas elementary#2128) can call these macros or has its own copies.
- Dry run: consider a manual run with max_age_hours: 999999 (to only log "keep" messages) before the first scheduled run.
- _elementary suffixed schemas (e.g., dbt_260228_..._elementary) are correctly matched and cleaned up.
- utcnow() is used (deprecated in Python 3.12+) because modules.datetime.timezone is not available in dbt's Jinja context. Both utcnow() and the constructed datetime are naive, so comparisons are safe.
- Dremio: DremioRelation requires a different path structure that's incompatible with standard api.Relation.create(database=..., schema=...) calls.
- clickhouse__schema_exists was added to macros/utils/cross_db_utils/schema_exists.sql.
During CI development, the integration test initially used
max_age_hours=1and failed on Databricks withSCHEMA_NOT_FOUND. The test's cleanup macro listed ALLdbt_*schemas in the database, including leftover schemas from previous CI runs (e.g.,dbt_260228_120808_..._gw4created ~1.5 hours earlier). When the macro tried to drop these, they had already been cleaned up by their respective CI runs, causing the error.The fix: increased the test threshold to
max_age_hours=8760(1 year) so only the artificially old test schema (year 2020 = 26 years old) exceeds the threshold, and real CI schemas (at most a few hours old) are safely ignored. This same race condition could occur in production if two cleanup workflows run concurrently, but the weekly schedule makes this unlikely.Link to Devin run: https://app.devin.ai/sessions/7f7e889a382c4a8e90d83e9e2636eef3
Requested by: @haritamar
Summary by CodeRabbit
New Features
Chores
Tests