-
Notifications
You must be signed in to change notification settings - Fork 117
Fix partition_by for queries without sys__id #1234
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Reviewer's GuidePartitioning now supports queries lacking a sys__id column by automatically creating and using a temporary table, and a new test ensures aggregation after group_by works in this scenario. Sequence diagram for partition_by with missing sys__idsequenceDiagram
participant Query as Query
participant Dataset as Dataset
participant Warehouse as Warehouse
participant TempTable as TempTable
Query->>Dataset: apply(partition_by)
Dataset->>Query: Check for sys__id in selected_columns
alt sys__id missing
Dataset->>Warehouse: create_dataset_rows_table(temp_table_name, columns)
Warehouse-->>TempTable: Return temp table
Dataset->>Warehouse: copy_table(temp_table, query)
Dataset->>TempTable: select()
Dataset->>Dataset: Use temp table as new query
end
Dataset->>Dataset: create_partitions_table(query)
Dataset->>Query: outerjoin with partition table
Dataset-->>Query: Return partitioned query
Class diagram for Dataset partitioning logic updateclassDiagram
class Dataset {
+partition_by
+apply(query)
+create_partitions_table(query)
}
class Warehouse {
+create_dataset_rows_table(name, columns)
+copy_table(temp_table, query)
+temp_table_name()
}
Dataset --> Warehouse : uses
Dataset : +apply(query)
Dataset : +create_partitions_table(query)
Warehouse : +create_dataset_rows_table(name, columns)
Warehouse : +copy_table(temp_table, query)
Warehouse : +temp_table_name()
File-Level Changes
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @dreadatour - I've reviewed your changes - here's some feedback:
- Extract the duplicated sys__id existence checks in apply() and create_partitions_table() into a shared helper to reduce repetition.
- Consider using a SQL window function (e.g. ROW_NUMBER) to synthesize a row identifier instead of physically copying to a temp table for better performance.
- Move the temp‐table creation and copy logic into a private method (e.g. _ensure_sys_id) to keep apply() concise and improve readability.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- Extract the duplicated sys__id existence checks in apply() and create_partitions_table() into a shared helper to reduce repetition.
- Consider using a SQL window function (e.g. ROW_NUMBER) to synthesize a row identifier instead of physically copying to a temp table for better performance.
- Move the temp‐table creation and copy logic into a private method (e.g. _ensure_sys_id) to keep apply() concise and improve readability.
## Individual Comments
### Comment 1
<location> `src/datachain/query/dataset.py:566` </location>
<code_context>
assert self.partition_by is not None
+ # Check if sys__id is in the query, we need it to be able to join
+ # the partition table with the udf table later.
+ assert any(c.name == "sys__id" for c in query.selected_columns), (
+ "Query must have sys__id column to use partitioning."
+ )
</code_context>
<issue_to_address>
Using assert for runtime validation may not be reliable in production.
Assertions can be disabled with the -O flag. Use a specific exception like ValueError to ensure this check is always enforced.
</issue_to_address>
<suggested_fix>
<<<<<<< SEARCH
# Check if sys__id is in the query, we need it to be able to join
# the partition table with the udf table later.
assert any(c.name == "sys__id" for c in query.selected_columns), (
"Query must have sys__id column to use partitioning."
)
=======
# Check if sys__id is in the query, we need it to be able to join
# the partition table with the udf table later.
if not any(c.name == "sys__id" for c in query.selected_columns):
raise ValueError("Query must have sys__id column to use partitioning.")
>>>>>>> REPLACE
</suggested_fix>
### Comment 2
<location> `src/datachain/query/dataset.py:615` </location>
<code_context>
# Apply partitioning if needed.
if self.partition_by is not None:
+ if not any(c.name == "sys__id" for c in query.selected_columns):
+ # If sys__id is not in the query, we need to create a temp table
+ # to hold the query results, so we can join it with the
</code_context>
<issue_to_address>
Duplicated sys__id presence check could be refactored for clarity.
This logic is also present in create_partitions_table; centralizing it would help prevent inconsistencies and simplify maintenance.
Suggested implementation:
```python
# the partition table with the udf table later.
assert self._has_sys_id_column(query), (
"Query must have sys__id column to use partitioning."
)
```
```python
# Apply partitioning if needed.
if self.partition_by is not None:
if not self._has_sys_id_column(query):
# If sys__id is not in the query, we need to create a temp table
# to hold the query results, so we can join it with the
# partition table later.
columns = [
c if isinstance(c, Column) else Column(c.name, c.type)
for c in query.subquery().columns
]
temp_table = self.catalog.warehouse.create_dataset_rows_table(
self.catalog.warehouse.temp_table_name(),
columns=columns,
)
temp_tables.append(temp_table.name)
```
```python
def _has_sys_id_column(self, query):
"""
Returns True if the query's selected columns include a column named 'sys__id'.
"""
return any(c.name == "sys__id" for c in query.selected_columns)
```
You should also update any other locations in this file (such as `create_partitions_table` or similar methods) where the sys__id presence check is performed, replacing the direct check with a call to `self._has_sys_id_column(query)`.
</issue_to_address>
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
# Check if sys__id is in the query, we need it to be able to join | ||
# the partition table with the udf table later. | ||
assert any(c.name == "sys__id" for c in query.selected_columns), ( | ||
"Query must have sys__id column to use partitioning." | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion (bug_risk): Using assert for runtime validation may not be reliable in production.
Assertions can be disabled with the -O flag. Use a specific exception like ValueError to ensure this check is always enforced.
# Check if sys__id is in the query, we need it to be able to join | |
# the partition table with the udf table later. | |
assert any(c.name == "sys__id" for c in query.selected_columns), ( | |
"Query must have sys__id column to use partitioning." | |
) | |
# Check if sys__id is in the query, we need it to be able to join | |
# the partition table with the udf table later. | |
if not any(c.name == "sys__id" for c in query.selected_columns): | |
raise ValueError("Query must have sys__id column to use partitioning.") |
@@ -606,6 +612,22 @@ def apply( | |||
|
|||
# Apply partitioning if needed. | |||
if self.partition_by is not None: | |||
if not any(c.name == "sys__id" for c in query.selected_columns): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion: Duplicated sys__id presence check could be refactored for clarity.
This logic is also present in create_partitions_table; centralizing it would help prevent inconsistencies and simplify maintenance.
Suggested implementation:
# the partition table with the udf table later.
assert self._has_sys_id_column(query), (
"Query must have sys__id column to use partitioning."
)
# Apply partitioning if needed.
if self.partition_by is not None:
if not self._has_sys_id_column(query):
# If sys__id is not in the query, we need to create a temp table
# to hold the query results, so we can join it with the
# partition table later.
columns = [
c if isinstance(c, Column) else Column(c.name, c.type)
for c in query.subquery().columns
]
temp_table = self.catalog.warehouse.create_dataset_rows_table(
self.catalog.warehouse.temp_table_name(),
columns=columns,
)
temp_tables.append(temp_table.name)
def _has_sys_id_column(self, query):
"""
Returns True if the query's selected columns include a column named 'sys__id'.
"""
return any(c.name == "sys__id" for c in query.selected_columns)
You should also update any other locations in this file (such as create_partitions_table
or similar methods) where the sys__id presence check is performed, replacing the direct check with a call to self._has_sys_id_column(query)
.
@@ -606,6 +612,22 @@ def apply( | |||
|
|||
# Apply partitioning if needed. | |||
if self.partition_by is not None: | |||
if not any(c.name == "sys__id" for c in query.selected_columns): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion (code-quality): Invert any/all to simplify comparisons (invert-any-all
)
if not any(c.name == "sys__id" for c in query.selected_columns): | |
if all(c.name != "sys__id" for c in query.selected_columns): |
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #1234 +/- ##
=======================================
Coverage 88.66% 88.67%
=======================================
Files 153 153
Lines 13793 13800 +7
Branches 1927 1928 +1
=======================================
+ Hits 12230 12237 +7
Misses 1109 1109
Partials 454 454
Flags with carried forward coverage won't be shown. Click here to find out more.
🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems a bit of an overkill tbh.
- we do an optimization to use ids
- but now we have to copy tables to have those ids
Q: can we avoid copying? (put assert to fail fast though if it's not present) - can all operations include sys_id at the end?
Q: are we relying on sys_id in some other operations - will we have to copy tables there as well eventually?
Agree, but I can not find a proper way to do this without materializing query to the temp table for the Also the only case I know when query have no Another possible option for query to do not have
No, this is different scenarios.
The reason we cannot avoid copying is that some SQL queries do not return IDs in their results, and I have not found a reliable way to add those IDs to the queries. I could add something like I could add something like More details on internals below (in the next comment).
No, aggregation with |
So why do we need DataChain in general follows classic ETL pattern:
Transformation can be done on two levels:
All transformations can be chained, and results of the previous transformation became an input to the next one. One special case is aggregation with partitioning. Since query from all previous transformers (steps) can be complex, we are doing partitioning the following way:
The reason why we do need to have
Thus Moreover, I am now thinking we should always copy query to the temp table, because we are running query twice and we need to be 100% results will be idempotent. We can get rid of temp table with two fields: |
I also did some unsuccessful experiments.
I keep investigating, but for now I am thinking creating temp table in case of aggregation with partitioning is a best way to handle this issue and to avoid any possible issues in future with queries running twice (e.g. someone want to use |
Part of the #1213
In case query have no
sys_id
field (for example,group_by
query), aggregate does not works (see traceback below).In this PR I am fixing this specific case by creating temp table and copying query result to it.
Test was also added.
Summary by Sourcery
Enable partitioning for queries that lack a sys__id column by copying the query results into a temporary table before applying partition logic, and add a test to verify aggregation after group_by.
Bug Fixes:
Tests: