partition_by with complex signals - includes agg and group_by #1215
Conversation
Co-authored-by: dmitry <[email protected]>
Reviewer's Guide

This PR enables using complex DataModel-based signal types (e.g., File, ImageFile, nested Pydantic models) directly in `partition_by` for `agg` and `group_by`.

Class diagram for updated SignalSchema and DataChain complex signal partitioning

```mermaid
classDiagram
    class DataChain {
        +agg(..., partition_by)
        +group_by(..., partition_by)
        -_process_complex_signal_column(column_name: str) list[Column]
    }
    class SignalSchema {
        +to_partial(*columns: str) SignalSchema
        +group_by(partition_by: Sequence[str], new_column: Sequence[Column]) SignalSchema
        +db_signals(name: Optional[str])
    }
    class ModelStore {
        +to_pydantic(val) Optional[type[BaseModel]]
        +is_partial(parent_type) bool
    }
    DataChain --> SignalSchema : uses
    SignalSchema --> ModelStore : uses
    DataChain --> ModelStore : uses
    class Column {
        +name: str
        +type: Any
    }
    class File {
        <<Pydantic BaseModel>>
        +name: str
        +size: int
        +path: str
    }
    SignalSchema --> File : handles complex signals
    DataChain --> Column : returns partition columns
```
Class diagram for recursive expansion of complex signals in partition_by

```mermaid
classDiagram
    class DataChain {
        -_process_complex_signal_column(column_name: str) list[Column]
    }
    class Column {
        +name: str
        +type: Any
    }
    class File {
        <<Pydantic BaseModel>>
        +name: str
        +size: int
        +path: str
    }
    DataChain --> File : expands fields
    DataChain --> Column : returns list[Column]
    File o-- Column : fields become columns
```
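As a minimal usage sketch of what the diagrams describe: `read_storage`, `agg`, and `File` are real DataChain names, but the bucket path and the aggregation itself are illustrative, not taken from the PR.

```python
import datachain as dc
from datachain import File

# Illustrative: count objects per File partition; the path is made up.
chain = (
    dc.read_storage("s3://my-bucket/images/")
    .agg(
        cnt=lambda file: [len(file)],  # aggregator receives one list per partition
        output=int,
        partition_by=File,  # new: the complex signal type, instead of listing columns
    )
)
```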
for more information, see https://pre-commit.ci
Hey @dmpetrov - I've reviewed your changes and they look great!
Prompt for AI Agents
Please address the comments from this code review:
## Individual Comments
### Comment 1
<location> `src/datachain/lib/dc/datachain.py:844` </location>
<code_context>
elif isinstance(col, Function):
column = col.get_column(self.signals_schema)
processed_partition_columns.append(column)
+ elif isinstance(col, type) and issubclass(col, DataModel):
+ # Handle complex signal types (like File, Image, etc.)
+ complex_columns = self._process_complex_signal_partition(col)
+ processed_partition_columns.extend(complex_columns)
else:
# Assume it's already a ColumnElement
</code_context>
<issue_to_address>
The handling of complex signal types assumes a single schema match.
Currently, only the first matching signal is used when multiple signals share the same DataModel type. Please clarify if this is intentional, and consider adding a warning or error if not.
</issue_to_address>
<suggested_fix>
<<<<<<< SEARCH
elif isinstance(col, type) and issubclass(col, DataModel):
# Handle complex signal types (like File, Image, etc.)
complex_columns = self._process_complex_signal_partition(col)
processed_partition_columns.extend(complex_columns)
=======
elif isinstance(col, type) and issubclass(col, DataModel):
# Handle complex signal types (like File, Image, etc.)
# Check for multiple matching signals in the schema
matching_signals = [
signal for signal in self.signals_schema
if isinstance(signal, col)
]
if len(matching_signals) > 1:
import warnings
warnings.warn(
f"Multiple signals found for DataModel type {col.__name__} in signals_schema. "
"Only the first match will be used. Please clarify if this is intentional."
)
if matching_signals:
complex_columns = self._process_complex_signal_partition(type(matching_signals[0]))
processed_partition_columns.extend(complex_columns)
else:
raise ValueError(f"No signals found for DataModel type {col.__name__} in signals_schema.")
>>>>>>> REPLACE
</suggested_fix>
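To make the reviewer's concern concrete, here is a small self-contained illustration of the ambiguity; the plain dict and class are stand-ins for the real SignalSchema internals:

```python
class File:  # stand-in for datachain's File signal type
    pass

# Hypothetical schema contents: two distinct signals share the same type,
# so `partition_by=File` cannot tell which one the user meant.
signals = {"source_file": File, "thumbnail": File}

matches = [name for name, typ in signals.items() if issubclass(typ, File)]
assert matches == ["source_file", "thumbnail"]  # ambiguous: two matches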
### Comment 2
<location> `src/datachain/lib/dc/datachain.py:966` </location>
<code_context>
+
+ # Generate column objects for each unique key
+ partition_columns = []
+ for key in unique_keys:
+ col_name = f"{signal_name}.{key}"
+ col_db_name = ColumnMeta.to_db_name(col_name)
+ try:
+ col_type = self.signals_schema.get_column_type(col_db_name)
+ column = Column(col_db_name, python_to_sql(col_type))
+ partition_columns.append(column)
+ except Exception:
+ # Skip columns that don't exist in the schema
+ continue
+
+ if not partition_columns:
</code_context>
<issue_to_address>
Silently skipping missing columns may hide schema issues.
Consider logging a warning or raising an error when expected unique keys are missing from the schema to aid debugging.
</issue_to_address>
<suggested_fix>
<<<<<<< SEARCH
for key in unique_keys:
col_name = f"{signal_name}.{key}"
col_db_name = ColumnMeta.to_db_name(col_name)
try:
col_type = self.signals_schema.get_column_type(col_db_name)
column = Column(col_db_name, python_to_sql(col_type))
partition_columns.append(column)
except Exception:
# Skip columns that don't exist in the schema
continue
=======
import logging
for key in unique_keys:
col_name = f"{signal_name}.{key}"
col_db_name = ColumnMeta.to_db_name(col_name)
try:
col_type = self.signals_schema.get_column_type(col_db_name)
column = Column(col_db_name, python_to_sql(col_type))
partition_columns.append(column)
except Exception:
logging.warning(
f"Unique key '{key}' (db column '{col_db_name}') not found in schema for signal '{signal_name}'."
)
continue
>>>>>>> REPLACE
</suggested_fix>
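A side note on the suggestion itself: importing `logging` at the point of use works, but the conventional pattern is a module-level logger with lazy %-style formatting. A self-contained sketch, where a plain dict stands in for the real schema lookup and the `__` naming is assumed:

```python
import logging

logger = logging.getLogger(__name__)  # defined once at module scope

def resolve_partition_columns(schema, signal_name, unique_keys):
    """Toy version of the loop from the diff, warning on missing keys."""
    columns = []
    for key in unique_keys:
        db_name = f"{signal_name}__{key}"  # assumed dot-to-db naming
        if db_name not in schema:
            # %r args are only formatted if the record is actually emitted
            logger.warning(
                "Unique key %r (db column %r) not found in schema for signal %r",
                key, db_name, signal_name,
            )
            continue
        columns.append((db_name, schema[db_name]))
    return columns

# A warning is logged for "etag", which the toy schema lacks:
print(resolve_partition_columns({"file__path": str}, "file", ["path", "etag"]))
```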
### Comment 3
<location> `src/datachain/lib/dc/datachain.py:977` </location>
<code_context>
+ # Skip columns that don't exist in the schema
+ continue
+
+ if not partition_columns:
+ raise ValueError(
+ f"No valid partition columns found for signal type {signal_type}"
</code_context>
<issue_to_address>
Raising ValueError for no valid partition columns is appropriate but could be more informative.
Consider adding the signal name and attempted unique keys to the error message for easier debugging.
</issue_to_address>
<suggested_fix>
<<<<<<< SEARCH
if not partition_columns:
raise ValueError(
f"No valid partition columns found for signal type {signal_type}"
)
=======
if not partition_columns:
raise ValueError(
f"No valid partition columns found for signal type '{signal_type}', signal name '{signal_name}'. "
f"Attempted unique keys: {unique_keys}"
)
>>>>>>> REPLACE
</suggested_fix>
src/datachain/lib/dc/datachain.py
Outdated

```python
        )

        # Find the signal name in the schema that matches this type
        signal_name = None
```

issue (code-quality): Use the built-in function `next` instead of a for-loop (use-next)
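For reference, this is the `use-next` rewrite the linter has in mind; the data below is a toy stand-in, since the real loop body is not shown in the snippet:

```python
schema_items = [("meta", dict), ("file", str)]  # toy (name, type) pairs
target_type = str

# for-loop form, as in the diff:
signal_name = None
for name, typ in schema_items:
    if typ is target_type:
        signal_name = name
        break

# next() form the linter suggests -- same result, no mutable sentinel:
signal_name = next((n for n, t in schema_items if t is target_type), None)
assert signal_name == "file"
```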
Deploying datachain-documentation with Cloudflare Pages

Latest commit: 09b8a6e
Status: ✅ Deploy successful!
Preview URL: https://06b72849.datachain-documentation.pages.dev
Branch Preview URL: https://cursor-propose-pr-for-comple.datachain-documentation.pages.dev
Codecov Report

Attention: Patch coverage is …

Additional details and impacted files

```
@@           Coverage Diff           @@
##             main    #1215   +/-   ##
=======================================
  Coverage   88.68%   88.69%
=======================================
  Files         152      152
  Lines       13606    13680     +74
  Branches     1893     1909     +16
=======================================
+ Hits        12067    12133     +66
- Misses       1093     1099      +6
- Partials      446      448      +2
```

Flags with carried forward coverage won't be shown.
Hey @dmpetrov - I've reviewed your changes - here's some feedback:
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- Consider refactoring the complex signal expansion logic in `_process_complex_signal_column` and `group_by` into smaller shared helpers to reduce duplication and improve maintainability.
- The `to_partial` method handles many nested edge cases in one large function—splitting it into dedicated helpers (e.g., full signal vs partial field handling) would make it more readable and testable.
- Deeply nested model expansions can generate very wide queries; adding a safeguard or configuration to limit column explosion could help prevent potential performance issues.
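On the last point, a depth guard during recursive expansion might look like the sketch below; `max_depth` and the dict-based traversal are hypothetical, not existing DataChain options:

```python
def expand_fields(model_fields, prefix="", depth=0, max_depth=3):
    """Recursively flatten nested fields, refusing to expand past max_depth."""
    if depth > max_depth:
        raise ValueError(f"partition_by expansion exceeds max depth {max_depth}")
    columns = []
    for name, sub in model_fields.items():
        path = f"{prefix}{name}"
        if isinstance(sub, dict):  # nested model, represented as a dict here
            columns += expand_fields(sub, f"{path}__", depth + 1, max_depth)
        else:
            columns.append(path)
    return columns

# Toy nested model: a File-like signal with a nested "meta" model
fields = {"path": str, "meta": {"width": int, "height": int}}
print(expand_fields(fields))  # ['path', 'meta__width', 'meta__height']
```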
## Individual Comments
### Comment 1
<location> `src/datachain/lib/signal_schema.py:592` </location>
<code_context>
]
if name:
+ if "." in name:
+ name = name.replace(".", "__")
+
signals = [
</code_context>
<issue_to_address>
Replacing dots with double underscores in column names may cause collisions if original column names contain double underscores.
Recommend choosing a delimiter or escaping method that ensures unique column names, even if original names contain double underscores.
</issue_to_address>
<suggested_fix>
<<<<<<< SEARCH
if name:
if "." in name:
name = name.replace(".", "__")
=======
if name:
# Escape underscores and dots to avoid collisions
name = name.replace("_", "__UND__").replace(".", "__DOT__")
>>>>>>> REPLACE
</suggested_fix>
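The collision the reviewer describes is easy to reproduce; the helper below mimics only the dot-to-underscore mapping, not the real `ColumnMeta.to_db_name`:

```python
def to_db_name(name: str) -> str:
    # simplified stand-in for the "." -> "__" mapping being reviewed
    return name.replace(".", "__")

# A field literally named "meta__text" and a nested "meta.text" collide:
assert to_db_name("laion.meta__text") == "laion__meta__text"
assert to_db_name("laion.meta.text") == "laion__meta__text"  # same db name
```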
### Comment 2
<location> `src/datachain/lib/signal_schema.py:945` </location>
<code_context>
partial_versions: dict[str, int] = {}
def _type_name_to_partial(signal_name: str, type_name: str) -> str:
- if "@" not in type_name:
+ # Check if we need to create a partial for this type
+ # Only create partials for custom types that are in the custom_types dict
+ if type_name not in custom_types:
return type_name
- model_name, _ = ModelStore.parse_name_version(type_name)
</code_context>
<issue_to_address>
The check for custom_types before handling '@' in type_name may lead to inconsistent partial type handling.
Please review the logic to ensure partial types are created consistently, even for types that might be added to custom_types in the future.
</issue_to_address>
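The behavioral difference the reviewer flags can be reduced to the two guard conditions visible in the diff; everything else here is a stand-in:

```python
def old_check(type_name: str) -> bool:
    return "@" not in type_name  # True -> returned as-is (no partial built)

def new_check(type_name: str, custom_types: dict) -> bool:
    return type_name not in custom_types  # True -> returned as-is

# A versioned type name missing from custom_types changes behavior:
name = "File@v1"
print(old_check(name))       # False: old code would go on to build a partial
print(new_check(name, {}))   # True: new code returns it unchanged
```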
```diff
@@ -969,7 +1022,7 @@ def select_except(self, *args: str) -> "Self":
         )

     @delta_disabled  # type: ignore[arg-type]
-    def group_by(
+    def group_by(  # noqa: C901, PLR0912
```

issue (code-quality): We've found these issues:

- Hoist repeated code outside conditional statement [×3] (hoist-statement-from-if)
- Low code quality found in DataChain.group_by - 14% (low-code-quality)

Explanation

The quality score for this function is below the quality threshold of 25%. This score is a combination of the method length, cognitive complexity and working memory.

How can you solve this?

It might be worth refactoring this function to make it shorter and more readable.

- Reduce the function length by extracting pieces of functionality out into their own functions. This is the most important thing you can do - ideally a function should be less than 10 lines.
- Reduce nesting, perhaps by introducing guard clauses to return early.
- Ensure that variables are tightly scoped, so that code using related concepts sits together within the function rather than being scattered.
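One possible shape for the refactor the linter asks for: guard clauses in a small helper so `group_by` itself stays flat. All names here are hypothetical stand-ins, not the real DataChain internals:

```python
class ToySchema:
    """Minimal stand-in for SignalSchema, just for this sketch."""

    def column(self, name: str) -> str:
        return name

    def expand_complex(self, typ: type) -> list[str]:
        # pretend every complex type contributes two key columns
        return [f"{typ.__name__.lower()}__{f}" for f in ("path", "etag")]

def _partition_columns(col, schema):
    # one guard clause per supported partition_by form
    if isinstance(col, str):
        return [schema.column(col)]
    if isinstance(col, type):
        return schema.expand_complex(col)
    return [col]  # assume an already-built column expression

class File:  # placeholder complex signal type
    pass

cols = [c for col in ("name", File) for c in _partition_columns(col, ToySchema())]
assert cols == ["name", "file__path", "file__etag"]
```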
Looks good to me 👍
Allow complex signals (e.g., `File` type) in `agg`'s `partition_by` parameter for improved usability (resolves #1211).

ToDo:

PS: it turns out the problem was much more complicated, since group_by changes the schema and the changes become tricky when a nested column is used in `partition_by` - see `to_partial()` and related.

Slack Thread
Summary by Sourcery

Add support for using complex signal types (DataModel subclasses) directly in the `partition_by` parameter of the `agg` method.

New Features:
- Support complex signal types in `partition_by` in `agg`.
- Add `_process_complex_signal_partition` to resolve complex signal types into their unique identifier columns.

Enhancements:
- Extend `PartitionByType` to include `type` for complex signals in the dataset query definitions.
- Update `agg` logic and docstrings to integrate complex signal handling alongside existing column-based partitioning.

Documentation:
- Update the `agg` method documentation with examples demonstrating file-based partitioning.

Tests:
Summary by Sourcery
Enable using complex DataModel signals directly in partition_by for both agg and group_by by expanding nested model fields into SQL columns and extending schema partials, enrich documentation with examples, and add comprehensive tests for various partitioning scenarios.
New Features:
Enhancements:
Documentation:
Tests: