-
Notifications
You must be signed in to change notification settings - Fork 117
Update docs + add more tests #1233
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Reviewer's GuideThis PR introduces a new File-Level Changes
Tips and commandsInteracting with Sourcery
Customizing Your ExperienceAccess your dashboard to:
Getting Help
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @dreadatour - I've reviewed your changes - here's some feedback:
- Many of the newly added
test_count
,test_distinct
, andtest_filter
functions follow the same pattern and could be consolidated with pytest.mark.parametrize to reduce duplication and improve readability. - There aren’t any tests covering the new
reset_schema
,add_schema
,remove_file_signals
, orchunk
methods—adding tests for those would help verify their behavior. - Docstrings currently mix ‘Args’ and ‘Parameters’ styles—please pick one convention and apply it consistently across all methods.
Prompt for AI Agents
Please address the comments from this code review:
## Overall Comments
- Many of the newly added `test_count`, `test_distinct`, and `test_filter` functions follow the same pattern and could be consolidated with pytest.mark.parametrize to reduce duplication and improve readability.
- There aren’t any tests covering the new `reset_schema`, `add_schema`, `remove_file_signals`, or `chunk` methods—adding tests for those would help verify their behavior.
- Docstrings currently mix ‘Args’ and ‘Parameters’ styles—please pick one convention and apply it consistently across all methods.
## Individual Comments
### Comment 1
<location> `src/datachain/func/conditional.py:294` </location>
<code_context>
return Func("and", inner=sql_and, cols=cols, args=func_args, result_type=bool)
+
+
+def not_(arg: Union[ColumnElement, Func]) -> Func:
+ """
+ Returns the function that produces NOT of the given expressions.
</code_context>
<issue_to_address>
The handling of string arguments in not_ may be inconsistent with and_/or_.
In not_, both strings and Funcs are added to 'cols', unlike and_ where only strings go to 'cols' and Funcs to 'func_args'. Please standardize argument handling for consistency.
</issue_to_address>
### Comment 2
<location> `tests/unit/lib/test_datachain.py:4291` </location>
<code_context>
+def test_column_compute(test_session):
</code_context>
<issue_to_address>
Missing edge case: sum/avg/min/max on empty columns or all-non-numeric columns.
Please add tests to verify how these aggregation methods behave with empty or all non-numeric columns, ensuring they handle such cases appropriately.
</issue_to_address>
### Comment 3
<location> `tests/unit/lib/test_datachain.py:3019` </location>
<code_context>
+ assert chain.to_values("numbers") == [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
+
+
+def test_filter_with_func_operations(test_session):
+ """Test filter with datachain.func operations."""
+ from datachain.func import string
+
+ chain = dc.read_values(
+ names=["Alice", "Bob", "Charlie", "David", "Eva"],
+ ages=[25, 30, 35, 40, 45],
+ session=test_session,
+ )
+
+ # Test string length filter
+ filtered_chain = chain.filter(string.length(C("names")) > 4)
+ assert filtered_chain.count() == 3
+ assert filtered_chain.to_values("names") == ["Alice", "Charlie", "David"]
+
+
</code_context>
<issue_to_address>
Consider adding filter tests for null/missing values.
Adding such tests will help ensure the filter logic correctly handles null or missing values without errors and produces the expected results.
</issue_to_address>
<suggested_fix>
<<<<<<< SEARCH
def test_filter_with_func_operations(test_session):
"""Test filter with datachain.func operations."""
from datachain.func import string
chain = dc.read_values(
names=["Alice", "Bob", "Charlie", "David", "Eva"],
ages=[25, 30, 35, 40, 45],
session=test_session,
)
# Test string length filter
filtered_chain = chain.filter(string.length(C("names")) > 4)
assert filtered_chain.count() == 3
assert filtered_chain.to_values("names") == ["Alice", "Charlie", "David"]
=======
def test_filter_with_func_operations(test_session):
"""Test filter with datachain.func operations."""
from datachain.func import string
chain = dc.read_values(
names=["Alice", "Bob", "Charlie", "David", "Eva"],
ages=[25, 30, 35, 40, 45],
session=test_session,
)
# Test string length filter
filtered_chain = chain.filter(string.length(C("names")) > 4)
assert filtered_chain.count() == 3
assert filtered_chain.to_values("names") == ["Alice", "Charlie", "David"]
def test_filter_with_null_values(test_session):
"""Test filter operations with null/missing values."""
from datachain.func import string
# Include None (null) and missing values
chain = dc.read_values(
names=["Alice", None, "Charlie", "", "Eva", None],
ages=[25, 30, None, 40, 45, None],
session=test_session,
)
# Filter out rows where names is None
filtered_chain = chain.filter(C("names") != None)
assert filtered_chain.to_values("names") == ["Alice", "Charlie", "", "Eva"]
# Filter for rows where ages is None
null_ages_chain = chain.filter(C("ages") == None)
assert null_ages_chain.to_values("names") == ["Charlie", None]
# Filter for non-empty, non-null names with length > 0
non_empty_names_chain = chain.filter((C("names") != None) & (string.length(C("names")) > 0))
assert non_empty_names_chain.to_values("names") == ["Alice", "Charlie", "Eva"]
# Filter for rows where names is missing or empty
missing_or_empty_names_chain = chain.filter((C("names") == None) | (string.length(C("names")) == 0))
assert missing_or_empty_names_chain.to_values("names") == [None, "" , None]
>>>>>>> REPLACE
</suggested_fix>
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR introduces a new logical not_
function, expands and fixes tests for schema resolution and DataChain operations, updates documentation for several DataChain methods, and patches a bug in the signal lookup logic.
- Add and test
dc.func.not_
alongside existing boolean functions. - Revise
_find_in_tree
inSignalSchema
to better handle unmatched paths. - Enhance docs for schema management and aggregation methods; add comprehensive tests for
count
,distinct
, andfilter
.
Reviewed Changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.
Show a summary per file
File | Description |
---|---|
tests/unit/test_func.py | Import and test the new not_ function with SQLite skips |
tests/unit/lib/test_signal_schema.py | Parameterized test_resolve_error to cover more invalid cases |
tests/unit/lib/test_datachain.py | Bulk addition of count , distinct , filter , and aggregation tests |
tests/func/functions/test_conditional.py | Rename and extend conditional logic test to include not_ |
src/datachain/lib/signal_schema.py | Refactor _find_in_tree to unify dotted-path lookup and error handling |
src/datachain/lib/dc/datachain.py | Document reset_schema , add_schema , remove_file_signals , sum , avg , min , max , chunk ; import StandardType |
src/datachain/func/conditional.py | Implement new not_ function wrapping SQLAlchemy’s not_ |
src/datachain/func/init.py | Expose not_ in the public API import list |
Comments suppressed due to low confidence (2)
src/datachain/func/conditional.py:294
- The docstring describes support for string column names, but the signature omits
str
. Change the annotation toUnion[str, ColumnElement, Func]
to match intent and keep consistency withand_
andor_
.
def not_(arg: Union[ColumnElement, Func]) -> Func:
src/datachain/lib/dc/datachain.py:369
- [nitpick] The docstring lists parameters but omits a Returns section. Add a
Returns:
entry (e.g.,Self
) for clarity and consistency with other methods.
def reset_schema(self, signals_schema: SignalSchema) -> "Self":
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## main #1233 +/- ##
==========================================
+ Coverage 88.66% 88.70% +0.03%
==========================================
Files 153 153
Lines 13793 13792 -1
Branches 1927 1928 +1
==========================================
+ Hits 12230 12234 +4
+ Misses 1109 1103 -6
- Partials 454 455 +1
Flags with carried forward coverage won't be shown. Click here to find out more.
🚀 New features to boost your workflow:
|
Deploying datachain-documentation with
|
Latest commit: |
b4576b4
|
Status: | ✅ Deploy successful! |
Preview URL: | https://ae362322.datachain-documentation.pages.dev |
Branch Preview URL: | https://docs-tests-update.datachain-documentation.pages.dev |
"""Compute the minimum of a column. | ||
|
||
Parameters: | ||
col: The column to compute the minimum for. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we inline some example of how that column can look like?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated (for sum
, avg
, min
and max
methods), please, take a look.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(some modifications might be still needed - e.g. hiding removing weird methods)
Tiny improvements:
dc.func.not_
function + testsreset_schema
,add_schema
,remove_file_signals
,sum
,avg
,min
,max
andchunk
(+ very minor updates in some other methods)SignalSchema._find_in_tree
+ add more tests for this methodcount
,distinct
andfilter
Summary by Sourcery
Add new conditional helper, improve schema resolution, document and implement aggregation methods, and expand test coverage across DataChain operations
New Features:
Bug Fixes:
Enhancements:
Tests: