Update docs + add more tests #1233

dreadatour · 2025-07-13T06:25:03Z

Tiny improvements:

Add dc.func.not_ function + tests
Add/update docs for DataChain methods: reset_schema, add_schema, remove_file_signals, sum, avg, min, max and chunk (+ very minor updates in some other methods)
Fix bug in SignalSchema._find_in_tree + add more tests for this method
Add more tests for DataChain methods: count, distinct and filter

Summary by Sourcery

Add new conditional helper, improve schema resolution, document and implement aggregation methods, and expand test coverage across DataChain operations

New Features:

Add dc.func.not_ conditional function

Bug Fixes:

Fix SignalSchema._find_in_tree lookup logic and extend error handling

Enhancements:

Enhance and unify docstrings for DataChain methods reset_schema, add_schema, remove_file_signals, sum, avg, min, max, and chunk
Expose not_ in func module __all__

Tests:

Add extensive unit tests for DataChain count, distinct, filter, and aggregation methods
Add tests for the not_ function in mutates and conditional functions
Parameterize resolve error scenarios in SignalSchema tests
Add test_column_compute covering sum, avg, min, and max operations on nested data


sourcery-ai · 2025-07-13T06:25:08Z

Reviewer's Guide

This PR introduces a new not_ conditional function with accompanying exports and tests, and enriches and standardizes docstrings across various DataChain methods. It also corrects a traversal bug in SignalSchema._find_in_tree with improved error handling, adds parameterized error tests, and significantly expands unit test coverage for counting, distinct, filtering, and aggregation operations.

File-Level Changes

Change	Details	Files
Add `not_` function support in datachain.func	Implement `not_` in conditional.py to wrap SQL NOT logic; expose `not_` in func/__init__.py; add unit tests for `not_` in mutate operations; extend conditional function tests to include `not_` cases	`src/datachain/func/conditional.py`, `src/datachain/func/__init__.py`, `tests/unit/lib/test_func.py`, `tests/func/functions/test_conditional.py`
Enhance documentation for DataChain methods	Expand docstrings and parameter sections for schema operations (`reset_schema`, `add_schema`, `remove_file_signals`); standardize parameter and return descriptions for data conversion methods (`to_pandas`, `show`); add detailed doc comments for aggregation and utility methods (`sum`, `avg`, `min`, `max`, `sample`, `chunk`, `to_list`, `to_values`)	`src/datachain/lib/dc/datachain.py`
Fix bug in SignalSchema tree traversal	Simplify direct path lookup in `_find_in_tree` and refine traversal loop; tighten error condition to ensure full path consumption; remove outdated special-case branch; introduce parameterized tests for resolve errors across invalid input scenarios	`src/datachain/lib/signal_schema.py`, `tests/unit/lib/test_signal_schema.py`
Broaden test coverage for core DataChain operations	Add comprehensive tests for `count` across basic, complex, chained, and in-memory scenarios; introduce extensive `distinct` tests over simple, nested, multi-column, and error cases; expand `filter` tests covering comparisons, patterns, logical operators, chaining, and complex objects; include validation of aggregation results (`sum`, `avg`, `min`, `max`) on nested data models	`tests/unit/lib/test_datachain.py`
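The "ensure full path consumption" condition in the traversal fix can be illustrated with a small sketch. This is not the actual `SignalSchema._find_in_tree` code: the nested-dict schema and the `find_in_tree` helper below are hypothetical stand-ins, used only to show why a lookup should fail when path parts remain after reaching a leaf.

```python
from typing import Any, Optional


def find_in_tree(tree: dict[str, Any], path: list[str]) -> Optional[Any]:
    """Return the node at `path`, or None unless every part is consumed."""
    node: Any = tree
    for part in path:
        # A leaf (non-dict) cannot be descended into, so any remaining
        # path parts mean the lookup failed rather than partially matched.
        if not isinstance(node, dict) or part not in node:
            return None
        node = node[part]
    return node


# Hypothetical schema: nested dicts with Python types as leaves.
schema = {"file": {"path": str, "size": int}, "label": str}

print(find_in_tree(schema, ["file", "size"]))    # <class 'int'>
print(find_in_tree(schema, ["label", "extra"]))  # None
```

A caller could then turn a `None` result into a resolve error, which is the kind of invalid-input case the parameterized tests are described as covering.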


sourcery-ai

Hey @dreadatour - I've reviewed your changes - here's some feedback:

Many of the newly added test_count, test_distinct, and test_filter functions follow the same pattern and could be consolidated with pytest.mark.parametrize to reduce duplication and improve readability.
There aren’t any tests covering the new reset_schema, add_schema, remove_file_signals, or chunk methods—adding tests for those would help verify their behavior.
Docstrings currently mix ‘Args’ and ‘Parameters’ styles—please pick one convention and apply it consistently across all methods.
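As a rough illustration of the first suggestion, several near-identical count tests could collapse into one parametrized test. This is only a sketch: `count_matching` is a hypothetical stand-in for a chain's `filter(...).count()` call, not DataChain API, so that the example stays self-contained.

```python
import pytest


def count_matching(values, predicate):
    """Hypothetical stand-in for chain.filter(...).count()."""
    return sum(1 for v in values if predicate(v))


@pytest.mark.parametrize(
    "values, predicate, expected",
    [
        ([1, 2, 3, 4, 5], lambda v: v > 2, 3),   # basic comparison
        ([1, 2, 3, 4, 5], lambda v: v > 10, 0),  # nothing matches
        ([], lambda v: True, 0),                 # empty input
    ],
    ids=["basic", "no-match", "empty"],
)
def test_count(values, predicate, expected):
    assert count_matching(values, predicate) == expected
```

Each tuple becomes its own test case with a readable id, so a failure still points at one specific scenario.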


## Individual Comments

### Comment 1
<location> `src/datachain/func/conditional.py:294` </location>
<code_context>
     return Func("and", inner=sql_and, cols=cols, args=func_args, result_type=bool)
+
+
+def not_(arg: Union[ColumnElement, Func]) -> Func:
+    """
+    Returns the function that produces NOT of the given expressions.
</code_context>

<issue_to_address>
The handling of string arguments in not_ may be inconsistent with and_/or_.

In not_, both strings and Funcs are added to 'cols', unlike and_ where only strings go to 'cols' and Funcs to 'func_args'. Please standardize argument handling for consistency.
</issue_to_address>
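The routing that this comment describes for `and_` (strings into `cols`, nested functions into `func_args`) might look roughly like the sketch below. `FakeFunc` and `split_args` are hypothetical stand-ins introduced for illustration, not the real `Func` class or any helper in `conditional.py`, and `ColumnElement` arguments are left out to keep the example free of third-party imports.

```python
from dataclasses import dataclass, field
from typing import Union


@dataclass
class FakeFunc:
    """Hypothetical stand-in for datachain's Func; not the real class."""

    name: str
    cols: list = field(default_factory=list)
    args: list = field(default_factory=list)


def split_args(*args: Union[str, FakeFunc]) -> tuple[list, list]:
    """Route column names to `cols` and nested functions to `func_args`."""
    cols, func_args = [], []
    for arg in args:
        if isinstance(arg, str):
            cols.append(arg)
        else:
            func_args.append(arg)
    return cols, func_args


def not_(arg: Union[str, FakeFunc]) -> FakeFunc:
    """Build a NOT wrapper using the same routing as the other helpers."""
    cols, func_args = split_args(arg)
    return FakeFunc("not", cols=cols, args=func_args)
```

Sharing one routing helper between the boolean functions is one way to keep their argument handling from drifting apart.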

### Comment 2
<location> `tests/unit/lib/test_datachain.py:4291` </location>
<code_context>
+def test_column_compute(test_session):
</code_context>

<issue_to_address>
Missing edge case: sum/avg/min/max on empty columns or all-non-numeric columns.

Please add tests to verify how these aggregation methods behave with empty or all non-numeric columns, ensuring they handle such cases appropriately.
</issue_to_address>
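For context on the empty-column case, the sketch below shows what plain SQLite returns for these aggregates, using only the standard-library `sqlite3` module. It says nothing about how the DataChain methods surface the result (returning `None`, raising, or converting), which is exactly what such tests would need to pin down.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER)")

# No rows inserted: every aggregate below sees an empty column.
empty = conn.execute("SELECT SUM(x), AVG(x), MIN(x), MAX(x) FROM t").fetchone()
print(empty)  # (None, None, None, None)

conn.executemany("INSERT INTO t VALUES (?)", [(1,), (None,), (3,)])
# NULLs are skipped by the aggregates rather than treated as zero.
mixed = conn.execute("SELECT SUM(x), AVG(x), MIN(x), MAX(x) FROM t").fetchone()
print(mixed)  # (4, 2.0, 1, 3)
```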

### Comment 3
<location> `tests/unit/lib/test_datachain.py:3019` </location>
<code_context>
+    assert chain.to_values("numbers") == [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
+
+
+def test_filter_with_func_operations(test_session):
+    """Test filter with datachain.func operations."""
+    from datachain.func import string
+
+    chain = dc.read_values(
+        names=["Alice", "Bob", "Charlie", "David", "Eva"],
+        ages=[25, 30, 35, 40, 45],
+        session=test_session,
+    )
+
+    # Test string length filter
+    filtered_chain = chain.filter(string.length(C("names")) > 4)
+    assert filtered_chain.count() == 3
+    assert filtered_chain.to_values("names") == ["Alice", "Charlie", "David"]
+
+
</code_context>

<issue_to_address>
Consider adding filter tests for null/missing values.

Adding such tests will help ensure the filter logic correctly handles null or missing values without errors and produces the expected results.
</issue_to_address>

<suggested_fix>
<<<<<<< SEARCH
def test_filter_with_func_operations(test_session):
    """Test filter with datachain.func operations."""
    from datachain.func import string

    chain = dc.read_values(
        names=["Alice", "Bob", "Charlie", "David", "Eva"],
        ages=[25, 30, 35, 40, 45],
        session=test_session,
    )

    # Test string length filter
    filtered_chain = chain.filter(string.length(C("names")) > 4)
    assert filtered_chain.count() == 3
    assert filtered_chain.to_values("names") == ["Alice", "Charlie", "David"]


=======
def test_filter_with_func_operations(test_session):
    """Test filter with datachain.func operations."""
    from datachain.func import string

    chain = dc.read_values(
        names=["Alice", "Bob", "Charlie", "David", "Eva"],
        ages=[25, 30, 35, 40, 45],
        session=test_session,
    )

    # Test string length filter
    filtered_chain = chain.filter(string.length(C("names")) > 4)
    assert filtered_chain.count() == 3
    assert filtered_chain.to_values("names") == ["Alice", "Charlie", "David"]


def test_filter_with_null_values(test_session):
    """Test filter operations with null/missing values."""
    from datachain.func import string

    # Include None (null) and missing values
    chain = dc.read_values(
        names=["Alice", None, "Charlie", "", "Eva", None],
        ages=[25, 30, None, 40, 45, None],
        session=test_session,
    )

    # Filter out rows where names is None
    filtered_chain = chain.filter(C("names") != None)
    assert filtered_chain.to_values("names") == ["Alice", "Charlie", "", "Eva"]

    # Filter for rows where ages is None
    null_ages_chain = chain.filter(C("ages") == None)
    assert null_ages_chain.to_values("names") == ["Charlie", None]

    # Filter for non-empty, non-null names with length > 0
    non_empty_names_chain = chain.filter(
        (C("names") != None) & (string.length(C("names")) > 0)
    )
    assert non_empty_names_chain.to_values("names") == ["Alice", "Charlie", "Eva"]

    # Filter for rows where names is missing or empty
    missing_or_empty_names_chain = chain.filter(
        (C("names") == None) | (string.length(C("names")) == 0)
    )
    assert missing_or_empty_names_chain.to_values("names") == [None, "", None]
>>>>>>> REPLACE

</suggested_fix>
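The reason the suggested test combines a null check with a length check can be seen in SQL's three-valued logic: a comparison or function applied to NULL yields NULL, and such rows drop out of a filter. The sketch below demonstrates this with the standard-library `sqlite3` module only. It assumes, without confirmation from this thread, that `C("names") == None` compiles to `IS NULL` as in SQLAlchemy; the `names` helper and the `people` table are illustrative.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT)")
conn.executemany(
    "INSERT INTO people VALUES (?)",
    [("Alice",), (None,), ("",), ("Eva",)],
)


def names(where: str) -> list:
    """Return names matching `where`, in insertion order."""
    rows = conn.execute(f"SELECT name FROM people WHERE {where} ORDER BY rowid")
    return [row[0] for row in rows]


print(names("name IS NOT NULL"))  # ['Alice', '', 'Eva']
print(names("length(name) = 0"))  # ['']
print(names("name != 'Alice'"))   # ['', 'Eva']
```

Note that the NULL row is absent from the last two results: `length(NULL)` and `NULL != 'Alice'` both evaluate to NULL, not to true.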


Copilot

Pull Request Overview

This PR introduces a new logical not_ function, expands and fixes tests for schema resolution and DataChain operations, updates documentation for several DataChain methods, and patches a bug in the signal lookup logic.

Add and test dc.func.not_ alongside existing boolean functions.
Revise _find_in_tree in SignalSchema to better handle unmatched paths.
Enhance docs for schema management and aggregation methods; add comprehensive tests for count, distinct, and filter.

Reviewed Changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.


File	Description
tests/unit/test_func.py	Import and test the new `not_` function with SQLite skips
tests/unit/lib/test_signal_schema.py	Parameterized `test_resolve_error` to cover more invalid cases
tests/unit/lib/test_datachain.py	Bulk addition of `count`, `distinct`, `filter`, and aggregation tests
tests/func/functions/test_conditional.py	Rename and extend conditional logic test to include `not_`
src/datachain/lib/signal_schema.py	Refactor `_find_in_tree` to unify dotted-path lookup and error handling
src/datachain/lib/dc/datachain.py	Document `reset_schema`, `add_schema`, `remove_file_signals`, `sum`, `avg`, `min`, `max`, `chunk`; import `StandardType`
src/datachain/func/conditional.py	Implement new `not_` function wrapping SQLAlchemy’s `not_`
src/datachain/func/__init__.py	Expose `not_` in the public API import list

Comments suppressed due to low confidence (2)

src/datachain/func/conditional.py:294

The docstring describes support for string column names, but the signature omits str. Change the annotation to Union[str, ColumnElement, Func] to match intent and keep consistency with and_ and or_.

def not_(arg: Union[ColumnElement, Func]) -> Func:

src/datachain/lib/dc/datachain.py:369

[nitpick] The docstring lists parameters but omits a Returns section. Add a Returns: entry (e.g., Self) for clarity and consistency with other methods.

    def reset_schema(self, signals_schema: SignalSchema) -> "Self":

tests/unit/lib/test_signal_schema.py

codecov · 2025-07-13T06:31:11Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 88.70%. Comparing base (8b3c25a) to head (b4576b4).
Report is 2 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1233      +/-   ##
==========================================
+ Coverage   88.66%   88.70%   +0.03%     
==========================================
  Files         153      153              
  Lines       13793    13792       -1     
  Branches     1927     1928       +1     
==========================================
+ Hits        12230    12234       +4     
+ Misses       1109     1103       -6     
- Partials      454      455       +1
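The figures in this table can be cross-checked with a little arithmetic. The sketch below assumes the percentages are truncated rather than rounded to two decimals, which is an inference from the reported numbers (12230 / 13793 is about 88.668%, shown as 88.66%), not something stated in the report.

```python
def pct_truncated(hits: int, lines: int) -> float:
    """Coverage percentage truncated (not rounded) to two decimals."""
    return (hits * 10000 // lines) / 100


base = {"lines": 13793, "hits": 12230, "misses": 1109, "partials": 454}
head = {"lines": 13792, "hits": 12234, "misses": 1103, "partials": 455}

for row in (base, head):
    # Every line is counted exactly once as a hit, miss, or partial.
    assert row["hits"] + row["misses"] + row["partials"] == row["lines"]

print(pct_truncated(base["hits"], base["lines"]))  # 88.66
print(pct_truncated(head["hits"], head["lines"]))  # 88.7
```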

Flag	Coverage Δ
datachain	`88.63% <100.00%> (+0.03%)`	⬆️


Files with missing lines	Coverage Δ
src/datachain/func/__init__.py	`100.00% <100.00%> (ø)`
src/datachain/func/conditional.py	`100.00% <100.00%> (ø)`
src/datachain/lib/dc/datachain.py	`91.40% <100.00%> (+1.00%)`	⬆️
src/datachain/lib/pytorch.py	`88.80% <100.00%> (+0.09%)`	⬆️
src/datachain/lib/signal_schema.py	`96.10% <100.00%> (ø)`

... and 1 file with indirect coverage changes


cloudflare-workers-and-pages · 2025-07-13T11:50:50Z

Deploying datachain-documentation with Cloudflare Pages

Latest commit:	`b4576b4`
Status:	✅ Deploy successful!
Preview URL:	https://ae362322.datachain-documentation.pages.dev
Branch Preview URL:	https://docs-tests-update.datachain-documentation.pages.dev


src/datachain/lib/dc/datachain.py

shcheklein · 2025-07-13T16:57:33Z

src/datachain/lib/dc/datachain.py

+        """Compute the minimum of a column.
+
+        Parameters:
+            col: The column to compute the minimum for.


Can we inline an example of what that column can look like?

Updated (for the sum, avg, min and max methods); please take a look.
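For readers who have not seen the updated docstrings, an inline column example of the kind requested might look like the sketch below. It is not the text that was merged: `TinyChain` is a hypothetical stand-in class, and the dotted name `"file.size"` for a nested column is an assumption made for illustration.

```python
class TinyChain:
    """Hypothetical stand-in for a chain of records; not DataChain itself."""

    def __init__(self, rows: list[dict]):
        self._rows = rows

    def min(self, col: str):
        """Compute the minimum of a column.

        Parameters:
            col: The column to compute the minimum for. Nested fields
                use dotted names, for example "file.size".

        Returns:
            The smallest value found in the column.

        Example:
            >>> rows = [{"file": {"size": 3}}, {"file": {"size": 1}}]
            >>> TinyChain(rows).min("file.size")
            1
        """
        values = []
        for row in self._rows:
            value = row
            # Walk the dotted path one field at a time.
            for part in col.split("."):
                value = value[part]
            values.append(value)
        # Inside the method body, `min` resolves to the builtin.
        return min(values)
```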

src/datachain/lib/dc/datachain.py

shcheklein

(some modifications might still be needed, e.g. hiding or removing weird methods)


dreadatour added 6 commits July 13, 2025 13:22

Update DataChain docstrings + add tests for sum, avg, min and max chain functions

fa5d711

Add more tests for DataChain.count method

9b33578

Fix SignalSchema._find_in_tree + add more tests

14da78e

Add more tests for DataChain.distinct method

7f915a7

Add 'not_' conditional function

158ea28

Add more tests for DataChain.filter method

de24f0c

dreadatour requested review from shcheklein, dmpetrov, a team and Copilot July 13, 2025 06:25

dreadatour self-assigned this Jul 13, 2025

sourcery-ai bot reviewed Jul 13, 2025

Copilot AI reviewed Jul 13, 2025

tests/unit/lib/test_signal_schema.py (resolved)

Update DataChain methods typings to fix 'mkdocs build'

be87e4f

dreadatour added 2 commits July 13, 2025 23:14

Fix tests for SaaS

0c85fc5

Fix tests for SaaS

57df8da

shcheklein reviewed Jul 13, 2025

src/datachain/lib/dc/datachain.py (outdated, resolved)

shcheklein reviewed Jul 13, 2025

src/datachain/lib/dc/datachain.py (outdated, resolved)

Add usage examples for sum, avg, min and max DataChain methods

ceeacea

shcheklein approved these changes Jul 13, 2025


dreadatour added 2 commits July 14, 2025 00:28

Remove unused DataChain methods: reset_schema, add_schema and remove_file_signals

99f439d

Add missing test case for 'not_' function

b4576b4

dreadatour merged commit 08c49ca into main Jul 14, 2025
58 of 59 checks passed

dreadatour deleted the docs-tests-update branch July 14, 2025 04:22
