Update docs + add more tests #1233

Merged
merged 12 commits into from
Jul 14, 2025

dreadatour
Contributor

@dreadatour dreadatour commented Jul 13, 2025

Tiny improvements:

  • Add dc.func.not_ function + tests
  • Add/update docs for DataChain methods: reset_schema, add_schema, remove_file_signals, sum, avg, min, max and chunk (+ very minor updates in some other methods)
  • Fix bug in SignalSchema._find_in_tree + add more tests for this method
  • Add more tests for DataChain methods: count, distinct and filter

Summary by Sourcery

Add new conditional helper, improve schema resolution, document and implement aggregation methods, and expand test coverage across DataChain operations

New Features:

  • Add dc.func.not_ conditional function

Bug Fixes:

  • Fix SignalSchema._find_in_tree lookup logic and extend error handling

Enhancements:

  • Enhance and unify docstrings for DataChain methods reset_schema, add_schema, remove_file_signals, sum, avg, min, max, and chunk
  • Expose not_ in the func module's __all__

Tests:

  • Add extensive unit tests for DataChain count, distinct, filter, and aggregation methods
  • Add tests for the not_ function in mutates and conditional functions
  • Parameterize resolve error scenarios in SignalSchema tests
  • Add test_column_compute covering sum, avg, min, and max operations on nested data

@dreadatour dreadatour requested review from shcheklein, dmpetrov, a team and Copilot July 13, 2025 06:25
@dreadatour dreadatour self-assigned this Jul 13, 2025
Contributor

sourcery-ai bot commented Jul 13, 2025

Reviewer's Guide

This PR introduces a new not_ conditional function with accompanying exports and tests, and enriches and standardizes docstrings across various DataChain methods. It also corrects a traversal bug in SignalSchema._find_in_tree, improves its error handling, and adds parameterized error tests. Finally, it significantly expands unit test coverage for counting, distinct, filtering, and aggregation operations.

File-Level Changes

Change: Add not_ function support in datachain.func
Details:
  • Implement not_ in conditional.py to wrap SQL NOT logic
  • Expose not_ in func/__init__.py
  • Add unit tests for not_ in mutate operations
  • Extend conditional function tests to include not_ cases
Files:
  • src/datachain/func/conditional.py
  • src/datachain/func/__init__.py
  • tests/unit/lib/test_func.py
  • tests/func/functions/test_conditional.py

Change: Enhance documentation for DataChain methods
Details:
  • Expand docstrings and parameter sections for schema operations (reset_schema, add_schema, remove_file_signals)
  • Standardize parameter and return descriptions for data conversion methods (to_pandas, show)
  • Add detailed doc comments for aggregation and utility methods (sum, avg, min, max, sample, chunk, to_list, to_values)
Files:
  • src/datachain/lib/dc/datachain.py

Change: Fix bug in SignalSchema tree traversal
Details:
  • Simplify direct path lookup in _find_in_tree and refine traversal loop
  • Tighten error condition to ensure full path consumption
  • Remove outdated special-case branch
  • Introduce parameterized tests for resolve errors across invalid input scenarios
Files:
  • src/datachain/lib/signal_schema.py
  • tests/unit/lib/test_signal_schema.py

Change: Broaden test coverage for core DataChain operations
Details:
  • Add comprehensive tests for count across basic, complex, chained, and in-memory scenarios
  • Introduce extensive distinct tests over simple, nested, multi-column, and error cases
  • Expand filter tests covering comparisons, patterns, logical operators, chaining, and complex objects
  • Include validation of aggregation results (sum, avg, min, max) on nested data models
Files:
  • tests/unit/lib/test_datachain.py
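
The "full path consumption" rule listed for the tree traversal fix can be illustrated with a minimal sketch. This is not the actual SignalSchema._find_in_tree code: the nested-dict tree, the find_in_tree name, and the choice of KeyError are all assumptions made for illustration only.

```python
from typing import Any


def find_in_tree(tree: dict, path: list) -> Any:
    """Walk `tree` along `path`; fail unless every path element is consumed.

    Hypothetical helper: the real schema tree and its error type may differ.
    """
    if not path:
        raise KeyError("empty path")
    node: Any = tree
    for depth, name in enumerate(path):
        # A leaf reached too early, or an unknown name, means the path
        # cannot be fully consumed, so the lookup is rejected.
        if not isinstance(node, dict) or name not in node:
            consumed = ".".join(path[:depth]) or "<root>"
            raise KeyError(f"cannot resolve {name!r} under {consumed}")
        node = node[name]
    return node


# Toy schema: nested signals map names to types.
schema = {"file": {"path": str, "size": int}, "score": float}
size_type = find_in_tree(schema, ["file", "size"])
```

A lookup such as `find_in_tree(schema, ["score", "extra"])` raises instead of silently returning the `score` leaf, which is the kind of partial match a tightened error condition is meant to rule out.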

Contributor

@sourcery-ai sourcery-ai bot left a comment

Hey @dreadatour - I've reviewed your changes - here's some feedback:

  • Many of the newly added test_count, test_distinct, and test_filter functions follow the same pattern and could be consolidated with pytest.mark.parametrize to reduce duplication and improve readability.
  • There aren’t any tests covering the new reset_schema, add_schema, remove_file_signals, or chunk methods—adding tests for those would help verify their behavior.
  • Docstrings currently mix ‘Args’ and ‘Parameters’ styles—please pick one convention and apply it consistently across all methods.
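
The consolidation suggested in the first point can be sketched as a table of cases driven through one check function. The sketch below uses only the standard library so it runs anywhere; count_matching is a hypothetical stand-in for a chain's filter-then-count, not DataChain code. With pytest, the same table would be passed to `pytest.mark.parametrize` (with `ids=` for the case names) instead of being looped over by hand.

```python
def count_matching(values, predicate):
    """Toy stand-in (hypothetical) for filtering a chain and counting rows."""
    return sum(1 for v in values if predicate(v))


# Each tuple is (case id, input values, predicate, expected count).
COUNT_CASES = [
    ("empty", [], lambda v: True, 0),
    ("all_match", [1, 2, 3], lambda v: v > 0, 3),
    ("some_match", [1, 2, 3, 4], lambda v: v % 2 == 0, 2),
    ("none_match", [1, 2, 3], lambda v: v > 10, 0),
]


def check_count(values, predicate, expected):
    """Single test body shared by every case in the table."""
    assert count_matching(values, predicate) == expected


def run_cases():
    """Run every case; return how many were checked."""
    for _case_id, values, predicate, expected in COUNT_CASES:
        check_count(values, predicate, expected)
    return len(COUNT_CASES)
```

Adding a scenario then means adding one row to the table rather than one more near-identical test function.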

## Individual Comments

### Comment 1
<location> `src/datachain/func/conditional.py:294` </location>
<code_context>
     return Func("and", inner=sql_and, cols=cols, args=func_args, result_type=bool)
+
+
+def not_(arg: Union[ColumnElement, Func]) -> Func:
+    """
+    Returns the function that produces NOT of the given expressions.
</code_context>

<issue_to_address>
The handling of string arguments in not_ may be inconsistent with and_/or_.

In not_, both strings and Funcs are added to 'cols', unlike and_ where only strings go to 'cols' and Funcs to 'func_args'. Please standardize argument handling for consistency.
</issue_to_address>
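
As a rough illustration of the convention this comment attributes to and_ (strings go to 'cols', everything else to 'func_args'), here is a minimal sketch. FakeFunc and split_args are hypothetical names invented for this example; the real datachain argument handling is not shown on this page and may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FakeFunc:
    """Hypothetical stand-in for a Func or column-expression object."""

    name: str


def split_args(*args):
    """Treat strings as column names and everything else as expressions."""
    cols, func_args = [], []
    for arg in args:
        (cols if isinstance(arg, str) else func_args).append(arg)
    return cols, func_args


# A single-argument function such as not_ would receive exactly one of these.
cols, func_args = split_args("is_valid", FakeFunc("isnone"))
```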

### Comment 2
<location> `tests/unit/lib/test_datachain.py:4291` </location>
<code_context>
+def test_column_compute(test_session):
</code_context>

<issue_to_address>
Missing edge case: sum/avg/min/max on empty columns or all-non-numeric columns.

Please add tests to verify how these aggregation methods behave with empty or all non-numeric columns, ensuring they handle such cases appropriately.
</issue_to_address>
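
For context on the empty-column case, the sketch below shows what plain SQL aggregates return in SQLite when there is nothing to aggregate. It uses only the standard-library sqlite3 module, and the table and function names are made up for the example. Whether DataChain's sum, avg, min, and max surface the same values is an assumption that the requested tests would need to confirm.

```python
import sqlite3


def aggregate(values):
    """Return (COUNT, SUM, AVG, MIN, MAX) over an INTEGER column in SQLite."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (value INTEGER)")
    conn.executemany("INSERT INTO t VALUES (?)", [(v,) for v in values])
    row = conn.execute(
        "SELECT COUNT(value), SUM(value), AVG(value), MIN(value), MAX(value) FROM t"
    ).fetchone()
    conn.close()
    return row


empty = aggregate([])                # no rows at all
all_null = aggregate([None, None])   # rows exist, but every value is NULL
mixed = aggregate([1, None, 3])      # NULLs are skipped by the aggregates
```

In SQLite, both the empty and the all-NULL case yield a count of 0 and NULL (None in Python) for the other four aggregates, so a test for this edge case has to decide whether None is the expected result or an error.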

### Comment 3
<location> `tests/unit/lib/test_datachain.py:3019` </location>
<code_context>
+    assert chain.to_values("numbers") == [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
+
+
+def test_filter_with_func_operations(test_session):
+    """Test filter with datachain.func operations."""
+    from datachain.func import string
+
+    chain = dc.read_values(
+        names=["Alice", "Bob", "Charlie", "David", "Eva"],
+        ages=[25, 30, 35, 40, 45],
+        session=test_session,
+    )
+
+    # Test string length filter
+    filtered_chain = chain.filter(string.length(C("names")) > 4)
+    assert filtered_chain.count() == 3
+    assert filtered_chain.to_values("names") == ["Alice", "Charlie", "David"]
+
+
</code_context>

<issue_to_address>
Consider adding filter tests for null/missing values.

Adding such tests will help ensure the filter logic correctly handles null or missing values without errors and produces the expected results.
</issue_to_address>

<suggested_fix>
<<<<<<< SEARCH
def test_filter_with_func_operations(test_session):
    """Test filter with datachain.func operations."""
    from datachain.func import string

    chain = dc.read_values(
        names=["Alice", "Bob", "Charlie", "David", "Eva"],
        ages=[25, 30, 35, 40, 45],
        session=test_session,
    )

    # Test string length filter
    filtered_chain = chain.filter(string.length(C("names")) > 4)
    assert filtered_chain.count() == 3
    assert filtered_chain.to_values("names") == ["Alice", "Charlie", "David"]


=======
def test_filter_with_func_operations(test_session):
    """Test filter with datachain.func operations."""
    from datachain.func import string

    chain = dc.read_values(
        names=["Alice", "Bob", "Charlie", "David", "Eva"],
        ages=[25, 30, 35, 40, 45],
        session=test_session,
    )

    # Test string length filter
    filtered_chain = chain.filter(string.length(C("names")) > 4)
    assert filtered_chain.count() == 3
    assert filtered_chain.to_values("names") == ["Alice", "Charlie", "David"]

def test_filter_with_null_values(test_session):
    """Test filter operations with null/missing values."""
    from datachain.func import string

    # Include None (null) and missing values
    chain = dc.read_values(
        names=["Alice", None, "Charlie", "", "Eva", None],
        ages=[25, 30, None, 40, 45, None],
        session=test_session,
    )

    # Filter out rows where names is None.
    # With SQLAlchemy-style column expressions, `!= None` is expected to
    # compile to IS NOT NULL, so the comparison is intentional (noqa: E711).
    filtered_chain = chain.filter(C("names") != None)  # noqa: E711
    assert filtered_chain.to_values("names") == ["Alice", "Charlie", "", "Eva"]

    # Filter for rows where ages is None
    null_ages_chain = chain.filter(C("ages") == None)  # noqa: E711
    assert null_ages_chain.to_values("names") == ["Charlie", None]

    # Filter for non-empty, non-null names with length > 0
    non_empty_names_chain = chain.filter(
        (C("names") != None) & (string.length(C("names")) > 0)  # noqa: E711
    )
    assert non_empty_names_chain.to_values("names") == ["Alice", "Charlie", "Eva"]

    # Filter for rows where names is missing or empty
    missing_or_empty_names_chain = chain.filter(
        (C("names") == None) | (string.length(C("names")) == 0)  # noqa: E711
    )
    assert missing_or_empty_names_chain.to_values("names") == [None, "", None]
>>>>>>> REPLACE

</suggested_fix>
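
The expectations in the suggested test rest on SQL's NULL semantics. The sketch below reproduces them with the standard-library sqlite3 module on the same sample data; names_where is a hypothetical helper, and whether DataChain can ingest None values this way is not confirmed here.

```python
import sqlite3

ROWS = [
    ("Alice", 25), (None, 30), ("Charlie", None),
    ("", 40), ("Eva", 45), (None, None),
]


def names_where(condition):
    """Return names matching a hard-coded SQL condition, in insertion order."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE people (name TEXT, age INTEGER)")
    conn.executemany("INSERT INTO people VALUES (?, ?)", ROWS)
    # `condition` is trusted literal SQL in this sketch; never interpolate
    # user input into a query like this.
    rows = conn.execute(
        f"SELECT name FROM people WHERE {condition} ORDER BY rowid"
    ).fetchall()
    conn.close()
    return [name for (name,) in rows]


not_null = names_where("name IS NOT NULL")
null_age = names_where("age IS NULL")
not_alice = names_where("name != 'Alice'")   # NULL names are dropped too
equals_null = names_where("name = NULL")     # never true in SQL
missing_or_empty = names_where("name IS NULL OR length(name) = 0")
```

A raw `= NULL` comparison matches no rows, which is why the translation of `== None` into IS NULL matters for the filters in the suggested test.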

Contributor

@Copilot Copilot AI left a comment

Pull Request Overview

This PR introduces a new logical not_ function, expands and fixes tests for schema resolution and DataChain operations, updates documentation for several DataChain methods, and patches a bug in the signal lookup logic.

  • Add and test dc.func.not_ alongside existing boolean functions.
  • Revise _find_in_tree in SignalSchema to better handle unmatched paths.
  • Enhance docs for schema management and aggregation methods; add comprehensive tests for count, distinct, and filter.
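
Since not_ is described in this PR as wrapping SQL NOT logic, it may help to recall how NOT behaves at the SQL level, in particular with NULL. The sketch below uses the standard-library sqlite3 module rather than datachain itself, so it only illustrates the underlying semantics, not the dc.func.not_ API.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# NOT flips true and false, but NOT NULL stays NULL (three-valued logic).
truth_row = conn.execute("SELECT NOT 1, NOT 0, NOT NULL").fetchone()

conn.execute("CREATE TABLE t (n INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(1,), (5,), (None,)])

# The NULL row satisfies neither `n > 3` nor `NOT (n > 3)`.
kept = [n for (n,) in conn.execute("SELECT n FROM t WHERE NOT (n > 3) ORDER BY rowid")]
conn.close()
```

A practical consequence is that a condition and its negation do not always partition the rows: rows where the condition evaluates to NULL fall out of both filters.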

Reviewed Changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.

Show a summary per file
File: Description
  • tests/unit/test_func.py: Import and test the new not_ function with SQLite skips
  • tests/unit/lib/test_signal_schema.py: Parameterized test_resolve_error to cover more invalid cases
  • tests/unit/lib/test_datachain.py: Bulk addition of count, distinct, filter, and aggregation tests
  • tests/func/functions/test_conditional.py: Rename and extend conditional logic test to include not_
  • src/datachain/lib/signal_schema.py: Refactor _find_in_tree to unify dotted-path lookup and error handling
  • src/datachain/lib/dc/datachain.py: Document reset_schema, add_schema, remove_file_signals, sum, avg, min, max, chunk; import StandardType
  • src/datachain/func/conditional.py: Implement new not_ function wrapping SQLAlchemy’s not_
  • src/datachain/func/__init__.py: Expose not_ in the public API import list
Comments suppressed due to low confidence (2)

src/datachain/func/conditional.py:294

  • The docstring describes support for string column names, but the signature omits str. Change the annotation to Union[str, ColumnElement, Func] to match intent and keep consistency with and_ and or_.
def not_(arg: Union[ColumnElement, Func]) -> Func:

src/datachain/lib/dc/datachain.py:369

  • [nitpick] The docstring lists parameters but omits a Returns section. Add a Returns: entry (e.g., Self) for clarity and consistency with other methods.
    def reset_schema(self, signals_schema: SignalSchema) -> "Self":

codecov bot commented Jul 13, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 88.70%. Comparing base (8b3c25a) to head (b4576b4).
Report is 2 commits behind head on main.

@@            Coverage Diff             @@
##             main    #1233      +/-   ##
==========================================
+ Coverage   88.66%   88.70%   +0.03%     
==========================================
  Files         153      153              
  Lines       13793    13792       -1     
  Branches     1927     1928       +1     
==========================================
+ Hits        12230    12234       +4     
+ Misses       1109     1103       -6     
- Partials      454      455       +1     
Flag Coverage Δ
datachain 88.63% <100.00%> (+0.03%) ⬆️
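
The percentages in the diff above can be reproduced from the raw counts. The sketch below assumes coverage is computed as hits divided by total lines (hits + misses + partials) and that the report truncates rather than rounds to two decimals; the second point is an inference from the numbers shown, not documented behavior.

```python
import math


def coverage_pct(hits, misses, partials):
    """Coverage as a percentage of all tracked lines."""
    lines = hits + misses + partials
    return 100 * hits / lines


def truncate2(value):
    """Truncate (not round) to two decimal places."""
    return math.floor(value * 100) / 100


base = coverage_pct(12230, 1109, 454)   # main:  13793 lines
head = coverage_pct(12234, 1103, 455)   # #1233: 13792 lines
delta = head - base

print(truncate2(base), truncate2(head), truncate2(delta))
```

Rounding instead of truncating would give 88.67% for the base and a +0.04% delta, which does not match the report.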

Files with missing lines Coverage Δ
src/datachain/func/__init__.py 100.00% <100.00%> (ø)
src/datachain/func/conditional.py 100.00% <100.00%> (ø)
src/datachain/lib/dc/datachain.py 91.40% <100.00%> (+1.00%) ⬆️
src/datachain/lib/pytorch.py 88.80% <100.00%> (+0.09%) ⬆️
src/datachain/lib/signal_schema.py 96.10% <100.00%> (ø)

... and 1 file with indirect coverage changes


cloudflare-workers-and-pages bot commented Jul 13, 2025

Deploying datachain-documentation with Cloudflare Pages

Latest commit: b4576b4
Status: ✅  Deploy successful!
Preview URL: https://ae362322.datachain-documentation.pages.dev
Branch Preview URL: https://docs-tests-update.datachain-documentation.pages.dev


"""Compute the minimum of a column.

Parameters:
col: The column to compute the minimum for.
Member

Can we inline an example of what that column can look like?

Contributor Author

Updated (for the sum, avg, min, and max methods); please take a look.

Member

@shcheklein shcheklein left a comment

(some modifications might still be needed, e.g. hiding or removing weird methods)

@dreadatour dreadatour merged commit 08c49ca into main Jul 14, 2025
58 of 59 checks passed
@dreadatour dreadatour deleted the docs-tests-update branch July 14, 2025 04:22