Improve python connector test coverage #736
Conversation
pd.DataFrame(
    {
        "c1": pd.Series([1, 2, 3], dtype="int32"),
        "c2": pd.Series([None, None, 4.0], dtype="object"),
I have noticed that when an integer column has null values, the type it gets converted to is inconsistent: sometimes the column gets cast to float, sometimes to object. Is this a bug too?
Though where is the bug? Not in our code, I assume?
I think it is a bug in our code that relates to pandas type conversion rules. In pandas, integer types are not nullable, so a nullable integer column must use either `object` or a float type. A pandas column that only contains `None` will have type `object`, and if you concatenate it with a column of an integer type, the column type remains `object`. However, if you create a pandas column with both integers and `None`, the column will instead have a float type. Which of these two cases gets hit thus depends on whether a column was missing or only contained null values in one of the source parquet files.
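This is plain pandas behavior and can be reproduced in isolation (a minimal sketch, not connector code):

```python
import pandas as pd

# A column holding only nulls has no integer values to infer from, so it stays object.
print(pd.Series([None, None]).dtype)  # object

# Mixing ints and None in one constructor upcasts to float so None can become NaN.
print(pd.Series([1, 2, None]).dtype)  # float64

# Concatenating an all-null (object) column with an int column keeps object.
print(pd.concat([pd.Series([None, None]), pd.Series([1, 2], dtype="int64")]).dtype)  # object
```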
One question is how strict we should be about our mappings from delta types to pandas types. I'm afraid strictly enforcing type conversions will increase memory usage due to triggering pandas consolidation. Maybe we can introduce a flag for how to handle nullable int columns.
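As a rough sketch of what such a flag could look like (the parameter name, helper, and schema layout here are hypothetical, not the connector's actual API; pandas' nullable `Int*` dtypes would let us keep nulls without falling back to float or object):

```python
import pandas as pd

# Hypothetical mapping from Delta primitive type names to pandas nullable integer dtypes.
_DELTA_TO_NULLABLE_INT = {"byte": "Int8", "short": "Int16", "integer": "Int32", "long": "Int64"}

def coerce_nullable_ints(df: pd.DataFrame, schema: dict, strict_int_types: bool = False) -> pd.DataFrame:
    # Hypothetical post-processing step: when strict_int_types is set, cast columns
    # that the Delta schema declares as integer (but that pandas widened to float or
    # object because of nulls) to pandas' nullable integer dtypes.
    if not strict_int_types:
        return df
    for field in schema.get("fields", []):
        target = _DELTA_TO_NULLABLE_INT.get(field.get("type"))
        if target is not None and field["name"] in df.columns:
            # astype to a nullable integer dtype turns None/NaN into pd.NA.
            df[field["name"]] = df[field["name"]].astype(target)
    return df
```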
- In L1705, 4.0 is a float, not an integer?
- But yeah, introducing a flag/parameter sounds good.
Good point, there's no need to use 4.0 in that test case since the column is object-typed, so I changed it to 4. As for the flag for strict type conversion, can we create an issue and fix it later? It's not urgent, and I don't want to block the next release on the type conversion issue.
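For reference, in an object-dtype column the values keep their Python types, so 4 is stored as a plain int and nothing is coerced to float (stock pandas behavior, not connector code):

```python
import pandas as pd

s = pd.Series([None, None, 4], dtype="object")
print(s.dtype)     # object
print(type(s[2]))  # <class 'int'>; no float coercion in an object column
```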
Force-pushed from 33fb65d to def66e8
@@ -1687,3 +1689,420 @@ def test_load_table_changes_as_spark(
        match="Unable to import pyspark. `load_table_changes_as_spark` requires" + " PySpark.",
    ):
        load_table_changes_as_spark("not-used")
Could you add a comment on what each version does to the table?
done
Force-pushed from def66e8 to 61e720c
…tions, for both snapshot and CDF queries
Force-pushed from 61e720c to a3e936b
pytest.param(
    "share8.default.add_columns_non_partitioned_cdf",
    3,
    # table initially contains c1 = [1, 2]
version x contains...
version y adds an int column...
version z inserts a row...
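Something along these lines, for example (the per-version descriptions below are hypothetical; the real comments should describe what the fixture actually does at each table version):

```python
pytest.param(
    "share8.default.add_columns_non_partitioned_cdf",
    3,
    # version 1: table created with c1 = [1, 2]
    # version 2: adds an int column c2 (hypothetical description)
    # version 3: inserts a row populating both columns (hypothetical description)
    ...
)
```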
for col in merged.columns:
    col_map[col.lower()] = col

return merged[[col_map[field["name"].lower()] for field in schema_with_cdf["fields"]]]
this is to preserve the column order?
and why does the column order matter?
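One plausible reason, assuming the test compares the merged frame against an expected DataFrame with `pandas.testing.assert_frame_equal` (an assumption, not stated in the thread): that check is column-order sensitive by default, so the merged result has to be reordered to match the expected schema:

```python
import pandas as pd
from pandas.testing import assert_frame_equal

expected = pd.DataFrame({"c1": [1, 2], "c2": [3, 4]})
reordered = expected[["c2", "c1"]]  # same data, different column order

try:
    assert_frame_equal(reordered, expected)  # order-sensitive: raises AssertionError
except AssertionError:
    pass

assert_frame_equal(reordered, expected, check_like=True)  # order ignored: passes
```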
Adds integration tests for adding columns + updating rows, with and without partitions, for both snapshot and CDF queries.
These tests also caught the following bugs: