
Improve python connector test coverage #736


Open · wants to merge 1 commit into main from PatrickJin-db/improve-tests

Conversation

@PatrickJin-db (Collaborator) commented May 24, 2025

Adds integration tests for adding columns + updating rows, with and without partitions, for both snapshot and CDF queries.
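Schematically, the new coverage is a small matrix over partitioning and query type. The sketch below is illustrative only (the test name and structure are invented, patterned on the real pytest.param fragments quoted later in this thread):

```python
import pytest

# Illustrative outline of the coverage matrix; the real tests also carry
# expected DataFrames per table version for each query type.
@pytest.mark.parametrize("partitioned", [False, True])
@pytest.mark.parametrize("query_type", ["snapshot", "cdf"])
def test_add_columns_and_update_rows(partitioned, query_type):
    ...
```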

These tests also caught the following bugs:

  • CDF queries do not preserve column order (fixed in this PR too)
  • Delta format query with version specified is broken in reference server #735
  • Reference server does not return writerFeatures in delta format CDF query response. I'm unsure whether the correct behavior is for the server to return writerFeatures, or whether the client should care about writerFeatures at all. Either way, the consequence is that CDF queries with kernel fail in the integration tests, since kernel enforces that writerFeatures must be included if minWriterVersion >= 7 (see the sketch below).
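For context, the Delta protocol requires explicit feature lists once table features are in use. The following is a toy illustration of the constraint kernel enforces, not the connector's or kernel's actual code:

```python
# A protocol action as it might appear in a CDF response (illustrative dict,
# not the exact server payload). With minWriterVersion >= 7, the Delta
# protocol expects an explicit writerFeatures list.
protocol = {
    "minReaderVersion": 3,
    "minWriterVersion": 7,
    # "writerFeatures" omitted, as in the reference server's CDF response
}

if protocol["minWriterVersion"] >= 7 and "writerFeatures" not in protocol:
    raise ValueError("writerFeatures must be present when minWriterVersion >= 7")
```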

```python
pd.DataFrame(
    {
        "c1": pd.Series([1, 2, 3], dtype="int32"),
        "c2": pd.Series([None, None, 4.0], dtype="object"),
```
@PatrickJin-db (Author):

I have noticed that when an integer column has null values, the type it gets converted to is inconsistent: sometimes the column gets cast to float, sometimes to object. Is this a bug too?

Collaborator:

Though, where is the bug? Not in our code, I assume?

@PatrickJin-db (Author) commented May 28, 2025:

I think it is a bug in our code that relates to pandas type-conversion rules. In pandas, integer types are not nullable, so a nullable integer column must use either object or a float type. A pandas column that contains only None will have type object, and if you concatenate it with a column of an integer type, the column type remains object. However, if you create a pandas column with both integers and None, the column will instead have a float type. Which of these two cases gets hit therefore depends on whether a column was missing, or contained only null values, in one of the source parquet files.
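A minimal demonstration of the two paths described above, using plain pandas (no connector code involved):

```python
import pandas as pd

# Integers mixed with None in one column: pandas promotes to float64.
mixed = pd.Series([1, 2, None])
print(mixed.dtype)  # float64

# A column containing only None stays dtype object...
only_none = pd.Series([None, None])
print(only_none.dtype)  # object

# ...and concatenating it with an integer column keeps the result object.
ints = pd.Series([1, 2], dtype="int64")
print(pd.concat([only_none, ints], ignore_index=True).dtype)  # object
```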

One question is how strict we should be about our mappings from delta types to pandas types. I'm afraid strictly enforcing type conversions will increase memory usage due to triggering pandas consolidation. Maybe we can introduce a flag for how to handle nullable int columns.
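One possible shape for such a flag, as a hypothetical sketch (the helper and parameter name are invented, not part of the connector's API): pandas' Int32 extension dtype keeps integers nullable via pd.NA, at the cost of an extra conversion.

```python
import pandas as pd

# Hypothetical helper: opt in to pandas' nullable integer dtype instead of
# letting a nullable int column decay to float64 or object.
def to_int_column(values, strict_nullable_int=False):
    if strict_nullable_int:
        return pd.Series(values, dtype="Int32")  # nullable extension dtype
    return pd.Series(values)  # default pandas inference

print(to_int_column([1, 2, None], strict_nullable_int=True).dtype)  # Int32
print(to_int_column([1, 2, None]).dtype)  # float64
```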

Collaborator:

  • in L1705, 4.0 is a float, not an integer?
  • but yeah, introducing a flag/parameter sounds good.

@PatrickJin-db (Author):

Good point, there's no need to use 4.0 in that test case since the column is object type; I changed it to 4. As for the flag for strict type conversion, can we create an issue and fix it later? It's not urgent, and I don't want to block the next release on the type-conversion issue.

@PatrickJin-db PatrickJin-db requested a review from linzhou-db May 24, 2025 02:32
@PatrickJin-db PatrickJin-db force-pushed the PatrickJin-db/improve-tests branch 2 times, most recently from 33fb65d to def66e8 on May 24, 2025 02:42
@@ -1687,3 +1689,420 @@ def test_load_table_changes_as_spark(

```python
        match="Unable to import pyspark. `load_table_changes_as_spark` requires" + " PySpark.",
    ):
        load_table_changes_as_spark("not-used")
```

Collaborator:

Could you add a comment on what each version does to the table?

@PatrickJin-db (Author):

Done.

@PatrickJin-db PatrickJin-db force-pushed the PatrickJin-db/improve-tests branch from def66e8 to 61e720c on May 28, 2025 00:43
@PatrickJin-db PatrickJin-db requested a review from linzhou-db May 28, 2025 00:50
@PatrickJin-db PatrickJin-db force-pushed the PatrickJin-db/improve-tests branch from 61e720c to a3e936b on May 28, 2025 01:58
```python
pytest.param(
    "share8.default.add_columns_non_partitioned_cdf",
    3,
    # table initially contains c1 = [1, 2]
```
Collaborator:

version x contains...
version y added int column...
version z insert a row...

```python
# Map lowercased field names to the actual column names, then reorder the
# merged DataFrame's columns to match the field order in the CDF schema.
for col in merged.columns:
    col_map[col.lower()] = col

return merged[[col_map[field["name"].lower()] for field in schema_with_cdf["fields"]]]
```
Collaborator:

Is this to preserve the column order?

Collaborator:

And why does the column order matter?
