Improve python connector test coverage #736
Conversation
pd.DataFrame(
    {
        "c1": pd.Series([1, 2, 3], dtype="int32"),
        "c2": pd.Series([None, None, 4.0], dtype="object"),
I have noticed that when an integer column has null values, the type it gets converted to is inconsistent: sometimes the column gets cast to float, sometimes to object. Is this a bug too?
Though where is the bug? Not in our code, I assume?
I think it is a bug in our code that relates to pandas type conversion rules. In pandas, integer types are not nullable, so a nullable integer column must use either `object` or a float type. A pandas column that only contains `None` will have type `object`, and if you concatenate it with a column of an integer type, the column type remains `object`. However, if you create a pandas column with both integers and `None`, the column will instead have a float type. Which of these two cases gets hit thus depends on whether a column was missing or only contained null values in one of the source parquet files.
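This is plain pandas behavior and can be reproduced in isolation (a minimal sketch, not connector code):

```python
import pandas as pd

# A column holding only nulls has no integer values to infer from, so it stays object.
print(pd.Series([None, None]).dtype)  # object

# Mixing ints and None in one constructor upcasts to float so None can become NaN.
print(pd.Series([1, 2, None]).dtype)  # float64

# Concatenating an all-null (object) column with an int column keeps object.
print(pd.concat([pd.Series([None, None]), pd.Series([1, 2], dtype="int64")]).dtype)  # object
```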
One question is how strict we should be about our mappings from delta types to pandas types. I'm afraid strictly enforcing type conversions will increase memory usage due to triggering pandas consolidation. Maybe we can introduce a flag for how to handle nullable int columns.
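As a rough sketch of what such a flag could look like (the parameter name, helper, and schema layout here are hypothetical, not the connector's actual API; pandas' nullable `Int*` dtypes would let us keep nulls without falling back to float or object):

```python
import pandas as pd

# Hypothetical mapping from Delta primitive type names to pandas nullable integer dtypes.
_DELTA_TO_NULLABLE_INT = {"byte": "Int8", "short": "Int16", "integer": "Int32", "long": "Int64"}

def coerce_nullable_ints(df: pd.DataFrame, schema: dict, strict_int_types: bool = False) -> pd.DataFrame:
    # Hypothetical post-processing step: when strict_int_types is set, cast columns
    # that the Delta schema declares as integer (but that pandas widened to float or
    # object because of nulls) to pandas' nullable integer dtypes.
    if not strict_int_types:
        return df
    for field in schema.get("fields", []):
        target = _DELTA_TO_NULLABLE_INT.get(field.get("type"))
        if target is not None and field["name"] in df.columns:
            # astype to a nullable integer dtype turns None/NaN into pd.NA.
            df[field["name"]] = df[field["name"]].astype(target)
    return df
```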
- In L1705, 4.0 is a float, not an integer?
- But yeah, introducing a flag/parameter sounds good.
Good point, there's no need to use 4.0 in that test case since the column is object-typed, so I changed it to 4. As for the flag for strict type conversion, can we create an issue and fix it later? It's not urgent, and I don't want to block the next release on the type conversion issue.
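For reference, in an object-dtype column the values keep their Python types, so 4 is stored as a plain int and nothing is coerced to float (stock pandas behavior, not connector code):

```python
import pandas as pd

s = pd.Series([None, None, 4], dtype="object")
print(s.dtype)     # object
print(type(s[2]))  # <class 'int'>; no float coercion in an object column
```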
Force-pushed from 33fb65d to def66e8
@@ -1687,3 +1689,420 @@ def test_load_table_changes_as_spark(
        match="Unable to import pyspark. `load_table_changes_as_spark` requires" + " PySpark.",
    ):
        load_table_changes_as_spark("not-used")
Could you add a comment on what each version does to the table?
done
Force-pushed from def66e8 to 61e720c
…tions, for both snapshot and CDF queries
Force-pushed from 61e720c to a3e936b
pytest.param(
    "share8.default.add_columns_non_partitioned_cdf",
    3,
    # table initially contains c1 = [1, 2]
version x contains...
version y adds an int column...
version z inserts a row...
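Something along these lines, for example (the per-version descriptions below are hypothetical; the real comments should describe what the fixture actually does at each table version):

```python
pytest.param(
    "share8.default.add_columns_non_partitioned_cdf",
    3,
    # version 1: table created with c1 = [1, 2]
    # version 2: adds an int column c2 (hypothetical description)
    # version 3: inserts a row populating both columns (hypothetical description)
    ...
)
```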
for col in merged.columns:
    col_map[col.lower()] = col

return merged[[col_map[field["name"].lower()] for field in schema_with_cdf["fields"]]]
this is to preserve the column order?
and why does the column order matter?
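One plausible reason, assuming the test compares the merged frame against an expected DataFrame with `pandas.testing.assert_frame_equal` (an assumption, not stated in the thread): that check is column-order sensitive by default, so the merged result has to be reordered to match the expected schema:

```python
import pandas as pd
from pandas.testing import assert_frame_equal

expected = pd.DataFrame({"c1": [1, 2], "c2": [3, 4]})
reordered = expected[["c2", "c1"]]  # same data, different column order

try:
    assert_frame_equal(reordered, expected)  # order-sensitive: raises AssertionError
except AssertionError:
    pass

assert_frame_equal(reordered, expected, check_like=True)  # order ignored: passes
```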
Adds integration tests for adding columns + updating rows, with and without partitions, for both snapshot and CDF queries.
These tests also caught the following bugs: