feat: add arrow conversions flexibility #300
Conversation
When we read parquet Map types in delta-rs, the data comes back with the root entry named "key_value" and a struct type holding two fields, "key" and "value". Our casting logic expects this schema, but when we convert our snapshot.schema to an arrow schema, a mismatch occurs because kernel uses different field names. This produces an error when we optimize tables with Map types:

```python
SchemaMismatchError: Could not find column keys
```

Theoretically arrow fields can use whatever names they want, but since we are reading parquet data and therefore getting RecordBatches with that structure, it is much easier to keep the structure of the data types consistent between the two. See the parquet docs: https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#maps
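For reference, a minimal sketch of that parquet-spec map layout expressed as an arrow-rs field; the column name and the Utf8/Int64 key/value types are made up for illustration:

```rust
use std::sync::Arc;
use arrow_schema::{DataType, Field};

fn main() {
    // Parquet's MAP logical type is a repeated "key_value" group with a
    // required "key" and an optional "value" child, so the arrow field that
    // matches what a parquet read produces looks like this.
    let entries = Field::new(
        "key_value",
        DataType::Struct(
            vec![
                Field::new("key", DataType::Utf8, false),
                Field::new("value", DataType::Int64, true),
            ]
            .into(),
        ),
        false,
    );
    // "my_map" and the key/value types are placeholders, not delta-rs names.
    let map_field = Field::new("my_map", DataType::Map(Arc::new(entries), false), true);
    println!("{map_field:#?}");
}
```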
Code looks fine at first glance. Can you explain a bit more what:
actually means? Kernel generally just does whatever the parquet crate does when it comes to the types that come back in the default engine, so we should already read into those types. The parquet crate doesn't take a schema at read time, so it's not as if we need to convert into an arrow schema that says "please read a view" or anything. Is this because you want to use the arrow conversions here in delta-rs to properly convert the schema? I think we'd want a bit of discussion about whether that's something kernel should provide/support. We could perhaps hide this behind developer visibility for now or something.
@nicklan We use datafusion's ParquetExec to read parquet data. ParquetExec receives a file schema and a SchemaAdapter, which define how the data is read; we don't rely on the kernel engine for that. However, we do use kernel for Delta -> Arrow schema conversions. Currently kernel makes the naive assumption that the normal arrow types are the ones to use everywhere, but that's not always the case. Polars, for example, only uses View types for utf8, list and binary, so we need to read parquet data into those arrow types to avoid casting the incoming batches to the normal types, which may not fit.
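A small sketch, assuming a recent arrow-rs that has the view types, of what such a read schema looks like; the column names are invented:

```rust
use std::sync::Arc;
use arrow_schema::{DataType, Field, Schema};

fn main() {
    // Declaring string/binary columns as view types up front means the
    // batches produced by the reader already match what a view-based
    // consumer such as Polars expects, so no post-read cast is needed.
    let read_schema = Arc::new(Schema::new(vec![
        Field::new("name", DataType::Utf8View, true),
        Field::new("payload", DataType::BinaryView, true),
    ]));
    println!("{read_schema:#?}");
}
```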
Hmm, it seems that polars doesn't use ListViews though, so I might need to rethink the approach. Essentially I want a user to be able to specify from Python how the parquet data is read into arrow, so strings get read as Utf8, LargeUtf8 or Utf8View, which prevents costly casts between the different flavors of string/list/binary types. I think the idea behind the PR still stands: I need to control how the Delta schema gets converted into arrow because we rely heavily on that to know how to read the table. Perhaps TryFromWithArrowSize should be more granular and the size should be defined per type, but then I also need to adjust our RecordBatch casting logic.
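To make the cost concrete, this is the kind of per-batch cast delta-rs would otherwise have to run; a minimal sketch assuming an arrow-rs version whose cast kernel supports Utf8 to Utf8View:

```rust
use std::sync::Arc;
use arrow_array::{ArrayRef, StringArray};
use arrow_cast::cast;
use arrow_schema::{ArrowError, DataType};

fn main() -> Result<(), ArrowError> {
    // A string column as it comes back from a plain Utf8 parquet read.
    let utf8: ArrayRef = Arc::new(StringArray::from(vec!["a", "bb", "ccc"]));

    // The extra per-batch conversion we want to avoid by reading into the
    // view type in the first place.
    let utf8_view = cast(&utf8, &DataType::Utf8View)?;
    assert_eq!(utf8_view.data_type(), &DataType::Utf8View);
    Ok(())
}
```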
Codecov Report
Attention: Patch coverage is
Additional details and impacted files

```diff
@@            Coverage Diff             @@
##             main     #300      +/-   ##
==========================================
- Coverage   72.51%   72.48%   -0.03%
==========================================
  Files          43       43
  Lines        7783     7815      +32
==========================================
+ Hits         5644     5665      +21
- Misses       1768     1779      +11
  Partials      371      371
```

☔ View full report in Codecov by Sentry.
@nicklan closing this, as I've resolved it by allowing stuff to just pass through in delta-rs
Okay thanks for letting me know. We do probably need more discussion around this. I'm realizing more and more that these "internal" schema details unfortunately do matter. Ideally we'll have a very clear description of what kernel exposes/provides on that front. I'll keep you posted on developments :)
Adds a new trait TryFromWithSize so that one can convert a Delta schema to Arrow using Large types or View types; all existing TryFrom implementations keep using the normal Arrow types.
This is required so that we can read Parquet directly into Arrow View types, for compatibility with polars. I will expose this ArrowTypeSize in Python as well, in delta-rs.
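As a rough sketch of the shape described above (the actual trait and enum in this PR may differ in naming and signature), with DeltaPrimitive standing in for the kernel's real primitive type:

```rust
use arrow_schema::{ArrowError, DataType};

/// Which arrow "flavor" to target for variable-length types.
#[derive(Clone, Copy, Debug)]
pub enum ArrowTypeSize {
    Normal,
    Large,
    View,
}

/// Fallible conversion that also takes the desired arrow size.
pub trait TryFromWithSize<T>: Sized {
    type Error;
    fn try_from_with_size(value: T, size: ArrowTypeSize) -> Result<Self, Self::Error>;
}

/// Stand-in for the kernel's primitive type, for illustration only.
pub enum DeltaPrimitive {
    String,
    Binary,
}

impl TryFromWithSize<&DeltaPrimitive> for DataType {
    type Error = ArrowError;

    fn try_from_with_size(value: &DeltaPrimitive, size: ArrowTypeSize) -> Result<Self, Self::Error> {
        Ok(match (value, size) {
            (DeltaPrimitive::String, ArrowTypeSize::Normal) => DataType::Utf8,
            (DeltaPrimitive::String, ArrowTypeSize::Large) => DataType::LargeUtf8,
            (DeltaPrimitive::String, ArrowTypeSize::View) => DataType::Utf8View,
            (DeltaPrimitive::Binary, ArrowTypeSize::Normal) => DataType::Binary,
            (DeltaPrimitive::Binary, ArrowTypeSize::Large) => DataType::LargeBinary,
            (DeltaPrimitive::Binary, ArrowTypeSize::View) => DataType::BinaryView,
        })
    }
}
```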
Related issue: delta-io/delta-rs#2613
Related PR: https://github.com/delta-io/delta-rs/pull/2736/files