
feat: add arrow conversions flexibility #300

Closed

Conversation

ion-elgreco
Contributor

@ion-elgreco ion-elgreco commented Aug 4, 2024

Adds a new trait `TryFromWithSize` so that one can convert a Delta schema to Arrow using either Large Arrow types or Arrow View types. All existing `TryFrom` implementations keep using the normal Arrow types.

This is required so that we can read Parquet directly into Arrow View types, for compatibility with Polars. I will also expose this `ArrowTypeSize` in Python in delta-rs.

Related issue: delta-io/delta-rs#2613

Related PR: https://github.com/delta-io/delta-rs/pull/2736/files
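To make the proposal concrete, here is a minimal std-only sketch of what a size-aware conversion trait could look like. The names `TryFromWithSize` and `ArrowTypeSize` come from this PR description, but the variants, the stand-in `DeltaString`/`ArrowType` types, and the implementation below are illustrative assumptions, not the kernel's actual API:

```rust
// Hypothetical sketch only: `DeltaString`, `ArrowType`, and the enum
// variants are stand-ins, not real delta-kernel-rs or arrow-rs types.

/// Which family of Arrow types a conversion should target.
#[derive(Clone, Copy, Debug, PartialEq)]
enum ArrowTypeSize {
    Normal, // Utf8, Binary, List
    Large,  // LargeUtf8, LargeBinary, LargeList
    View,   // Utf8View, BinaryView
}

/// Like `TryFrom`, but the caller also picks the target type family.
trait TryFromWithSize<T>: Sized {
    type Error;
    fn try_from_with_size(value: T, size: ArrowTypeSize) -> Result<Self, Self::Error>;
}

/// Stand-in for a Delta primitive string type.
#[derive(Debug)]
struct DeltaString;

/// Stand-in for the Arrow data type the conversion produces.
#[derive(Debug, PartialEq)]
enum ArrowType {
    Utf8,
    LargeUtf8,
    Utf8View,
}

impl TryFromWithSize<DeltaString> for ArrowType {
    type Error = String;
    fn try_from_with_size(_: DeltaString, size: ArrowTypeSize) -> Result<Self, Self::Error> {
        // The same Delta type maps to a different Arrow flavor per request.
        Ok(match size {
            ArrowTypeSize::Normal => ArrowType::Utf8,
            ArrowTypeSize::Large => ArrowType::LargeUtf8,
            ArrowTypeSize::View => ArrowType::Utf8View,
        })
    }
}

fn main() {
    let t = ArrowType::try_from_with_size(DeltaString, ArrowTypeSize::View).unwrap();
    println!("{:?}", t);
}
```

The plain `TryFrom` path would then simply delegate to `try_from_with_size(value, ArrowTypeSize::Normal)`, preserving existing behavior.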

When we read parquet Map types in delta-rs, the data gets read with root name "key_value", and that struct type has two fields, "key" and "value". We have casting logic that expects this schema; however, when we convert our snapshot.schema to an Arrow schema, a mismatch occurs because kernel uses different field names.

Error when we optimize tables with Map types:
```python
SchemaMismatchError: Could not find column keys
```

Theoretically, Arrow fields can use whatever names they want, but since we are reading parquet data and therefore getting RecordBatches with that structure, it is simply much easier to keep the structure of the data types consistent between the two.

See parquet docs:

https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#maps
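The parquet spec linked above defines the canonical Map layout as a repeated group named `key_value` containing `key` and `value` fields. A minimal std-only sketch of the mismatch and a normalization step; `MapField` and `normalize_map_names` are hypothetical stand-ins for illustration, not kernel or arrow-rs APIs:

```rust
// Hypothetical sketch only: `MapField` is a stand-in for a map's field
// names in a schema, not a real delta-kernel-rs or arrow-rs type.

#[derive(Debug, PartialEq)]
struct MapField {
    entries_name: String, // kernel might emit something other than "key_value"
    key_name: String,     // e.g. "keys" (as in the SchemaMismatchError above)
    value_name: String,   // e.g. "values"
}

/// Rewrite whatever names a schema used to the parquet-canonical ones,
/// so casting logic that expects "key_value"/"key"/"value" finds them.
fn normalize_map_names(_source: &MapField) -> MapField {
    MapField {
        entries_name: "key_value".to_string(),
        key_name: "key".to_string(),
        value_name: "value".to_string(),
    }
}

fn main() {
    // Field names as a kernel-converted schema might produce them.
    let kernel_style = MapField {
        entries_name: "entries".into(),
        key_name: "keys".into(),
        value_name: "values".into(),
    };
    println!("{:?}", normalize_map_names(&kernel_style));
}
```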
@nicklan
Collaborator

nicklan commented Aug 6, 2024

Code looks fine at first glance. Can you explain a bit more what:

This is required so that we can read Parquet directly into ArrowView types, for compatibility with polars. I will expose this ArrowTypeSize into Python as well in delta-rs

actually means?

Kernel generally will just "do what the parquet crate does" when it comes to the types that come back in the default engine. So we should already read into those types. The parquet crate doesn't take a schema at read, so it's not like we need to convert into an arrow schema that says "please read a view" or anything.

Is this because you want to use the arrow conversions here in delta-rs to properly convert schema? I think we'd want to have a bit of discussion if that's something kernel should provide/support. We could perhaps hide this behind developer visibility for now or something.

@ion-elgreco
Contributor Author

@nicklan We use datafusion's ParquetExec to read parquet data. The ParquetExec receives a file schema and a SchemaAdapter, which define how the data is read.

We don't rely on the kernel engine to do that. However, we do use kernel for Delta -> Arrow schema conversions. Currently kernel makes the naive assumption that normal Arrow types are the types to be used for everything, but that's not the case. Polars, for example, only uses View types for utf8, list, and binary, so we need to read parquet data into those Arrow types to prevent casting the incoming batches to the normal types, which may not fit.

@ion-elgreco
Contributor Author

ion-elgreco commented Aug 6, 2024

Hmm, it seems that Polars doesn't use ListViews though; I might need to rethink the approach.

I essentially want a Python user to be able to specify how the parquet data is read into Arrow, so strings get read as Utf8, LargeUtf8, or Utf8View, preventing costly casts between the different flavors of string/list/binary types.


I think the idea behind the PR still stands: I need to control how the Delta schema gets converted into Arrow, because we heavily rely on that to know how to read the table. Perhaps `TryFromWithArrowSize` should be more granular, with the size defined per type, but then I also need to adjust our RecordBatch casting logic.
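The "size defined per type" idea from the comment above could be sketched as a per-type conversion config. All names here are illustrative assumptions, not the kernel's actual API; the `polars_like` profile reflects the observation above that Polars uses Utf8View/BinaryView but not ListView:

```rust
// Hypothetical sketch only: these enums and `ArrowConversionOptions`
// are illustrative, not real delta-kernel-rs or arrow-rs types.

#[derive(Clone, Copy, Debug, PartialEq)]
enum StringKind { Utf8, LargeUtf8, Utf8View }

#[derive(Clone, Copy, Debug, PartialEq)]
enum BinaryKind { Binary, LargeBinary, BinaryView }

#[derive(Clone, Copy, Debug, PartialEq)]
enum ListKind { List, LargeList } // no ListView: Polars doesn't use it, per above

/// Per-type choice of Arrow flavor, instead of one global "size" switch.
#[derive(Clone, Copy, Debug)]
struct ArrowConversionOptions {
    strings: StringKind,
    binary: BinaryKind,
    lists: ListKind,
}

impl ArrowConversionOptions {
    /// A profile matching what the comments above say Polars expects:
    /// view types for strings and binary, ordinary lists.
    fn polars_like() -> Self {
        ArrowConversionOptions {
            strings: StringKind::Utf8View,
            binary: BinaryKind::BinaryView,
            lists: ListKind::List,
        }
    }
}

fn main() {
    let opts = ArrowConversionOptions::polars_like();
    println!("{:?}", opts);
}
```

A schema conversion could then take such an options struct and pick the Arrow type per Delta primitive, rather than a single size flag for the whole schema.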


codecov bot commented Aug 7, 2024

Codecov Report

Attention: Patch coverage is 62.22222% with 17 lines in your changes missing coverage. Please review.

Project coverage is 72.48%. Comparing base (e0a2e5e) to head (ba40388).

Files Patch % Lines
kernel/src/engine/arrow_conversion.rs 62.22% 11 Missing and 6 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #300      +/-   ##
==========================================
- Coverage   72.51%   72.48%   -0.03%     
==========================================
  Files          43       43              
  Lines        7783     7815      +32     
  Branches     7783     7815      +32     
==========================================
+ Hits         5644     5665      +21     
- Misses       1768     1779      +11     
  Partials      371      371              

☔ View full report in Codecov by Sentry.

@ion-elgreco
Contributor Author

@nicklan closing this, as I've resolved it by allowing stuff to just pass through in delta-rs

@ion-elgreco ion-elgreco closed this Aug 7, 2024
@nicklan
Collaborator

nicklan commented Aug 7, 2024

@nicklan closing this, as I've resolved it by allowing stuff to just pass through in delta-rs

Okay thanks for letting me know.

We do probably need more discussion around this. I'm realizing more and more that these "internal" schema details unfortunately do matter.

Ideally we'll have a very clear description of what kernel exposes/provides on that front. I'll keep you posted on developments :)
