
feat: add arrow conversions flexibility #300

Closed

Conversation

ion-elgreco
Contributor

@ion-elgreco ion-elgreco commented Aug 4, 2024

Adds a new trait `TryFromWithSize` so that one can convert a Delta schema to Arrow using either Large Arrow types or Arrow View types. All existing `TryFrom` implementations keep using the normal Arrow types.

This is required so that we can read Parquet directly into Arrow View types, for compatibility with Polars. I will also expose this `ArrowTypeSize` in Python in delta-rs.

Related issue: delta-io/delta-rs#2613

Related PR: https://github.com/delta-io/delta-rs/pull/2736/files
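To make the proposal concrete, here is a minimal std-only sketch of what a size-aware conversion trait could look like. The names `TryFromWithSize` and `ArrowTypeSize` come from this PR description, but the variants, the stand-in `DeltaString`/`ArrowType` types, and the implementation below are illustrative assumptions, not the kernel's actual API:

```rust
// Hypothetical sketch only: `DeltaString`, `ArrowType`, and the enum
// variants are stand-ins, not real delta-kernel-rs or arrow-rs types.

/// Which family of Arrow types a conversion should target.
#[derive(Clone, Copy, Debug, PartialEq)]
enum ArrowTypeSize {
    Normal, // Utf8, Binary, List
    Large,  // LargeUtf8, LargeBinary, LargeList
    View,   // Utf8View, BinaryView
}

/// Like `TryFrom`, but the caller also picks the target type family.
trait TryFromWithSize<T>: Sized {
    type Error;
    fn try_from_with_size(value: T, size: ArrowTypeSize) -> Result<Self, Self::Error>;
}

/// Stand-in for a Delta primitive string type.
#[derive(Debug)]
struct DeltaString;

/// Stand-in for the Arrow data type the conversion produces.
#[derive(Debug, PartialEq)]
enum ArrowType {
    Utf8,
    LargeUtf8,
    Utf8View,
}

impl TryFromWithSize<DeltaString> for ArrowType {
    type Error = String;
    fn try_from_with_size(_: DeltaString, size: ArrowTypeSize) -> Result<Self, Self::Error> {
        // The same Delta type maps to a different Arrow flavor per request.
        Ok(match size {
            ArrowTypeSize::Normal => ArrowType::Utf8,
            ArrowTypeSize::Large => ArrowType::LargeUtf8,
            ArrowTypeSize::View => ArrowType::Utf8View,
        })
    }
}

fn main() {
    let t = ArrowType::try_from_with_size(DeltaString, ArrowTypeSize::View).unwrap();
    println!("{:?}", t);
}
```

The plain `TryFrom` path would then simply delegate to `try_from_with_size(value, ArrowTypeSize::Normal)`, preserving existing behavior.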

When we read parquet Map types in delta-rs, the data gets read with root name "key_value", and that struct type has two fields, "key" and "value". We have casting logic that expects this schema; however, when we convert our snapshot.schema to an Arrow schema, a mismatch occurs because kernel uses different field names.

Error when we optimize tables with Map types:
```python
SchemaMismatchError: Could not find column keys
```

Theoretically, Arrow fields can use whatever names they want, but since we are reading parquet data and therefore getting RecordBatches with that structure, it is simply much easier to keep the structure of the data types consistent between the two.

See parquet docs:

https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#maps
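The parquet spec linked above defines the canonical Map layout as a repeated group named `key_value` containing `key` and `value` fields. A minimal std-only sketch of the mismatch and a normalization step; `MapField` and `normalize_map_names` are hypothetical stand-ins for illustration, not kernel or arrow-rs APIs:

```rust
// Hypothetical sketch only: `MapField` is a stand-in for a map's field
// names in a schema, not a real delta-kernel-rs or arrow-rs type.

#[derive(Debug, PartialEq)]
struct MapField {
    entries_name: String, // kernel might emit something other than "key_value"
    key_name: String,     // e.g. "keys" (as in the SchemaMismatchError above)
    value_name: String,   // e.g. "values"
}

/// Rewrite whatever names a schema used to the parquet-canonical ones,
/// so casting logic that expects "key_value"/"key"/"value" finds them.
fn normalize_map_names(_source: &MapField) -> MapField {
    MapField {
        entries_name: "key_value".to_string(),
        key_name: "key".to_string(),
        value_name: "value".to_string(),
    }
}

fn main() {
    // Field names as a kernel-converted schema might produce them.
    let kernel_style = MapField {
        entries_name: "entries".into(),
        key_name: "keys".into(),
        value_name: "values".into(),
    };
    println!("{:?}", normalize_map_names(&kernel_style));
}
```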
@nicklan
Collaborator

nicklan commented Aug 6, 2024

Code looks fine at first glance. Can you explain a bit more what:

This is required so that we can read Parquet directly into ArrowView types, for compatibility with polars. I will expose this ArrowTypeSize into Python as well in delta-rs

actually means?

Kernel generally will just "do what the parquet crate does" when it comes to the types that come back in the default engine. So we should already read into those types. The parquet crate doesn't take a schema at read, so it's not like we need to convert into an arrow schema that says "please read a view" or anything.

Is this because you want to use the arrow conversions here in delta-rs to properly convert schema? I think we'd want to have a bit of discussion if that's something kernel should provide/support. We could perhaps hide this behind developer visibility for now or something.

@ion-elgreco
Contributor Author

@nicklan We use datafusion's ParquetExec to read parquet data. The ParquetExec receives a file schema and a SchemaAdapter, which define how the data is read.

We don't rely on the kernel engine to do that. However, we do use kernel for Delta -> Arrow schema conversions. Currently kernel makes the naive assumption that normal Arrow types are the types to be used for everything, but that's not the case. Polars, for example, only uses View types for utf8, list, and binary, so we need to read parquet data into those Arrow types to prevent casting the incoming batches to the normal types, which may not fit.

@ion-elgreco
Contributor Author

ion-elgreco commented Aug 6, 2024

Hmm, it seems that Polars doesn't use ListViews though; I might need to rethink the approach.

I essentially want a Python user to be able to specify how the parquet data is read into Arrow, so strings get read as Utf8, LargeUtf8, or Utf8View, preventing costly casts between the different flavors of string/list/binary types.


I think the idea behind the PR still stands: I need to control how the Delta schema gets converted into Arrow, because we heavily rely on that to know how to read the table. Perhaps `TryFromWithArrowSize` should be more granular, with the size defined per type, but then I also need to adjust our RecordBatch casting logic.
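The "size defined per type" idea from the comment above could be sketched as a per-type conversion config. All names here are illustrative assumptions, not the kernel's actual API; the `polars_like` profile reflects the observation above that Polars uses Utf8View/BinaryView but not ListView:

```rust
// Hypothetical sketch only: these enums and `ArrowConversionOptions`
// are illustrative, not real delta-kernel-rs or arrow-rs types.

#[derive(Clone, Copy, Debug, PartialEq)]
enum StringKind { Utf8, LargeUtf8, Utf8View }

#[derive(Clone, Copy, Debug, PartialEq)]
enum BinaryKind { Binary, LargeBinary, BinaryView }

#[derive(Clone, Copy, Debug, PartialEq)]
enum ListKind { List, LargeList } // no ListView: Polars doesn't use it, per above

/// Per-type choice of Arrow flavor, instead of one global "size" switch.
#[derive(Clone, Copy, Debug)]
struct ArrowConversionOptions {
    strings: StringKind,
    binary: BinaryKind,
    lists: ListKind,
}

impl ArrowConversionOptions {
    /// A profile matching what the comments above say Polars expects:
    /// view types for strings and binary, ordinary lists.
    fn polars_like() -> Self {
        ArrowConversionOptions {
            strings: StringKind::Utf8View,
            binary: BinaryKind::BinaryView,
            lists: ListKind::List,
        }
    }
}

fn main() {
    let opts = ArrowConversionOptions::polars_like();
    println!("{:?}", opts);
}
```

A schema conversion could then take such an options struct and pick the Arrow type per Delta primitive, rather than a single size flag for the whole schema.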


codecov bot commented Aug 7, 2024

Codecov Report

Attention: Patch coverage is 62.22222% with 17 lines in your changes missing coverage. Please review.

Project coverage is 72.48%. Comparing base (e0a2e5e) to head (ba40388).

Files Patch % Lines
kernel/src/engine/arrow_conversion.rs 62.22% 11 Missing and 6 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #300      +/-   ##
==========================================
- Coverage   72.51%   72.48%   -0.03%     
==========================================
  Files          43       43              
  Lines        7783     7815      +32     
  Branches     7783     7815      +32     
==========================================
+ Hits         5644     5665      +21     
- Misses       1768     1779      +11     
  Partials      371      371              

☔ View full report in Codecov by Sentry.

@ion-elgreco
Contributor Author

@nicklan closing this, as I've resolved it by allowing stuff to just pass through in delta-rs

@ion-elgreco ion-elgreco closed this Aug 7, 2024
@nicklan
Collaborator

nicklan commented Aug 7, 2024

@nicklan closing this, as I've resolved it by allowing stuff to just pass through in delta-rs

Okay thanks for letting me know.

We do probably need more discussion around this. I'm realizing more and more that these "internal" schema details unfortunately do matter.

Ideally we'll have a very clear description of what kernel exposes/provides on that front. I'll keep you posted on developments :)
