feat: arrow convenience extensions #827

roeap · 2025-04-11T11:56:04Z

What changes are proposed in this pull request?

This PR introduces convenience APIs for engines working with Arrow data. Specifically, it defines and implements ScanExt and ExpressionEvaluatorExt, extension traits that provide variants of the main Scan and ExpressionEvaluator APIs expressed in terms of Arrow RecordBatches.

PR #621 contains similar work: it defines a convenience function to handle Scan::execute results. This PR uses a TryFrom impl instead; I was unsure which approach is better.

See: #826

Also includes one cargo clippy fix.

This PR affects the following public APIs

New public methods, available when the extension traits are in scope: Scan::scan_metadata_arrow, Scan::execute_arrow, and ExpressionEvaluator::evaluate_arrow.

How was this change tested?

Additional unit tests for the new APIs.

codecov · 2025-04-11T12:07:09Z

Codecov Report

Attention: Patch coverage is 76.74419% with 10 lines in your changes missing coverage. Please review.

Project coverage is 84.99%. Comparing base (e74d18b) to head (873f7ce).

Files with missing lines	Patch %	Lines
kernel/src/engine/arrow_extensions/scan.rs	78.94%	0 Missing and 8 partials ⚠️
kernel/src/engine/arrow_extensions/evaluator.rs	60.00%	1 Missing and 1 partial ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #827      +/-   ##
==========================================
- Coverage   85.01%   84.99%   -0.02%     
==========================================
  Files          84       86       +2     
  Lines       20656    20699      +43     
  Branches    20656    20699      +43     
==========================================
+ Hits        17561    17594      +33     
- Misses       2228     2229       +1     
- Partials      867      876       +9

scovich

It would be nice if all the new extension methods had actual use sites, to give a better sense of how useful they are. Right now only execute_arrow has a real use site.

scovich · 2025-04-11T16:35:46Z

kernel/src/engine/arrow_extensions/evaluator.rs

+    fn evaluate_arrow(&self, batch: RecordBatch) -> DeltaResult<RecordBatch>;
+}
+
+impl<T: ExpressionEvaluator + ?Sized> ExpressionEvaluatorExt for T {


Why ?Sized? Are there dyn impls somewhere?

Or do we need that in order to invoke the associated function T::evaluate?
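
For context on the question, here is a minimal sketch of what `?Sized` changes in a blanket impl. The `Evaluator`, `EvaluatorExt`, `AddOne`, and `run_dyn` names are hypothetical stand-ins, not the kernel's real types, and the sketch assumes evaluators are passed around as trait objects such as `Arc<dyn Evaluator>`.

```rust
use std::sync::Arc;

// Hypothetical stand-in for a kernel-style trait; not the real ExpressionEvaluator.
trait Evaluator {
    fn evaluate(&self, input: i64) -> i64;
}

// Extension trait that layers a convenience method on top of `Evaluator`.
trait EvaluatorExt {
    fn evaluate_twice(&self, input: i64) -> i64;
}

// Generic parameters carry an implicit `Sized` bound. Relaxing it with `?Sized`
// lets this blanket impl also cover the unsized type `dyn Evaluator`.
impl<T: Evaluator + ?Sized> EvaluatorExt for T {
    fn evaluate_twice(&self, input: i64) -> i64 {
        self.evaluate(self.evaluate(input))
    }
}

// A concrete (sized) implementor, used for illustration only.
struct AddOne;

impl Evaluator for AddOne {
    fn evaluate(&self, input: i64) -> i64 {
        input + 1
    }
}

// Calls the extension method through a trait object. Method resolution
// derefs the Arc to `dyn Evaluator`, which needs `dyn Evaluator: EvaluatorExt`.
fn run_dyn(evaluator: Arc<dyn Evaluator>, input: i64) -> i64 {
    evaluator.evaluate_twice(input)
}
```

With the `?Sized` removed, `run_dyn` should stop compiling while calls on the concrete `AddOne` keep working. Calling `self.evaluate(...)` inside the impl does not by itself require `?Sized`.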

scovich · 2025-04-11T16:46:33Z

kernel/src/engine/arrow_extensions/scan.rs

+        let record_batch = ArrowEngineData::try_from_engine_data(data)?.into();
+        mask.map(|m| Ok(filter_record_batch(&record_batch, &m.into())?))
+            .unwrap_or(Ok(record_batch))


Is this a good use for Option::map_or_else?

Suggested change

-        let record_batch = ArrowEngineData::try_from_engine_data(data)?.into();
-        mask.map(|m| Ok(filter_record_batch(&record_batch, &m.into())?))
-            .unwrap_or(Ok(record_batch))
+        let record_batch = ArrowEngineData::try_from_engine_data(data)?.into();
+        mask.map_or_else(
+            || Ok(record_batch),
+            |m| Ok(filter_record_batch(&record_batch, &m.into())?),
+        )

Though simple imperative code probably wins on readability:

Suggested change

-        let record_batch = ArrowEngineData::try_from_engine_data(data)?.into();
-        mask.map(|m| Ok(filter_record_batch(&record_batch, &m.into())?))
-            .unwrap_or(Ok(record_batch))
+        let record_batch = ArrowEngineData::try_from_engine_data(data)?.into();
+        Ok(match mask {
+            Some(m) => filter_record_batch(&record_batch, &m.into())?,
+            None => record_batch,
+        })

or even

Suggested change

-        let record_batch = ArrowEngineData::try_from_engine_data(data)?.into();
-        mask.map(|m| Ok(filter_record_batch(&record_batch, &m.into())?))
-            .unwrap_or(Ok(record_batch))
+        let mut record_batch = ArrowEngineData::try_from_engine_data(data)?.into();
+        if let Some(m) = mask {
+            record_batch = filter_record_batch(&record_batch, &m.into())?;
+        }
+        Ok(record_batch)
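
To make the two imperative variants easy to compare, here is a self-contained sketch using stand-in types. `Batch`, `Mask`, `filter_batch`, `apply_mask_match`, and `apply_mask_if_let` are hypothetical names standing in for RecordBatch, the selection mask, and filter_record_batch; none of this is the kernel's or arrow's actual API.

```rust
// Hypothetical stand-ins for an arrow RecordBatch and a boolean selection mask.
type Batch = Vec<i64>;
type Mask = Vec<bool>;

// Stand-in for filter_record_batch: keeps the rows whose mask entry is true.
fn filter_batch(batch: &Batch, mask: &Mask) -> Result<Batch, String> {
    if batch.len() != mask.len() {
        return Err(format!(
            "mask length {} != batch length {}",
            mask.len(),
            batch.len()
        ));
    }
    Ok(batch
        .iter()
        .zip(mask)
        .filter(|(_, keep)| **keep)
        .map(|(value, _)| *value)
        .collect())
}

// `match` form: filter when a mask is present, otherwise return the batch as is.
fn apply_mask_match(batch: Batch, mask: Option<Mask>) -> Result<Batch, String> {
    Ok(match mask {
        Some(m) => filter_batch(&batch, &m)?,
        None => batch,
    })
}

// `if let` form: overwrite the batch in place only when a mask is present.
fn apply_mask_if_let(batch: Batch, mask: Option<Mask>) -> Result<Batch, String> {
    let mut batch = batch;
    if let Some(m) = mask {
        batch = filter_batch(&batch, &m)?;
    }
    Ok(batch)
}
```

Both forms touch the batch in only one branch at a time. The map_or_else form is left out of the sketch because, as far as I can tell, its two closures would need the batch at once, one by value and one by reference, which the borrow checker is likely to reject.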

scovich · 2025-04-11T16:57:05Z

kernel/src/engine/arrow_extensions/scan.rs

+            .map_ok(TryFrom::try_from)
+            .flatten())


IIRC, map_ok and flatten are a bad combination -- Err cases are silently dropped because they are treated as empty iterators. Does this work?

Suggested change

-            .map_ok(TryFrom::try_from)
-            .flatten())
+            .map(|result| Ok(result?.try_into()?))
+            .flatten_ok()

(depending on the error types, you might be able to drop the Ok(...?) wrapper)

(again below)
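
The concern can be reproduced with the standard library alone. In the sketch below, `Raw`, `Batch`, `convert_lossy`, and `convert_strict` are hypothetical stand-ins, `.map(|r| r.map(..))` plays the role of itertools' map_ok, and `and_then` is a std-only substitute for the `Ok(result?.try_into()?)` closure rather than the exact suggestion.

```rust
use std::convert::TryFrom;

// Hypothetical stand-ins: `Raw` plays the scan result, `Batch` the converted batch.
#[derive(Debug, PartialEq)]
struct Raw(i64);

#[derive(Debug, PartialEq)]
struct Batch(i64);

impl TryFrom<Raw> for Batch {
    type Error = String;

    // Conversion fails for negative values so the inner error path is exercised.
    fn try_from(raw: Raw) -> Result<Self, Self::Error> {
        if raw.0 >= 0 {
            Ok(Batch(raw.0))
        } else {
            Err(format!("negative value {}", raw.0))
        }
    }
}

// Mirrors map_ok + flatten: each item becomes Result<Result<Batch, _>, _>.
// `flatten` then iterates the outer Result, and an outer Err yields no items,
// so upstream errors vanish from the output.
fn convert_lossy(items: Vec<Result<Raw, String>>) -> Vec<Result<Batch, String>> {
    items
        .into_iter()
        .map(|r| r.map(Batch::try_from))
        .flatten()
        .collect()
}

// Chains the conversion with `and_then`, so both upstream and conversion
// errors stay in the stream.
fn convert_strict(items: Vec<Result<Raw, String>>) -> Vec<Result<Batch, String>> {
    items
        .into_iter()
        .map(|r| r.and_then(Batch::try_from))
        .collect()
}
```

Feeding both functions one good item, one upstream error, and one item that fails conversion gives two results from `convert_lossy` (the upstream error is gone) and three from `convert_strict`.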

github-actions bot assigned roeap Apr 11, 2025

feat: arrow convenience extensions

d6e9c4e

roeap force-pushed the arrow-extensions branch from 16bce4e to d6e9c4e Compare April 11, 2025 11:59

chore: clippy

873f7ce

roeap requested review from nicklan, scovich and zachschuermann April 11, 2025 12:08

roeap mentioned this pull request Mar 7, 2025

[tracking] Kernelize! delta-io/delta-rs#3298

Open
