
Conversation

Contributor

@kosiew kosiew commented Oct 22, 2025

Which issue does this PR close?

Closes #1172

Rationale for this change

When multiple scalar UDFs are chained in Python, the intermediate results lose PyArrow extension metadata.
This happens because the existing binding passed only arrow::datatypes::DataType to Rust's ScalarUDF, discarding the extension information embedded in each pyarrow.Field.

This patch ensures that DataFusion’s Python UDF layer preserves the complete field metadata, allowing extension arrays (e.g. arrow.uuid, custom logical types) to survive round-trips between Python and Rust.
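The failure mode can be sketched with plain-Python stand-ins (toy classes, not the real pyarrow or DataFusion API): when only the value type crosses the language boundary, field-level metadata such as the extension name is dropped.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Field:
    """Toy stand-in for pyarrow.Field: a data type plus key/value metadata."""
    name: str
    data_type: str
    metadata: tuple = ()  # e.g. (("ARROW:extension:name", "arrow.uuid"),)

def old_binding(f: Field) -> str:
    # The previous binding forwarded only the DataType to Rust,
    # so extension metadata never crossed the boundary.
    return f.data_type

def new_binding(f: Field) -> Field:
    # The patched binding forwards the full Field, metadata included.
    return f

uuid_field = Field("v", "fixed_size_binary[16]",
                   (("ARROW:extension:name", "arrow.uuid"),))

assert old_binding(uuid_field) == "fixed_size_binary[16]"       # metadata gone
assert new_binding(uuid_field).metadata == uuid_field.metadata  # preserved
```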


What changes are included in this PR?

🔧 Python (python/datafusion/user_defined.py)

  • Introduced PyArrowArray and PyArrowArrayT aliases for unified typing of Array and ChunkedArray.

  • Added normalization utilities:

    • _normalize_field, _normalize_input_fields, _normalize_return_field
    • _wrap_extension_value and _wrap_udf_function to automatically re-wrap extension arrays on UDF input/output.
  • Updated ScalarUDF constructor and decorator overloads to accept both pa.Field and pa.DataType objects.

  • Ensured ScalarUDF passes fully qualified Field objects (with metadata) to the internal layer.
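The normalization step can be sketched in plain Python (a toy Field stands in for pyarrow.Field; the helper names mirror the PR but the bodies are illustrative only): a bare data type is promoted to a Field with a generated name, while an existing Field passes through with its name and metadata untouched.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Field:
    """Toy stand-in for pyarrow.Field."""
    name: str
    data_type: str
    metadata: tuple = ()

def normalize_field(value, default_name: str) -> Field:
    """Coerce either a Field or a bare data type into a Field."""
    if isinstance(value, Field):
        return value  # keep the declared name and extension metadata
    return Field(name=default_name, data_type=str(value))

def normalize_input_fields(values) -> list:
    """Normalize each declared input type, generating positional names."""
    return [normalize_field(v, f"arg_{i}") for i, v in enumerate(values)]

fields = normalize_input_fields(
    ["int64", Field("id", "fixed_size_binary[16]",
                    (("ARROW:extension:name", "arrow.uuid"),))]
)
assert fields[0] == Field("arg_0", "int64")
assert fields[1].metadata == (("ARROW:extension:name", "arrow.uuid"),)
```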

🧰 Rust (src/udf.rs)

  • Added a new PySimpleScalarUDF implementing ScalarUDFImpl:

    • Preserves arrow::datatypes::Field for inputs and return values.
    • Implements return_field_from_args to keep field names and extension metadata.
  • Updated the PyO3 binding to accept and expose Vec<Field> instead of Vec<DataType>.

  • Refactored construction to use ScalarUDF::new_from_impl().

🤖 Tests (python/tests/test_udf.py)

  • Added test_uuid_extension_chain verifying that:

    • Chained UDFs correctly round-trip arrow.uuid arrays.
    • Empty extension arrays are handled without type loss.
    • UDF input/output extension metadata remains intact.

Are these changes tested?

✅ Yes.
The new test suite test_uuid_extension_chain explicitly covers:

  • Chaining of UUID extension UDFs.
  • Handling of empty extension arrays.
  • Type preservation between UDF boundaries.

Existing decorator and parameterized UDF tests remain intact and continue to pass.
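The round-trip property the test asserts can be sketched with toy wrappers (hypothetical names modeled on _wrap_extension_value and _wrap_udf_function; the real test uses pyarrow's arrow.uuid extension type): each UDF's raw output is re-wrapped so the next UDF in the chain still sees the extension type, including for empty arrays.

```python
def wrap_extension_value(storage, ext_name):
    """Toy re-wrapping helper: tag raw storage with its extension name."""
    return {"ext": ext_name, "storage": storage}

def make_udf(fn, ext_name):
    """Wrap a UDF body so extension metadata survives input and output."""
    def wrapped(value):
        raw = fn(value["storage"])                  # body sees raw storage
        return wrap_extension_value(raw, ext_name)  # output is re-wrapped
    return wrapped

identity = make_udf(lambda s: s, "arrow.uuid")
upper = make_udf(lambda s: [x.upper() for x in s], "arrow.uuid")

start = wrap_extension_value(["ab", "cd"], "arrow.uuid")
out = upper(identity(start))                 # chained UDFs
assert out["ext"] == "arrow.uuid"            # metadata survives the chain
assert out["storage"] == ["AB", "CD"]

empty = upper(identity(wrap_extension_value([], "arrow.uuid")))
assert empty["ext"] == "arrow.uuid"          # empty arrays keep the type too
```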

Are there any user-facing changes?

Yes — enhanced behavior for PyArrow extension arrays in Python UDFs.

  • Users can now declare input_types and return_type as either pa.DataType or pa.Field.
  • Chained scalar UDFs now preserve PyArrow extension metadata (e.g. arrow.uuid, custom registered extensions).
  • Existing non-extension UDFs continue to function unchanged.

No breaking API changes are introduced — the update is fully backward-compatible while extending functionality.

kosiew added 11 commits October 22, 2025 18:34
Enhance scalar UDF definitions to retain Arrow Field
information, including extension metadata, in DataFusion.
Normalize Python UDF signatures to accept pyarrow.Field
objects, ensuring metadata survives the Rust bindings roundtrip.
Add a regression test for UUID-backed UDFs to verify
that the second UDF correctly receives a
pyarrow.ExtensionArray, preventing past metadata loss.
Wrap scalar UDF inputs/outputs to maintain extension
types during execution. Enhance UUID extension
regression test to ensure metadata retention and
normalize results for accurate comparison.
…concrete Callable[..., Any] and pa.DataType | pa.Field annotations, removing the lingering references to the deleted _R type variable.
… checkers surface support for pyarrow.Field inputs when defining UDFs
Introduce a shared alias for PyArrowArray and update the
extension wrapping helpers to ensure scalar UDF return types
are preserved when handling PyArrow arrays. Enhance ScalarUDF
signatures, overloads, and documentation to align with the
PyArrow array contract for Python scalar UDFs.
Implement a feature flag to check for UUID helper
in pyarrow. Add conditional skip to the UUID extension
UDF chaining test when the helper is unavailable,
retaining original assertions for supported environments.
Ensure collected UUID results are extension arrays or chunked arrays
with the UUID extension type before comparison to expected values,
preserving end-to-end metadata validation.
Return a wrapped empty extension array for chunked storage
arrays with no chunks, preserving extension metadata.
Expand UUID UDF regression to support chunked inputs,
test empty chunked returns, and ensure UUID extension
type remains intact through UDF chaining.
@kosiew kosiew marked this pull request as ready for review October 22, 2025 13:28
Member

@timsaucer timsaucer left a comment


I'll take another look at it later. I found the text in user_defined.py hard to understand, and a lot of it looks like LLM-generated text oriented more at a developer than at a user. I noted a couple of things.

I'll take another look when I can dedicate some more time to understand what is going on here.

src/udf.rs Outdated
fn return_type(&self, _arg_types: &[DataType]) -> datafusion::error::Result<DataType> {
    Ok(self.return_field.data_type().clone())
}
Member


Best practice is to not implement this method. Per the upstream datafusion documentation:

If you provide an implementation for Self::return_field_from_args, DataFusion will not call return_type (this function). In such cases it is recommended to return DataFusionError::Internal.

Contributor Author


Omitting the implementation prevents compilation:

error[E0046]: not all trait items implemented, missing: `return_type`
   --> src/udf.rs:124:1
    |
124 | impl ScalarUDFImpl for PySimpleScalarUDF {
    | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ missing `return_type` in implementation

I amended it to return:

Err(DataFusionError::Internal(
    "return_type should be unreachable when return_field_from_args is implemented"
        .to_string(),
))

This list must be of the same length as the number of arguments. Pass
:class:`pyarrow.Field` instances to preserve extension metadata.
return_type (pa.DataType | pa.Field): The return type of the function. Use a
:class:`pyarrow.Field` to preserve metadata on extension arrays.
Member


I think this comment is misleading. I do not think there is any guarantee that the output field metadata will be preserved; rather, this is the way in which you can set output metadata. It is entirely possible that a UDF implemented like this can still lose the metadata, for example when you consume it on the input side and output a different kind of metadata on the output side.

Contributor Author


Amended.

Comment on lines 135 to 138
@pytest.mark.skipif(
not UUID_EXTENSION_AVAILABLE,
reason="PyArrow uuid extension helper unavailable",
)
Member


Can we make the uuid extension a requirement in our developer dependencies?

Contributor Author


Removed @pytest.mark.skipif

Ensure return_field_from_args is the only metadata source
by having PySimpleScalarUDF::return_type raise an internal
error. This aligns with DataFusion guidance.

Enhance Python UDF helper documentation to clarify how
callers can declare extension metadata on both arguments
and results.
dev = [
"maturin>=1.8.1",
"numpy>1.25.0",
"pyarrow>=19.0.0",
Contributor Author

@kosiew kosiew Oct 23, 2025


This is the change added to ensure pa.uuid() is available for test_udf.py.

PyArrow 19.0 (https://arrow.apache.org/docs/19.0/python/generated/pyarrow.uuid.html) is the lowest version that provides pyarrow.uuid.

The remaining changes are VSCode automatic formatting.

Comment on lines +476 to +481
Usage:
- As a function: ``udaf(accum, input_types, return_type, state_type,``
``volatility, name)``.
- As a decorator: ``@udaf(input_types, return_type, state_type,``
``volatility, name)``.
When using ``udaf`` as a decorator, do not pass ``accum`` explicitly.
Member


It looks like the formatting got changed. Is this intentional?



Development

Successfully merging this pull request may close these issues.

ScalarUDFs created using datafusion.udf() do not propagate extension type metadata
