Hi @SgtGilly, thanks for the question. If you want to run a Python function over partitions of your data, have you looked into Daft UDFs? In UDFs, you can specify a …
I understand there is the `df.iter_partitions()` method on the Daft DataFrame that allows me to process partitions in batches, but it appears my custom logic is still executed serially when iterating over the partitions. Will Daft provide a capability that enables distributed, parallel processing of partitions, similar to Apache Spark's `mapPartitions()`? I have a use case that aligns well with Spark's `mapPartitions()`, but I am trying to move away from Spark and use a faster, more user-friendly processing engine such as Daft. I can elaborate on that use case further if the Daft team is unfamiliar with `mapPartitions()`.

Additionally, when using `df.iter_partitions()` it appears Daft is breaking my DataFrame into more partitions than `df.num_partitions()` reports. My `df` was derived from a Delta table, so do I need to call `df.into_partitions()` or `df.repartition()` before iterating? Are these expensive to run on large DataFrames (200+ GB)?