Hi @SgtGilly, thanks for the question. If you want to run a Python function over partitions of your data, have you looked into Daft UDFs? In UDFs, you can specify a …
I understand there is the `df.iter_partitions()` method on the Daft DataFrame that allows me to process partitions in batches, but it appears my custom logic is still executed serially when iterating over the partitions. Will Daft provide a capability that enables distributed, parallel processing of partitions, similar to Apache Spark's `mapPartitions()`? I have a use case that aligns well with Spark's `mapPartitions()`, but I am trying to move away from Spark and use a faster, more user-friendly processing engine such as Daft. I can elaborate on that use case further if the Daft team is unfamiliar with `mapPartitions()`.

Additionally, when using `df.iter_partitions()` it appears Daft is breaking my DataFrame into more partitions than `df.num_partitions()` reports. My `df` was derived from a Delta table, so do I need to call `df.into_partitions()` or `df.repartition()` before iterating? Are these expensive to run on large DataFrames (200+ GB)?