RFC to support non-Python/separate environment UDFs #5572
Replies: 5 comments 14 replies
-
This is cool. We could look at ClickHouse UDFs for inspiration: https://clickhouse.com/docs/sql-reference/functions/udf
Very Unix philosophy: it just relies on stdin and stdout.
-
A few months ago I did a POC on IPC-based UDFs; see #4441. For local IPC you don't even need the overhead of Flight and gRPC. You can just write the Arrow buffer via stdin. For remote IPC, Arrow Flight might make more sense though.
-
Something else I feel is worth at least calling out is UDFs implemented via FFI.
-
WASI has become a mature specification. IMO it truly offers cross-language benefits with very little overhead. All popular languages are adding support for WASI.
-
Thanks to Kevin. We've indeed encountered some customer needs in this area, such as Shell UDFs and UDF environment isolation. For Container UDFs, I'd like to know whether the containers are managed by Daft. Specifically, are container start-up, shutdown, discovery, and elastic scaling controlled manually by the user, or are they controlled by Daft?
-
Background
Daft provides a convenient interface to run Python code within a Daft query through user-defined functions (UDFs). However, there is no native way to run UDFs that are not written in Python, or Python UDFs that require a different Python environment from the one running Daft.
Goals
Provide simple and performant methods to run user-defined code that has one or more of the following requirements:
In addition, these APIs should be executor-agnostic. In other words, they should work in both the native and distributed runners, and not require, say, specific Ray features.
Out of scope:
Proposal
To accomplish these goals I am proposing a three-layered solution:
Shell Command UDF
Daft should provide a built-in run_process command that takes one or more arguments (literals or expressions) and executes them as a spawned process per row.
Example:
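A minimal sketch of the per-row semantics described above, using only Python's standard subprocess module. The helper name run_process_per_row and its list-of-argument-lists input are assumptions made for illustration; this is not Daft's run_process API, whose exact signature is not given here.

```python
import subprocess
import sys


def run_process_per_row(rows):
    """Spawn one process per row and collect each row's stdout (hypothetical helper)."""
    outputs = []
    for args in rows:
        # Pass an argument list rather than a shell string, so row values
        # are never re-parsed by a shell.
        result = subprocess.run(args, capture_output=True, text=True, check=True)
        outputs.append(result.stdout.strip())
    return outputs


# Each row is the full argument vector for one spawned process.
rows = [
    [sys.executable, "-c", "print(2 + 3)"],
    [sys.executable, "-c", "print('daft'.upper())"],
]
print(run_process_per_row(rows))  # ['5', 'DAFT']
```

Spawning a process for every row is simple but pays process start-up cost per row, which is presumably part of the motivation for the IPC layer that follows.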
IPC UDF
I am not yet sure what the actual Python interface will look like for this, but it will allow users to pass in a shell command that starts the process, along with the input columns to send to it. Something like:
Then, Daft will send data through stdin and listen to stdout for outputs. We'll specify a thin protocol for this, which will support either or both:
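As one illustration of the stdin/stdout exchange, the sketch below starts a worker process once and streams a whole column through it. Newline-delimited JSON, the helper name call_ipc_udf, and the inline doubling worker are all assumptions made for this sketch; they are not the protocol or interface proposed here.

```python
import json
import subprocess
import sys

# Hypothetical worker: reads one JSON value per line on stdin and writes
# one JSON result per line on stdout.
WORKER_SOURCE = r"""
import json, sys
for line in sys.stdin:
    value = json.loads(line)
    sys.stdout.write(json.dumps(value * 2) + "\n")
    sys.stdout.flush()
"""


def call_ipc_udf(command, column):
    """Start `command` once, send the column via stdin, read results from stdout."""
    payload = "".join(json.dumps(v) + "\n" for v in column)
    proc = subprocess.run(
        command, input=payload, capture_output=True, text=True, check=True
    )
    return [json.loads(line) for line in proc.stdout.splitlines()]


print(call_ipc_udf([sys.executable, "-c", WORKER_SOURCE], [1, 2, 3]))  # [2, 4, 6]
```

Unlike the per-row shell command, the process is started once per batch, so start-up cost is amortized over all rows. A binary columnar encoding could replace the JSON lines without changing this shape.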
Container UDF
This will be the same as the IPC UDF, except that instead of providing a shell command, users provide a container image to run and communicate with.
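A rough sketch of how a container image could be reduced to the IPC case, assuming the container is launched through the Docker CLI with stdin attached. The helper container_udf_command and the image name are hypothetical, and whether Daft or the user manages the container lifecycle is not decided here.

```python
def container_udf_command(image, extra_args=()):
    """Build the command that starts a container UDF worker (hypothetical helper)."""
    # `docker run --rm -i` keeps stdin open for the IPC exchange and
    # removes the container once the worker exits.
    return ["docker", "run", "--rm", "-i", image, *extra_args]


# The resulting command could be handed to the same stdin/stdout machinery
# an IPC UDF uses; the image name below is a placeholder.
print(container_udf_command("my-org/my-udf:latest"))
# ['docker', 'run', '--rm', '-i', 'my-org/my-udf:latest']
```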