You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
Currently when working with the foreign function interface, we use protobuf parsing to pass back and forth many things which do not have FFI representations. For example in FFI_TableProvdider when we need to evaluate push down filters pass the expression as serialized bytes. This has been very helpful in getting us up and running on table providers. In fact we now have multiple users of this approach
These are the ones I know about. I suspect the actual list is even longer!
The Problem
The problem we have is twofold:
We create large libraries because we create a SessionContext at any point where we need to do the serialization and deserialization. This is because we need to know how to serialize/deserialize functions. This happens in places like parse_exprs and others.
The default SessionContext we use does not know anything about custom functions. So this method will fail when passing across custom registered functions.
In general the protobuf code requires TaskContext, FunctionRegistry, LogicalExtensionCodec, and PhysicalExtensionCodec.
I have been working on a few implementations. I have branches on my local machine where I have implemented FFI equivalents for PhysicalExpr, FunctionRegistry, and Session. I believe we can also implement a limited version of the extension codecs. With these implementations we could pass around something like a FFI_Session where the code can then get the task context.
The core implementation problem:
Suppose I want to create a FFI_TableProvider and what we want to do is pass along the FFI_Session. I might need to instead pass FFI_FunctionRegistry, or both - depends on the need of which FFI interface we're talking about. This means I am cloning an Arc<dyn Session> under the hood when I create my FFI_TableProvider. But then I will be registering my table provider with my session context. So now I have a loop on the Arc. The Session Context holds the table provider which holds the session context.
Please correct the above thinking if I've overlooked something.
Proposed solutions
These are the approaches I've been considering:
Push forward with passing along the FFI_Session, FFI_FunctionRegistry, etc when creating the FFI_TableProvider (or others). This means a pretty decent breaking change to the existing implementations, but it may in the long term be well worth the effort. This would mean resolving my concern above.
Create some kind of singleton in these libraries and have the users call setting functions on it to pass along the function registry, session, etc. I really don't like this idea because it violates some of the ideas I have about what it means to hold one or many SessionContexts. In practice, this may not actually be a problem. I'm open to input.
Keep going further with the FFI coverage to include logical expressions. Since I have PhysicalExpr trait implemented, it's one more thing to duplicate and maintain. If we do this then we can basically remove the protobuf serialization with the notable exception of ScalarValue. That comes from the datafusion-proto-common crate which is relatively light.
Maybe there is another solution or a misunderstanding in my evaluation. I'm open to all input and insight.
reacted with thumbs up emoji reacted with thumbs down emoji reacted with laugh emoji reacted with hooray emoji reacted with confused emoji reacted with heart emoji reacted with rocket emoji reacted with eyes emoji
Uh oh!
There was an error while loading. Please reload this page.
Uh oh!
There was an error while loading. Please reload this page.
-
Current Status
Currently when working with the foreign function interface, we use protobuf parsing to pass back and forth many things which do not have FFI representations. For example in
FFI_TableProvdiderwhen we need to evaluate push down filters pass the expression as serialized bytes. This has been very helpful in getting us up and running on table providers. In fact we now have multiple users of this approachThese are the ones I know about. I suspect the actual list is even longer!
The Problem
The problem we have is twofold:
SessionContextat any point where we need to do the serialization and deserialization. This is because we need to know how to serialize/deserialize functions. This happens in places likeparse_exprsand others.SessionContextwe use does not know anything about custom functions. So this method will fail when passing across custom registered functions.In general the protobuf code requires
TaskContext,FunctionRegistry,LogicalExtensionCodec, andPhysicalExtensionCodec.I have been working on a few implementations. I have branches on my local machine where I have implemented FFI equivalents for
PhysicalExpr,FunctionRegistry, andSession. I believe we can also implement a limited version of the extension codecs. With these implementations we could pass around something like aFFI_Sessionwhere the code can then get the task context.The core implementation problem:
Suppose I want to create a
FFI_TableProviderand what we want to do is pass along theFFI_Session. I might need to instead passFFI_FunctionRegistry, or both - depends on the need of which FFI interface we're talking about. This means I am cloning anArc<dyn Session>under the hood when I create myFFI_TableProvider. But then I will be registering my table provider with my session context. So now I have a loop on theArc. The Session Context holds the table provider which holds the session context.Please correct the above thinking if I've overlooked something.
Proposed solutions
These are the approaches I've been considering:
FFI_Session,FFI_FunctionRegistry, etc when creating theFFI_TableProvider(or others). This means a pretty decent breaking change to the existing implementations, but it may in the long term be well worth the effort. This would mean resolving my concern above.SessionContexts. In practice, this may not actually be a problem. I'm open to input.PhysicalExprtrait implemented, it's one more thing to duplicate and maintain. If we do this then we can basically remove the protobuf serialization with the notable exception ofScalarValue. That comes from thedatafusion-proto-commoncrate which is relatively light.Maybe there is another solution or a misunderstanding in my evaluation. I'm open to all input and insight.
Beta Was this translation helpful? Give feedback.
All reactions