Skip to content

Expose execution time zone to UDFs #16573

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

findepi
Copy link
Member

@findepi findepi commented Jun 26, 2025

The execution time zone serves as the session zone from SQL semantics
perspective. In particular, it influences how a CAST form timestamp to
timestamp with time zone is performed. This is, however, applied at
query parser stage. The execution/session time zone should be also
available to functions in both invoke and simplify APIs.

relates to

@github-actions github-actions bot added sql SQL Planner logical-expr Logical plan and expressions physical-expr Changes to the physical-expr crates core Core DataFusion crate common Related to common crate proto Related to proto crate functions Changes to functions implementation ffi Changes to the ffi crate spark labels Jun 26, 2025
findepi added 2 commits June 26, 2025 17:59
Previously the field was optional. However, when set to a None value, it
would trigger invalid behavior, with `timestamp with time zone` type
being parsed as if it was just `timestamp` type.
The execution time zone serves as the session zone from SQL semantics
perspective. In particular, it influences how a CAST form `timestamp` to
`timestamp with time zone` is performed. This is, however, applied at
query parser stage. The execution/session time zone should be also
available to functions in both invoke and simplify APIs.
@findepi findepi force-pushed the findepi/expose-zone branch from 7a3c90e to 378bf35 Compare June 26, 2025 16:06
@github-actions github-actions bot added the physical-plan Changes to the physical-plan crate label Jun 26, 2025
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @findepi -- I have a few code level comments, but in general I think this approach to threading the timezone down makes a lot of sense to me

FYI @Omega359 - what do you think of this particular approach?

@@ -133,7 +133,7 @@ fn test_evaluate_with_start_time(
date_time: &DateTime<Utc>,
) {
let execution_props =
ExecutionProps::new().with_query_execution_start_time(*date_time);
ExecutionProps::new().with_query_execution_start_time(*date_time, "UTC".into());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this uses "UTC" but the default timezone in session config is "+0:00"

Is this discrepancy intended?

UTC is used in other tests as well

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i thought that "+0:00" and "UTC" is the same thing

@@ -61,15 +65,18 @@ impl ExecutionProps {
pub fn with_query_execution_start_time(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is strange to have a function named with_query_execution_start_time that sets both start time and timezone

Would it be possible to make a second separate function for setting the timezone?

) -> Self {
self.query_execution_start_time = query_execution_start_time;
self.query_execution_time_zone = query_execution_time_zone;
self
}

/// Marks the execution of query started timestamp.
/// This also instantiates a new alias generator.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the comment could also use some updating

@@ -311,6 +311,8 @@ pub struct ScalarFunctionArgs {
/// or `return_field_from_args`) when creating the physical expression
/// from the logical expression
pub return_field: FieldRef,
/// The configured execution time zone (a.k.a. session time zone)
pub execution_time_zone: String,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we add a String here it will require another allocation and copy on each function call for each batch

Would it be possible to make this Arc<str> (more similar to a Java string which is refcounted and so less expensive to pass around?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I though that zones are short strings and short strings are cheap to copy. while Arc clone involves atomic CAS (which i believe is unlike in Java).
I don't have numbers to base this on, and may be totally wrong, but i read the title of the following page https://blog.031410.xyz/blog/arc_str_vs_string_is_it_really_faster

If we care about perf here, then even cheaper would be to pass &str here. Requires adding a lifetime to ScalarFunctionArgs (breaking change), and that's likely the only downside.

@Omega359
Copy link
Contributor

Thank you @findepi -- I have a few code level comments, but in general I think this approach to threading the timezone down makes a lot of sense to me

FYI @Omega359 - what do you think of this particular approach?

I'll be honest. I think it's too limited. For example, you still wouldn't have access to any custom config, var providers, etc.

@findepi
Copy link
Member Author

findepi commented Jun 27, 2025

I'll be honest. I think it's too limited. For example, you still wouldn't have access to any custom config, var providers, etc.

I definitely don't need that, maybe because I don't know what good var providers can do for me.

Do we need to know all of the desirable items from the start, or just anticipate we gonna evolve set of what's accessible as the need arises?

  • we wrap the time zone into a new struct. It will store time zone
    • and maybe execution start time for now.
  • The struct will be passed by reference to ScalarFunctionArgs
    • so ScalarFunctionArgs gains a lifetime param, a breaking change
  • this opens the door for easy exposing more info to scalars, without performance overhead, and without future breaking changes

wdyt?

@findepi
Copy link
Member Author

findepi commented Jul 1, 2025

It looks we didn't reach a conclusion here yet.
There is "just a String" option as currently implemented in this PR.
There is "a new struct passed by reference (or Arc)" option as proposed in the comment above.
There is also "add SessionConfig by reference (or Arc)" option (#13519) that I am not very supportive of.

@alamb @Omega359 thoughts?

@alamb
Copy link
Contributor

alamb commented Jul 2, 2025

It looks we didn't reach a conclusion here yet. There is "just a String" option as currently implemented in this PR. There is "a new struct passed by reference (or Arc)" option as proposed in the comment above. There is also "add SessionConfig by reference (or Arc)" option (#13519) that I am not very supportive of.

@alamb @Omega359 thoughts?

So it is still my opinion that copying fields from ConfigProperties into ExecutionProps one at a time will over time result in more and more "hair" (aka we'll have some subset of the fields and it won't be clear why some are included and some are not)

For that reason, I still think figuring out some way to pass ConfigOptions via ExecutionProps is the right thing to do long term.

Also as @Omega359 using ConfigOptions aligns us with the asyn udfs as well.

I was wondering if we can changeSessionState to have an Arc<ConfigOptions> and pass that into ExecutionProperties

We could use Arc::make_mut to implement "copy on write" semantics.

Let me see if I can whip up a demo

@findepi
Copy link
Member Author

findepi commented Jul 3, 2025

So it is still my opinion that copying fields from ConfigProperties into ExecutionProps one at a time will over time result in more and more "hair" (aka we'll have some subset of the fields and it won't be clear why some are included and some are not)

That was definitely my intention -- expose only those fields or state attributes that are relevant. Keep the API surface as small as only sufficient.
(I need blah, so i pass-through blah is a trivial change to make by anyone who needs blah in the future.)

If that's controversial, or problematic, then sure we can expose all the config options like in #16661

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
common Related to common crate core Core DataFusion crate ffi Changes to the ffi crate functions Changes to functions implementation logical-expr Logical plan and expressions physical-expr Changes to the physical-expr crates physical-plan Changes to the physical-plan crate proto Related to proto crate spark sql SQL Planner
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants