-
Notifications
You must be signed in to change notification settings - Fork 254
feat(datafusion): Treat timestamp conversion functions like a cast. #945
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Changes from all commits
d3a2bdd
0e07afb
735e7df
20044dc
dca0304
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change | ||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
@@ -17,7 +17,10 @@ | |||||||||||||||||
|
||||||||||||||||||
use std::vec; | ||||||||||||||||||
|
||||||||||||||||||
use datafusion::logical_expr::{Expr, Operator}; | ||||||||||||||||||
use datafusion::arrow::datatypes::{DataType, TimeUnit}; | ||||||||||||||||||
use datafusion::functions::datetime::to_date::ToDateFunc; | ||||||||||||||||||
use datafusion::functions::datetime::to_timestamp::ToTimestampFunc; | ||||||||||||||||||
use datafusion::logical_expr::{Cast, Expr, Operator}; | ||||||||||||||||||
use datafusion::scalar::ScalarValue; | ||||||||||||||||||
use iceberg::expr::{BinaryExpression, Predicate, PredicateOperator, Reference, UnaryExpression}; | ||||||||||||||||||
use iceberg::spec::Datum; | ||||||||||||||||||
|
@@ -119,7 +122,53 @@ fn to_iceberg_predicate(expr: &Expr) -> TransformedResult { | |||||||||||||||||
_ => TransformedResult::NotTransformed, | ||||||||||||||||||
} | ||||||||||||||||||
} | ||||||||||||||||||
Expr::Cast(c) => to_iceberg_predicate(&c.expr), | ||||||||||||||||||
Expr::Cast(c) => { | ||||||||||||||||||
if DataType::Date32 == c.data_type || DataType::Date64 == c.data_type { | ||||||||||||||||||
match c.expr.as_ref() { | ||||||||||||||||||
Expr::Literal(ScalarValue::Utf8(Some(literal))) => { | ||||||||||||||||||
let date = literal.split('T').next(); | ||||||||||||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I would be reluctant to do any string manipulation here. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Instead, I would add support for the datetime format in iceberg-rust/crates/iceberg/src/spec/values.rs Lines 1459 to 1466 in 75c7e8d
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @Fokko currently I think this is a mistake given the name of the function (date_from_str and not datetime_from_str) and its documentation, but I don't think I can change its behaviour so drastically now right? Do you suggest I change it, or create another one that is fix? |
||||||||||||||||||
if let Some(date) = date { | ||||||||||||||||||
return TransformedResult::Literal(Datum::string(date)); | ||||||||||||||||||
} | ||||||||||||||||||
} | ||||||||||||||||||
_ => return TransformedResult::NotTransformed, | ||||||||||||||||||
} | ||||||||||||||||||
} | ||||||||||||||||||
to_iceberg_predicate(&c.expr) | ||||||||||||||||||
} | ||||||||||||||||||
Expr::ScalarFunction(func) => { | ||||||||||||||||||
if func | ||||||||||||||||||
.func | ||||||||||||||||||
.inner() | ||||||||||||||||||
.as_any() | ||||||||||||||||||
.downcast_ref::<ToTimestampFunc>() | ||||||||||||||||||
Fokko marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||||||||||||||
.is_some() | ||||||||||||||||||
// More than 1 argument means it's a custom format - not | ||||||||||||||||||
// supported for now | ||||||||||||||||||
&& func.args.len() == 1 | ||||||||||||||||||
{ | ||||||||||||||||||
return to_iceberg_predicate(&Expr::Cast(Cast::new( | ||||||||||||||||||
Box::new(func.args[0].clone()), | ||||||||||||||||||
DataType::Timestamp(TimeUnit::Nanosecond, None), | ||||||||||||||||||
))); | ||||||||||||||||||
} | ||||||||||||||||||
if func | ||||||||||||||||||
.func | ||||||||||||||||||
.inner() | ||||||||||||||||||
.as_any() | ||||||||||||||||||
.downcast_ref::<ToDateFunc>() | ||||||||||||||||||
.is_some() | ||||||||||||||||||
// More than 1 argument means it's a custom format - not | ||||||||||||||||||
// supported for now | ||||||||||||||||||
&& func.args.len() == 1 | ||||||||||||||||||
{ | ||||||||||||||||||
return to_iceberg_predicate(&Expr::Cast(Cast::new( | ||||||||||||||||||
Box::new(func.args[0].clone()), | ||||||||||||||||||
DataType::Date32, | ||||||||||||||||||
))); | ||||||||||||||||||
} | ||||||||||||||||||
TransformedResult::NotTransformed | ||||||||||||||||||
} | ||||||||||||||||||
_ => TransformedResult::NotTransformed, | ||||||||||||||||||
} | ||||||||||||||||||
} | ||||||||||||||||||
|
@@ -403,4 +452,55 @@ mod tests { | |||||||||||||||||
Reference::new("ts").greater_than_or_equal_to(Datum::string("2023-01-05T00:00:00")); | ||||||||||||||||||
assert_eq!(predicate, expected_predicate); | ||||||||||||||||||
} | ||||||||||||||||||
|
||||||||||||||||||
#[test] | ||||||||||||||||||
fn test_to_timestamp_comparison_creates_predicate() { | ||||||||||||||||||
let sql = "ts >= timestamp '2023-01-05T00:00:00'"; | ||||||||||||||||||
let predicate = convert_to_iceberg_predicate(sql).unwrap(); | ||||||||||||||||||
let expected_predicate = | ||||||||||||||||||
Reference::new("ts").greater_than_or_equal_to(Datum::string("2023-01-05T00:00:00")); | ||||||||||||||||||
assert_eq!(predicate, expected_predicate); | ||||||||||||||||||
} | ||||||||||||||||||
omerhadari marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||||||||||||||
|
||||||||||||||||||
#[test] | ||||||||||||||||||
fn test_to_timestamp_comparison_to_cast_creates_predicate() { | ||||||||||||||||||
let sql = "ts >= CAST('2023-01-05T00:00:00' AS TIMESTAMP)"; | ||||||||||||||||||
let predicate = convert_to_iceberg_predicate(sql).unwrap(); | ||||||||||||||||||
let expected_predicate = | ||||||||||||||||||
Reference::new("ts").greater_than_or_equal_to(Datum::string("2023-01-05T00:00:00")); | ||||||||||||||||||
assert_eq!(predicate, expected_predicate); | ||||||||||||||||||
} | ||||||||||||||||||
|
||||||||||||||||||
#[test] | ||||||||||||||||||
fn test_to_timestamp_with_custom_format_does_not_create_predicate() { | ||||||||||||||||||
let sql = | ||||||||||||||||||
"TO_TIMESTAMP(ts, 'YYYY-DD-MMTmm:HH:SS') >= CAST('2023-01-05T00:00:00' AS TIMESTAMP)"; | ||||||||||||||||||
let predicate = convert_to_iceberg_predicate(sql); | ||||||||||||||||||
assert_eq!(predicate, None); | ||||||||||||||||||
} | ||||||||||||||||||
|
||||||||||||||||||
#[test] | ||||||||||||||||||
fn test_to_date_comparison_creates_predicate() { | ||||||||||||||||||
let sql = "ts >= CAST('2023-01-05T11:11:11' AS DATE)"; | ||||||||||||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'm surprised to see that this works:
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Personally I would have preferred an error, but nobody is asking me 😆 |
||||||||||||||||||
let predicate = convert_to_iceberg_predicate(sql).unwrap(); | ||||||||||||||||||
let expected_predicate = | ||||||||||||||||||
Reference::new("ts").greater_than_or_equal_to(Datum::string("2023-01-05")); | ||||||||||||||||||
assert_eq!(predicate, expected_predicate); | ||||||||||||||||||
} | ||||||||||||||||||
|
||||||||||||||||||
#[test] | ||||||||||||||||||
/// When casting to DATE, usually the value is converted to datetime or timestamp, | ||||||||||||||||||
/// and then it is truncated. DayTransform is not yet supported fully here. | ||||||||||||||||||
/// It is specifically implemented for Strings because it is the most common use case. | ||||||||||||||||||
/// When actual support is implemented, this test will fail and should be removed. | ||||||||||||||||||
/// For now it is here in order to make sure that the value from within the cast | ||||||||||||||||||
/// is not used as-is when casting to date, because it can create false predicates. | ||||||||||||||||||
/// | ||||||||||||||||||
/// (Consider for example `ts > CAST('2023-01-05T11:11:11' AS DATE)` which should | ||||||||||||||||||
/// create a different predicate than `ts > CAST('2023-01-05T11:11:11' AS TIMESTAMP)`) | ||||||||||||||||||
fn test_to_date_from_non_string_is_ignored() { | ||||||||||||||||||
let sql = "ts >= CAST(123456789 AS DATE)"; | ||||||||||||||||||
let predicate = convert_to_iceberg_predicate(sql); | ||||||||||||||||||
assert_eq!(predicate, None); | ||||||||||||||||||
} | ||||||||||||||||||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have concerns handling this case in such a way here. This is a process of simplifying expression in datafusion, which should call datafusion's expression simplification api.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is less about simplification, rather than extracting the right information for Iceberg to optimize.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe the word simplification is a little confusing, I think constant folding is better here. The expressions in tests should be simplified by constant folding and replaced with a literal, which could be processed by iceberg.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Constant folding is being done I think, (e.g.
TO_TIMESTAMP(2+2)
=>
TO_TIMESTAMP(4)
) but I am not sure it makes sense to turnTO_TIMESTAMP(<date_str>)
toCAST(date_str AS TIMESTAMP)
as a simplification in Datafusion. Currently the implementation ofCAST
andTO_TIMESTAMP
is the same with no parameters, but I don't think turningTO_TIMESTAMP
toCAST
is a "simplification" necessarily. It may even be counter intuitive and effect errors in some weird ways, with no gains to readability or performance.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By simplification I don't mean converting
TO_TIMESAMPT(<date_str>)
toCAST(date_str AS TIMESTAMP)
, I mean it should be convertingTO_TIMESAMPT(<date_str>)
to<timestamp literal>
, which should be done in constant folding.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think datafusion's expression simplification api should have done this, you could see this example:
https://github.com/apache/datafusion/blob/132b21f02763cabbab85ab99e0e0c7fd6cd3d516/datafusion-examples/examples/expr_api.rs#L153
https://github.com/apache/datafusion/blob/132b21f02763cabbab85ab99e0e0c7fd6cd3d516/datafusion-examples/examples/expr_api.rs#L180
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I don't mind it being in in DataFusion, I'll try opening an issue/PR there and see.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't seem a datafusion problem, maybe we should call the simplification api in the iceberg planner?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would be redundant I think. We should call it in the tests to mimic production behaviour.
The current simplification API in DataFusion simplifies casts, not these specific function calls (TO_TIMESTAMP, TO_DATE) as far as I could tell.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure about the details, but I guess in iceberg-datafusion planner we missed sth. It's fine for me to seek for help in datafusion community.