-
Notifications
You must be signed in to change notification settings - Fork 58
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ColumnName tracks a path of field names instead of a simple string #445
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #445 +/- ##
==========================================
+ Coverage 78.08% 78.41% +0.33%
==========================================
Files 55 55
Lines 11605 11806 +201
Branches 11605 11806 +201
==========================================
+ Hits 9062 9258 +196
+ Misses 2043 2041 -2
- Partials 500 507 +7 ☔ View full report in Codecov by Sentry. |
fn wrap_expression(state: &mut KernelExpressionVisitorState, expr: impl Into<Expression>) -> usize { | ||
state.inflight_expressions.insert(expr.into()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an opportunistic change that makes the visitor logic simpler
(I noticed it while updating the column name visitor below).
Happy to pull it out as a separate (prefactor) PR if it's unwelcome here, just holler.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is okay here, makes sense for the simplification
ffi/src/expressions/kernel.rs
Outdated
@@ -321,7 +321,13 @@ pub unsafe extern "C" fn visit_expression( | |||
visit_expression_scalar(visitor, scalar, sibling_list_id) | |||
} | |||
Expression::Column(name) => { | |||
call!(visitor, visit_column, sibling_list_id, name.as_str().into()) | |||
// TODO: Support nested columns properly! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tracked as #423
@@ -149,7 +149,7 @@ fn visit_expression_column_impl( | |||
name: DeltaResult<String>, | |||
) -> DeltaResult<usize> { | |||
// TODO: FIXME: This is incorrect if any field name in the column path contains a period. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tracked as #423
LazyLock::new(|| Some(Arc::new(column_expr!("txn.appId").is_not_null()))); | ||
static META_PREDICATE: LazyLock<Option<ExpressionRef>> = LazyLock::new(|| { | ||
Some(Arc::new( | ||
Expr::column([SET_TRANSACTION_NAME, "appId"]).is_not_null(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a bit bulkier than the original code, but it relies less on "magic constant" code.
Do we prefer this new way as somewhat safer? Or prefer the old way as more compact?
(several more below)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I prefer safer myself.
.filter_map(|(i, f)| requested_columns.take(f.path()).map(|path| (path, i))) | ||
.filter_map(|(i, f)| { | ||
requested_columns | ||
.take(f.path().parts()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sneaky: HashSet::take
accepts Borrow
as input, and ColumnName
borrows as &[String]
. Thus, we can probe the hash set with a parquet schema ColumnPath
, and the take
returns the corresponding (already-allocated) ColumnName
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice :)
} | ||
} | ||
|
||
impl Display for ColumnName { | ||
fn fmt(&self, f: &mut Formatter<'_>) -> std::fmt::Result { | ||
(**self).fmt(f) | ||
// TODO: Handle non-simple field names better? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tracked as #444
// TODO: Marked as test-only for now, as a demonstration. Should this be Display for ColumnName, so | ||
// `ColumnName::new(["a", "b.c", "d"])` would print as `"a.[b.c].d"` instead of merely `"a.b.c.d"`? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tracked as #444
/// square brackets may contain arbitrary characters, including periods and spaces. To include a | ||
/// literal square bracket in a field name, escape it by doubling it, e.g. `"[foo]]bar]"` would | ||
/// parse as `foo]bar`. | ||
// TODO: Marked as test-only for now, as a demonstration. Should we officially support it? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tracked as #444
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same comment re [[
as above
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
flushing a couple comments
LazyLock::new(|| Some(Arc::new(column_expr!("txn.appId").is_not_null()))); | ||
static META_PREDICATE: LazyLock<Option<ExpressionRef>> = LazyLock::new(|| { | ||
Some(Arc::new( | ||
Expr::column([SET_TRANSACTION_NAME, "appId"]).is_not_null(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I prefer safer myself.
/// square brackets may contain arbitrary characters, including periods and spaces. To include a | ||
/// literal square bracket in a field name, escape it by doubling it, e.g. `"[foo]]bar]"` would | ||
/// parse as `foo]bar`. | ||
// TODO: Marked as test-only for now, as a demonstration. Should we officially support it? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same comment re [[
as above
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm! seems there's an issue in the asan stuff in expression visiting though. maybe a non-deterministic leak?
Naturally it doesn't repro on my laptop... but I see it's a use-after-free on a
and there's only one // utility to turn a slice into a char*
char* allocate_string(const KernelStringSlice slice) {
return strndup(slice.ptr, slice.len);
} Most likely this is a lurking issue with kernel string slice lifetimes. |
It worked! |
pub enum ColumnType { | ||
// A column, selected from the data, as is | ||
Selected(ColumnName), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nicklan -- I believe this was a mistake of mine in the prefactor PR -- ColumnType
intentionally tracks only top-level columns. So there's never a need for nesting and a normal String
field name suffices.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wow that was some fun parsing code :) just a couple comments
fn fmt(&self, f: &mut Formatter<'_>) -> std::fmt::Result { | ||
let mut delim = None; | ||
for s in self.iter() { | ||
use std::fmt::Write as _; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
any reason we can't use std::fmt::Write
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could. The as _
indicates that we never directly refer to the trait -- we just need its methods to be available.
type Chars<'a> = std::iter::Peekable<std::str::Chars<'a>>; | ||
|
||
/// Helper for `impl FromStr for ColumnName`. | ||
fn parse_simple(chars: &mut Chars<'_>) -> DeltaResult<(String, bool)> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
docs for bool return value?
@@ -195,9 +195,11 @@ impl DataSkippingFilter { | |||
|
|||
// Build the stats read schema by extracting the column names referenced by the predicate, | |||
// extracting the corresponding field from the table schema, and inserting that field. | |||
// | |||
// TODO: Support nested column names! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
still lgtm, one very minor nit
type Err = Error; | ||
|
||
fn from_str(s: &str) -> DeltaResult<Self> { | ||
// Ambiguous case: The empty string `""`could reasonably parse as either `ColumnName::new([""])` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit, i think this comment belongs two lines lower
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will fix in the follow-up PR that adds support for parsing column lists, since this code will anyway move.
What changes are proposed in this pull request?
Previously,
ColumnName
tracked a simple string, which could be split at.
to obtain a path of field names. But this breaks down if a field name contains any special characters that might interfere with the interpretation of a dot character.To solve this, update
ColumnName
to track an actual path of field names instead. Update all call sites as needed to support the new idiom.This PR also includes code for reliably parsing strings into
ColumnName
using period as field separator and backticks as delimiters for field names containing special characters, e.g:NOTE: This change does not magically make all operations nesting-aware. For example, code that loops over the field names of a
StructType
will continue to see nested column names as not-matched. Fixing those call sites is left as future work, tho obvious ones are flagged here as TODO.Resolves #443
This PR affects the following public APIs
The "shape" of
ColumnName
changes from string-like to slice-of-string-like, and its methods change accordingly. This change is needed because otherwise we cannot reliably handle arbitrary field names.How was this change tested?
Extensive existing unit tests.