Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ColumnName tracks a path of field names instead of a simple string #445

Merged
merged 13 commits into from
Nov 6, 2024

Conversation

scovich
Copy link
Collaborator

@scovich scovich commented Oct 29, 2024

What changes are proposed in this pull request?

Previously, ColumnName tracked a simple string, which could be split at . to obtain a path of field names. But this breaks down if a field name contains any special characters that might interfere with the interpretation of a dot character.

To solve this, update ColumnName to track an actual path of field names instead. Update all call sites as needed to support the new idiom.

This PR also includes code for reliably parsing strings into ColumnName using period as field separator and backticks as delimiters for field names containing special characters, e.g:

assert_eq!(ColumnName::new(["a", "b c", "d"]).to_string(), "a.`b c`.d");

NOTE: This change does not magically make all operations nesting-aware. For example, code that loops over the field names of a StructType will continue to see nested column names as not-matched. Fixing those call sites is left as future work, tho obvious ones are flagged here as TODO.

Resolves #443

This PR affects the following public APIs

The "shape" of ColumnName changes from string-like to slice-of-string-like, and its methods change accordingly. This change is needed because otherwise we cannot reliably handle arbitrary field names.

How was this change tested?

Extensive existing unit tests.

Copy link

codecov bot commented Oct 29, 2024

Codecov Report

Attention: Patch coverage is 91.64265% with 29 lines in your changes missing coverage. Please review.

Project coverage is 78.41%. Comparing base (0214a96) to head (b9c543e).
Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
ffi/src/expressions/engine.rs 0.00% 11 Missing ⚠️
kernel/src/expressions/column_names.rs 95.39% 3 Missing and 7 partials ⚠️
kernel/src/engine/parquet_stats_skipping/tests.rs 66.66% 6 Missing ⚠️
ffi/src/expressions/kernel.rs 0.00% 1 Missing ⚠️
kernel/src/engine/parquet_stats_skipping.rs 90.90% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #445      +/-   ##
==========================================
+ Coverage   78.08%   78.41%   +0.33%     
==========================================
  Files          55       55              
  Lines       11605    11806     +201     
  Branches    11605    11806     +201     
==========================================
+ Hits         9062     9258     +196     
+ Misses       2043     2041       -2     
- Partials      500      507       +7     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Comment on lines +44 to +45
fn wrap_expression(state: &mut KernelExpressionVisitorState, expr: impl Into<Expression>) -> usize {
state.inflight_expressions.insert(expr.into())
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an opportunistic change that makes the visitor logic simpler
(I noticed it while updating the column name visitor below).

Happy to pull it out as a separate (prefactor) PR if it's unwelcome here, just holler.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is okay here, makes sense for the simplification

@@ -321,7 +321,13 @@ pub unsafe extern "C" fn visit_expression(
visit_expression_scalar(visitor, scalar, sibling_list_id)
}
Expression::Column(name) => {
call!(visitor, visit_column, sibling_list_id, name.as_str().into())
// TODO: Support nested columns properly!
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tracked as #423

@@ -149,7 +149,7 @@ fn visit_expression_column_impl(
name: DeltaResult<String>,
) -> DeltaResult<usize> {
// TODO: FIXME: This is incorrect if any field name in the column path contains a period.
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tracked as #423

LazyLock::new(|| Some(Arc::new(column_expr!("txn.appId").is_not_null())));
static META_PREDICATE: LazyLock<Option<ExpressionRef>> = LazyLock::new(|| {
Some(Arc::new(
Expr::column([SET_TRANSACTION_NAME, "appId"]).is_not_null(),
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a bit bulkier than the original code, but it relies less on "magic constant" code.

Do we prefer this new way as somewhat safer? Or prefer the old way as more compact?

(several more below)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer safer myself.

.filter_map(|(i, f)| requested_columns.take(f.path()).map(|path| (path, i)))
.filter_map(|(i, f)| {
requested_columns
.take(f.path().parts())
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sneaky: HashSet::take accepts Borrow as input, and ColumnName borrows as &[String]. Thus, we can probe the hash set with a parquet schema ColumnPath, and the take returns the corresponding (already-allocated) ColumnName.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice :)

kernel/src/expressions/column_names.rs Outdated Show resolved Hide resolved
}
}

impl Display for ColumnName {
fn fmt(&self, f: &mut Formatter<'_>) -> std::fmt::Result {
(**self).fmt(f)
// TODO: Handle non-simple field names better?
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tracked as #444

Comment on lines 134 to 135
// TODO: Marked as test-only for now, as a demonstration. Should this be Display for ColumnName, so
// `ColumnName::new(["a", "b.c", "d"])` would print as `"a.[b.c].d"` instead of merely `"a.b.c.d"`?
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tracked as #444

/// square brackets may contain arbitrary characters, including periods and spaces. To include a
/// literal square bracket in a field name, escape it by doubling it, e.g. `"[foo]]bar]"` would
/// parse as `foo]bar`.
// TODO: Marked as test-only for now, as a demonstration. Should we officially support it?
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tracked as #444

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same comment re [[ as above

kernel/src/expressions/column_names.rs Outdated Show resolved Hide resolved
Copy link
Collaborator

@nicklan nicklan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

flushing a couple comments

LazyLock::new(|| Some(Arc::new(column_expr!("txn.appId").is_not_null())));
static META_PREDICATE: LazyLock<Option<ExpressionRef>> = LazyLock::new(|| {
Some(Arc::new(
Expr::column([SET_TRANSACTION_NAME, "appId"]).is_not_null(),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer safer myself.

kernel/src/expressions/column_names.rs Outdated Show resolved Hide resolved
/// square brackets may contain arbitrary characters, including periods and spaces. To include a
/// literal square bracket in a field name, escape it by doubling it, e.g. `"[foo]]bar]"` would
/// parse as `foo]bar`.
// TODO: Marked as test-only for now, as a demonstration. Should we officially support it?
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same comment re [[ as above

Copy link
Collaborator

@nicklan nicklan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm! seems there's an issue in the asan stuff in expression visiting though. maybe a non-deterministic leak?

@scovich
Copy link
Collaborator Author

scovich commented Oct 30, 2024

seems there's an issue in the asan stuff in expression visiting though. maybe a non-deterministic leak?

Naturally it doesn't repro on my laptop... but I see it's a use-after-free on a strndup call:

SUMMARY: AddressSanitizer: heap-use-after-free ../../../../src/libsanitizer/sanitizer_common/sanitizer_common_interceptors.inc:414 in __interceptor_strndup

and there's only one strndup invocation:

// utility to turn a slice into a char*   
char* allocate_string(const KernelStringSlice slice) {
  return strndup(slice.ptr, slice.len);
}

Most likely this is a lurking issue with kernel string slice lifetimes.
Can we prioritize merging #441 and see if that fixes it?

@scovich
Copy link
Collaborator Author

scovich commented Oct 30, 2024

Most likely this is a lurking issue with kernel string slice lifetimes.
Can we prioritize merging #441 and see if that fixes it?

It worked!

pub enum ColumnType {
// A column, selected from the data, as is
Selected(ColumnName),
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nicklan -- I believe this was a mistake of mine in the prefactor PR -- ColumnType intentionally tracks only top-level columns. So there's never a need for nesting and a normal String field name suffices.

@zachschuermann zachschuermann self-requested a review November 5, 2024 18:18
Copy link
Collaborator

@zachschuermann zachschuermann left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wow that was some fun parsing code :) just a couple comments

fn fmt(&self, f: &mut Formatter<'_>) -> std::fmt::Result {
let mut delim = None;
for s in self.iter() {
use std::fmt::Write as _;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

any reason we can't use std::fmt::Write?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could. The as _ indicates that we never directly refer to the trait -- we just need its methods to be available.

type Chars<'a> = std::iter::Peekable<std::str::Chars<'a>>;

/// Helper for `impl FromStr for ColumnName`.
fn parse_simple(chars: &mut Chars<'_>) -> DeltaResult<(String, bool)> {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

docs for bool return value?

@@ -195,9 +195,11 @@ impl DataSkippingFilter {

// Build the stats read schema by extracting the column names referenced by the predicate,
// extracting the corresponding field from the table schema, and inserting that field.
//
// TODO: Support nested column names!
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Collaborator

@nicklan nicklan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

still lgtm, one very minor nit

type Err = Error;

fn from_str(s: &str) -> DeltaResult<Self> {
// Ambiguous case: The empty string `""`could reasonably parse as either `ColumnName::new([""])`
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit, i think this comment belongs two lines lower

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will fix in the follow-up PR that adds support for parsing column lists, since this code will anyway move.

@scovich scovich merged commit 4466509 into delta-io:main Nov 6, 2024
17 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Store an actual path in ColumnName
3 participants