Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Choose an official Display format for ColumnName #444

Closed
Tracked by #86
scovich opened this issue Oct 29, 2024 · 6 comments
Closed
Tracked by #86

Choose an official Display format for ColumnName #444

scovich opened this issue Oct 29, 2024 · 6 comments
Labels
enhancement New feature or request

Comments

@scovich
Copy link
Collaborator

scovich commented Oct 29, 2024

Please describe why this is necessary.

Today, column names have no "official" string representation. We probably need to choose one.

Describe the functionality you are proposing.

Possible choices include:

  1. Print their Debug form, "[\"a\", \"b\", \"c\"]"
    • PRO: 100% unambiguous, trivial implementation
    • CON: Super annoying to read -- especially in the common case where the column name contains no special characters
  2. Print their "lossy" form: "a.b.c"
    • PRO: Super easy to implement, super easy to read in the common case where the column name contains no special characters.
    • CON: Potentially lossy if any field names contain special characters (especially '.' or engine-specific escape characters that might "mask" a dot).
  3. Choose some syntax for escaping field names, and use that. For example, "a.[b.c].d" or "a.\"b.c\".d"
    • PRO: Lossless and usually easy to read
    • CON: Have to choose an escaping syntax, which may or may not match what a given engine uses

I personally favor the third approach, using square brackets as the field escape character. Rationale:

  • Everyone wants easily readable the column names in the common case (rules out 1/), but we also want to be precise in the uncommon case (rules out 2/).
  • Square brackets in field names should be vanishingly rare, so we would seldom need to use our escaping capabilities (big plus for readability)
  • The SQL double quote escape syntax is hard to read as a string constant (the escape character must be itself escaped), and readability disintegrates entirely once both forms of escaping are in play at the same time, e.g. "a.\""\"b\"\"\".c".

Additional context

No response

@scovich scovich added the enhancement New feature or request label Oct 29, 2024
@scovich scovich mentioned this issue Oct 29, 2024
6 tasks
@zachschuermann
Copy link
Collaborator

3rd approach lgtm! and also wondering if we could rely on :? (Debug print) vs :#? (Debug alternative - usually more formatting/prettier)?

@nicklan
Copy link
Collaborator

nicklan commented Oct 29, 2024

Do column names actually need to implement Display? As zach suggests, maybe we could do just Debug? Then we could do "[\"a\", \"b\", \"c\"]" for :? and "a.[b.c].d" (or ColName("a.[b.c].d")) for :#?.

Also, if we think engines will want a way to go in and out of string representations, why not have a to_delimited_string or similar that can take either a delimiting style enum, or delimiter start and end chars, so they can get exactly what they want.

This might be over-complicating something that could be easy though, so I'm also okay for now to just go with option 3.

@scovich
Copy link
Collaborator Author

scovich commented Oct 30, 2024

Do column names actually need to implement Display?

We need something because Expression implements Display. It could be implemented in terms of Debug for ColumnName, but whatever we show should ideally be human-readable (since that's the purpose of Display).

@scovich
Copy link
Collaborator Author

scovich commented Oct 30, 2024

wondering if we could rely on :? (Debug print) vs :#? (Debug alternative - usually more formatting/prettier)?

AFAIK :#? is usually bulkier -- "prettier" in the "lots of indentation and newlines" sense. Probably not quite what we want for column names? But maybe we could have :? either do ColumnName::to_string_lossy (basically, path.join(".")), or option 3; while :#? falls back to whatever the derived alt Debug impl would have done?

@scovich
Copy link
Collaborator Author

scovich commented Oct 30, 2024

if we think engines will want a way to go in and out of string representations, why not have a to_delimited_string or similar that can take either a delimiting style enum, or delimiter start and end chars, so they can get exactly what they want.

This might be over-complicating something that could be easy though

I think we'll find that ~every engine has a slightly different quirky approach to parsing column names... really not in kernel's charter to tackle that.

I was more thinking about something from "default engine" perspective -- we can provide a sensible default for engines that want it, which happens to also be useful internally for printing out column names and expressions that involve them.

It's also a potential way to send column names across the FFI boundary: kernel defines the parsing spec we understand, and engine is responsible to provide column names in that format. Requires memory allocation and parsing tho (so not my preferred approach).

@scovich
Copy link
Collaborator Author

scovich commented Nov 7, 2024

Update: Some Delta table properties include column names, emitted in delta-spark format (backticks as escape character). So we're forced do do that unless we want to support two display/parsing styles, and #445 did it.

@scovich scovich closed this as completed Nov 7, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

3 participants