Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(transforms): remove table identifier inherited from inner query when eliminating distinct on #4286

Closed
wants to merge 1 commit into from
Closed
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion sqlglot/transforms.py
Original file line number Diff line number Diff line change
Expand Up @@ -183,7 +183,7 @@ def eliminate_distinct_on(expression: exp.Expression) -> exp.Expression:
and isinstance(expression.args["distinct"].args["on"], exp.Tuple)
):
distinct_cols = expression.args["distinct"].pop().args["on"].expressions
outer_selects = expression.selects
outer_selects = [field.name for field in expression.selects]
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this doesn't seem right, what if field is not a column and is an expression?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, moreover if the original query had both table1.col and table2.col in the projections, this would produce two cols in the new one.

Copy link
Author

@richard-merrick richard-merrick Oct 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I'm understanding correctly under the current implementation this can only work where outer_selects are unique columns or *, and this isn't changed by this PR.

Currently this function takes an existing query, adds a row count window func to use for distincting, and converts it to a subquery aliased as _t.
The outer query then selects from this subquery where _row_number = 1
Selecting anything other than unique column names without table identifiers or * in the outer select won't resolve. eg:

SELECT [expression logic] AS column_alias 
FROM (
  SELECT 
    [expression logic] AS column_alias, 
    ROW_NUMBER() OVER (...) AS _row_count
  FROM table1
) AS "_t"
WHERE _row_count = 1

^ won't work as presumably the expression is related to the inner query tables, not the outer

SELECT table1.col, table2.col 
FROM (
  SELECT table1.col, table2.col, ROW_COUNT() OVER (...) as _row_count
  FROM table1 inner join table2 ...
) AS "_t" 
WHERE _row_count = 1

^ won't work as the reference to table1, table2 is invalid where the subquery is aliased as "_t"

This PR is solely intended to fix situations where the table identifier is added to the outer query selects, but is referencing the table identifiers from inside the subquery rather than the identifier of the subquery itself.

Obviously let me know if I am misunderstanding this or you think this requires a different solution.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we can just do SELECT * instead for the outer query... Or, try to give unique (e.g. auto-generated) names to the inner projections as a best-effort approach, and then reference those in the outer query.

The SELECT * is probably the simplest approach, albeit introducing non-deterministic behavior, but maybe that's fine for transpilation purposes.

A more sophisticated approach could be to only project columns in the inner query, giving them unique aliases (preserving existing ones or those that have valid "output names" where applicable), and then reproducing whatever complex expression in the outer query by referencing those unique names.

So, if you had a projection t1.x + t2.x + 1, you'd project t1.x and t2.x, giving them unique aliases, and then in the outer query you'd reference them to reconstruct that sum.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But re: what happens today, you're probably right. The transformation seems a bit naive.

row_number = find_new_name(expression.named_selects, "_row_number")
window = exp.Window(this=exp.RowNumber(), partition_by=distinct_cols)
order = expression.args.get("order")
Expand Down
Loading