-
Notifications
You must be signed in to change notification settings - Fork 700
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix(transforms): remove table identifier inherited from inner query when eliminating distinct on #4286
Conversation
…hen eliminating distinct on
@@ -183,7 +183,7 @@ def eliminate_distinct_on(expression: exp.Expression) -> exp.Expression: | |||
and isinstance(expression.args["distinct"].args["on"], exp.Tuple) | |||
): | |||
distinct_cols = expression.args["distinct"].pop().args["on"].expressions | |||
outer_selects = expression.selects | |||
outer_selects = [field.name for field in expression.selects] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this doesn't seem right, what if field is not a column and is an expression?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1, moreover if the original query had both table1.col
and table2.col
in the projections, this would produce two col
s in the new one.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I'm understanding correctly under the current implementation this can only work where outer_selects are unique columns or *, and this isn't changed by this PR.
Currently this function takes an existing query, adds a row count window func to use for distincting, and converts it to a subquery aliased as _t
.
The outer query then selects from this subquery where _row_number = 1
Selecting anything other than unique column names without table identifiers or * in the outer select won't resolve. eg:
SELECT [expression logic] AS column_alias
FROM (
SELECT
[expression logic] AS column_alias,
ROW_NUMBER() OVER (...) AS _row_count
FROM table1
) AS "_t"
WHERE _row_count = 1
^ won't work as presumably the expression is related to the inner query tables, not the outer
SELECT table1.col, table2.col
FROM (
SELECT table1.col, table2.col, ROW_COUNT() OVER (...) as _row_count
FROM table1 inner join table2 ...
) AS "_t"
WHERE _row_count = 1
^ won't work as the reference to table1, table2 is invalid where the subquery is aliased as "_t"
This PR is solely intended to fix situations where the table identifier is added to the outer query selects, but is referencing the table identifiers from inside the subquery rather than the identifier of the subquery itself.
Obviously let me know if I am misunderstanding this or you think this requires a different solution.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we can just do SELECT *
instead for the outer query... Or, try to give unique (e.g. auto-generated) names to the inner projections as a best-effort approach, and then reference those in the outer query.
The SELECT *
is probably the simplest approach, albeit introducing non-deterministic behavior, but maybe that's fine for transpilation purposes.
A more sophisticated approach could be to only project columns in the inner query, giving them unique aliases (preserving existing ones or those that have valid "output names" where applicable), and then reproducing whatever complex expression in the outer query by referencing those unique names.
So, if you had a projection t1.x + t2.x + 1
, you'd project t1.x
and t2.x
, giving them unique aliases, and then in the outer query you'd reference them to reconstruct that sum.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But re: what happens today, you're probably right. The transformation seems a bit naive.
can you write a test? |
I'll go ahead and close this for now, I'm not sure if the approach is right. We probably wanna solve this problem more generically. Thanks for the contribution though, I'll take a look soon and ping you when there's a fix. |
The hard coded alias (
_t
) is set for the window subquery.The outer_selects can be applied using their original table identifiers, particularly in queries with subqueries or complication, which generates invalid SQL.
eg
Sample query:
When transpiled for Redshift returns:
With the additional logic to select only the
name
attribute ofexpression.selects
this is avoided, without impacting other uses.All tests in
make check
pass