Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(transforms): remove table identifier inherited from inner query when eliminating distinct on #4286

Closed
wants to merge 1 commit into from

Conversation

richard-merrick
Copy link

@richard-merrick richard-merrick commented Oct 24, 2024

The hard coded alias (_t) is set for the window subquery.
The outer_selects can be applied using their original table identifiers, particularly in queries with subqueries or complication, which generates invalid SQL.
eg
Sample query:

SELECT DISTINCT ON (id)
    table1.*
FROM
    schema1.table1
    LEFT JOIN (SELECT
                   table2.id
                 , COUNT(1) AS rule_0_count
               FROM
                   schema1.table2
               WHERE (table2.type ILIKE 'type' AND table2.timestamp > CURRENT_DATE - INTERVAL '30 day')
               GROUP BY table2.id) rule_0 ON table1.id = rule_0.id
WHERE
    (COALESCE(rule_0_count, 0) = 0 OR rule_0_count IS NULL)

When transpiled for Redshift returns:

SELECT
    table1.*
FROM
    (SELECT
         table1.*
       , ROW_NUMBER() OVER (PARTITION BY id ORDER BY id) AS _row_number
     FROM
         schema1.table1
         LEFT JOIN (SELECT
                        table2.id
                      , COUNT(1) AS rule_0_count
                    FROM
                        schema1.table2
                    WHERE (table2."type" ILIKE 'type' AND table2."timestamp" > CURRENT_DATE - INTERVAL '30 DAY')
                    GROUP BY table2.id) AS rule_0 ON table1.id = rule_0.id
     WHERE (COALESCE(rule_0_count, 0) = 0 OR rule_0_count IS NULL)) AS _t
WHERE
    _row_number = 1

With the additional logic to select only the name attribute of expression.selects this is avoided, without impacting other uses.

All tests in make check pass

@@ -183,7 +183,7 @@ def eliminate_distinct_on(expression: exp.Expression) -> exp.Expression:
and isinstance(expression.args["distinct"].args["on"], exp.Tuple)
):
distinct_cols = expression.args["distinct"].pop().args["on"].expressions
outer_selects = expression.selects
outer_selects = [field.name for field in expression.selects]
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this doesn't seem right, what if field is not a column and is an expression?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, moreover if the original query had both table1.col and table2.col in the projections, this would produce two cols in the new one.

Copy link
Author

@richard-merrick richard-merrick Oct 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I'm understanding correctly under the current implementation this can only work where outer_selects are unique columns or *, and this isn't changed by this PR.

Currently this function takes an existing query, adds a row count window func to use for distincting, and converts it to a subquery aliased as _t.
The outer query then selects from this subquery where _row_number = 1
Selecting anything other than unique column names without table identifiers or * in the outer select won't resolve. eg:

SELECT [expression logic] AS column_alias 
FROM (
  SELECT 
    [expression logic] AS column_alias, 
    ROW_NUMBER() OVER (...) AS _row_count
  FROM table1
) AS "_t"
WHERE _row_count = 1

^ won't work as presumably the expression is related to the inner query tables, not the outer

SELECT table1.col, table2.col 
FROM (
  SELECT table1.col, table2.col, ROW_COUNT() OVER (...) as _row_count
  FROM table1 inner join table2 ...
) AS "_t" 
WHERE _row_count = 1

^ won't work as the reference to table1, table2 is invalid where the subquery is aliased as "_t"

This PR is solely intended to fix situations where the table identifier is added to the outer query selects, but is referencing the table identifiers from inside the subquery rather than the identifier of the subquery itself.

Obviously let me know if I am misunderstanding this or you think this requires a different solution.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we can just do SELECT * instead for the outer query... Or, try to give unique (e.g. auto-generated) names to the inner projections as a best-effort approach, and then reference those in the outer query.

The SELECT * is probably the simplest approach, albeit introducing non-deterministic behavior, but maybe that's fine for transpilation purposes.

A more sophisticated approach could be to only project columns in the inner query, giving them unique aliases (preserving existing ones or those that have valid "output names" where applicable), and then reproducing whatever complex expression in the outer query by referencing those unique names.

So, if you had a projection t1.x + t2.x + 1, you'd project t1.x and t2.x, giving them unique aliases, and then in the outer query you'd reference them to reconstruct that sum.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But re: what happens today, you're probably right. The transformation seems a bit naive.

@tobymao
Copy link
Owner

tobymao commented Oct 24, 2024

can you write a test?

@georgesittas
Copy link
Collaborator

I'll go ahead and close this for now, I'm not sure if the approach is right. We probably wanna solve this problem more generically. Thanks for the contribution though, I'll take a look soon and ping you when there's a fix.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants