@lschneiderbauer

This PR is a first draft that addresses duckdb/duckdb-r#384 by adding support for DISTINCT ON when using distinct(..., .keep_all = TRUE). DISTINCT ON is a SQL extension supported by PostgreSQL and DuckDB; see e.g. https://duckdb.org/docs/stable/sql/query_syntax/select.html#distinct-on-clause. Currently, .keep_all = TRUE is implemented using window functions. Using DISTINCT ON instead is expected to speed up that operation.
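To make the comparison concrete, here is a minimal sketch of the two query shapes. The flights table and its columns are hypothetical and used only for illustration, and the window-function query is only similar in shape to what dbplyr generates today, not its exact output.

```sql
-- Hypothetical table, for illustration only.
CREATE TABLE flights (carrier VARCHAR, dep_delay INTEGER, origin VARCHAR);
INSERT INTO flights VALUES ('AA', 5, 'JFK'), ('AA', 12, 'LGA'), ('UA', 3, 'EWR');

-- Window-function form: number the rows within each carrier, keep the first.
SELECT carrier, dep_delay, origin
FROM (
  SELECT flights.*, ROW_NUMBER() OVER (PARTITION BY carrier) AS rn
  FROM flights
) AS q
WHERE rn = 1;

-- DISTINCT ON form: one row per carrier, all selected columns kept.
SELECT DISTINCT ON (carrier) carrier, dep_delay, origin
FROM flights;
```

Both queries return one row per carrier. Without an ORDER BY, which row is kept within each carrier is not defined in either form.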

I came to the conclusion that this cannot be addressed within an external dbplyr backend; it requires a minor modification of the lazy_select_query data structure itself:
Currently, a lazy_select_query supports a distinct state, which can be either TRUE or FALSE (corresponding to a normal SELECT vs. a SELECT DISTINCT).
The basic idea of this PR is to add a third state to the distinct attribute, which represents the list of columns that belong to the SELECT DISTINCT ON (...) clause.
dbplyr backends can opt in to DISTINCT ON by implementing a method for the new generic supports_distinct_on() that returns TRUE.
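As a rough illustration, the three states of the distinct attribute would correspond to the following three SELECT forms. The table and column names are hypothetical, and the mapping in the comments reflects my reading of the proposal rather than the generated SQL itself.

```sql
-- Hypothetical table, for illustration only.
CREATE TABLE flights (carrier VARCHAR, dep_delay INTEGER, origin VARCHAR);

-- distinct = FALSE: plain SELECT, duplicates are kept.
SELECT carrier, origin FROM flights;

-- distinct = TRUE: SELECT DISTINCT, deduplicates on all selected columns.
SELECT DISTINCT carrier, origin FROM flights;

-- distinct = column list (proposed third state): deduplicates on the listed
-- columns only, while still returning every column of the surviving row.
SELECT DISTINCT ON (carrier) * FROM flights;
```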

Open issues

  1. A major issue is the handling of the order specification. DISTINCT ON relies on ORDER BY to determine which row is kept for each group. That is also a reason why ORDER BY is allowed in subqueries in PostgreSQL and DuckDB. As far as I can see, dbplyr currently forbids ORDER BY in subqueries. I have not yet investigated whether that can be modified easily, or changed at all. In any case, from the user's perspective, window_order() would probably still be the right verb to specify the order.
  2. The syntax highlighting of the currently generated SQL code is incorrect, probably because sql_clause() is not used correctly.
  3. Having one distinct attribute that holds either TRUE, FALSE, or a column list leads to many case distinctions in the code, which are rather unpleasant to read and complicate it. There is probably a way to model this in a more streamlined fashion.
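To illustrate the first open issue, here is a sketch of a DISTINCT ON query used as a subquery, where the inner ORDER BY is what decides which row survives. The table, columns, and the outer aggregation are hypothetical and chosen only to show why the ORDER BY cannot simply be dropped from the subquery.

```sql
-- Hypothetical table, for illustration only.
CREATE TABLE flights (carrier VARCHAR, dep_delay INTEGER, origin VARCHAR);
INSERT INTO flights VALUES ('AA', 5, 'JFK'), ('AA', 12, 'LGA'), ('UA', 3, 'EWR');

-- Inner query: for each carrier, keep the row with the largest dep_delay.
-- The DISTINCT ON column leads the ORDER BY, as PostgreSQL requires.
-- Outer query: any further operation on that result.
SELECT origin, COUNT(*) AS n
FROM (
  SELECT DISTINCT ON (carrier) carrier, dep_delay, origin
  FROM flights
  ORDER BY carrier, dep_delay DESC
) AS worst_per_carrier
GROUP BY origin;
```

Removing the inner ORDER BY would still yield one row per carrier, but no longer a well-defined one, which is why the subquery restriction matters for this feature.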

I would appreciate feedback on the open issues as well as on the existing code. It might not be the right approach after all.
