add support for DISTINCT ON
#1620
Open
This PR is a first draft to address duckdb/duckdb-r#384. It adds support for using `DISTINCT ON` with `distinct(..., .keep_all = TRUE)`. `DISTINCT ON` is a SQL variant supported by PostgreSQL and DuckDB; see e.g. https://duckdb.org/docs/stable/sql/query_syntax/select.html#distinct-on-clause. Currently, `.keep_all = TRUE` is implemented using window functions. Using `DISTINCT ON` instead promises a performance boost for that operation.

I came to the conclusion that this cannot be addressed within an external dbplyr backend; it requires a minor modification of the `lazy_select_query` data structure itself:

Currently, a `lazy_select_query` supports a `distinct` state, which can be either `FALSE` or `TRUE` (corresponding to a normal `SELECT` vs. a `SELECT DISTINCT`).

The basic idea of this PR is to add a third state to the `distinct` attribute, which represents the list of columns that belong to the `SELECT DISTINCT ON (...)` clause.

dbplyr backends can opt in to `DISTINCT ON` by implementing a method for the new generic `supports_distinct_on()` that returns `TRUE`.

Open issues

- `DISTINCT ON` uses `ORDER BY` to specify an ordering. That is also a reason why `ORDER BY` is allowed in subqueries in PostgreSQL and DuckDB. As far as I can see, dbplyr currently forbids the `ORDER BY` statement in subqueries. I have not yet investigated whether that can be modified easily, or changed at all. In any case, from the user's perspective `window_order()` would probably still be the right verb to specify the order.
- `sql_clause()` is not used correctly.
- A `distinct` attribute that holds either `TRUE`, `FALSE` or a column list representation requires many case distinctions in the code, which are rather unpleasant to read and complicate the code. There is probably a way to model this in a more streamlined fashion.

I would appreciate feedback on the open issues as well as on the existing code. It might not be the right approach after all.
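
As a starting point for the discussion about case distinctions, here is a minimal base R sketch of one possible way to model the three-state `distinct` value and the backend opt-in. It is not the code in this PR and not dbplyr's internal API: `as_distinct_spec()` and `render_select_keyword()` are hypothetical helpers, the `con` argument of `supports_distinct_on()` is an assumed signature, and the connection object is a stand-in built with `structure()`. Identifier quoting is left out to keep it short.

```r
# Assumed signature for the opt-in generic described above; backends that
# do not implement a method fall through to the default and return FALSE.
supports_distinct_on <- function(con) UseMethod("supports_distinct_on")
supports_distinct_on.default <- function(con) FALSE
supports_distinct_on.duckdb_connection <- function(con) TRUE

# Hypothetical helper: normalise TRUE / FALSE / column names into one
# tagged representation, so later code can branch on `mode` alone.
as_distinct_spec <- function(distinct) {
  if (is.character(distinct)) {
    stopifnot(length(distinct) > 0)
    return(list(mode = "on", cols = distinct))
  }
  stopifnot(is.logical(distinct), length(distinct) == 1, !is.na(distinct))
  list(mode = if (distinct) "all" else "none", cols = character())
}

# Hypothetical renderer for the opening keyword(s) of a SELECT statement.
render_select_keyword <- function(distinct) {
  spec <- as_distinct_spec(distinct)
  switch(
    spec$mode,
    none = "SELECT",
    all  = "SELECT DISTINCT",
    on   = paste0("SELECT DISTINCT ON (", paste(spec$cols, collapse = ", "), ")")
  )
}

# Stand-in object carrying only a class attribute, for illustration.
fake_con <- structure(list(), class = "duckdb_connection")

supports_distinct_on(fake_con)      # TRUE
supports_distinct_on(list())        # FALSE
render_select_keyword(FALSE)        # "SELECT"
render_select_keyword(TRUE)         # "SELECT DISTINCT"
render_select_keyword(c("g", "h"))  # "SELECT DISTINCT ON (g, h)"
```

Normalising once at the boundary keeps the `is.logical()` / `is.character()` checks in a single place. Whether that fits the existing `lazy_select_query` constructor is an open question.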