Skip to content

adds join_by #128

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 5 commits into
base: main
Choose a base branch
from
Draft

adds join_by #128

wants to merge 5 commits into from

Conversation

drizk1
Copy link
Member

@drizk1 drizk1 commented Dec 18, 2024

I am not sure if this was the best way to do it, or if i should have removed the old syntax from the docstring, but this is one possible implementation (if join_by is not the move based on what happens with #126 i can close pr )

allows for all joins:

@inner_join(a, b, join_by(City==Location, Job==Work))
@inner_join(a, b, join_by("City"=="Location", "Job"=="Work"))
@inner_join(a, b, join_by(:City=>:Location))

@kdpsingh
Copy link
Member

Thanks @drizk1 for your continued work on this. I'm going to try to merge a bunch of PRs today.

drizk1 and others added 2 commits December 27, 2024 20:01
* fixes `@summary`

* Added support for non-numeric columns, minor tweaks to column names.

---------

Co-authored-by: Karandeep Singh <[email protected]>
* fix sep_rows for all string types

* fixes `@summary` (#124)

* fixes `@summary`

* Added support for non-numeric columns, minor tweaks to column names.

---------

Co-authored-by: Karandeep Singh <[email protected]>

---------

Co-authored-by: Karandeep Singh <[email protected]>
@kdpsingh
Copy link
Member

@drizk1, thanks for working on this. In R, the reason they introduced join_by() was to handle inequality joins because they couldn't support both the original syntax and the new syntax without breaking functionality. As an example of an inequality join, join_by() allowed you to do @inner_join(a, b, join_by(col_a > col_b)), which would join only those columns where col_a in data frame a is greater than col_b in data frame b.

In Julia, I don't think we have this same limitation. I think they we can support either syntax without needing to reach for join_by(). We can certainly add a join_by() to have similar syntax, but I think we should only do it if we can't achieve the same result without join_by.

Until we can support inequality joins (either using join_by or directly within the join macros), I don't see a good reason to merge this just yet. It is largely cosmetic and introduces a second way of doing things without expanding the current join capabilities. We should also be careful about where we use strings. I don't believe R supports "column_a" == "column_b" within join_by(), instead opting to stick with non-standard eval.

My ideal future is one that supports any of the following options, with the caveat in the below paragraph.

@inner_join(a, b, col_a = col_b)
@inner_join(a, b, col_a == col_b)
@inner_join(a, b, col_a > col_b)

In terms of how to implement inequality joins, FlexiJoins has a nice implementation that we could wrap. FlexiJoins has a tendency to duplicate column names, so another idea would be to use join_by() as a way to signal that we will be relying on the FlexiJoins implementation, which may behave slightly differently than pure DataFrames.jl.

In that case, our implementation could look like this:

@inner_join(a, b, col_a = col_b) # using DataFrames.jl
@inner_join(a, b, join_by(col_a == col_b)) # using FlexiJoins.jl
@inner_join(a, b, join_by(col_a > col_b)) # using FlexiJoins.jl

Open to thoughts.

@drizk1
Copy link
Member Author

drizk1 commented Dec 29, 2024

Ahh that makes sense then why they switched to join_by(). I dont have strong feelings either way. Generally speaking, I think the join_by syntax is a bit verbose and its definitely cleaner if we can avoid it, which I suspect we could.

Using join_by to delineate the use of flexijoins is interesting, but possibly cosmetic as well because i think the duplication of columns would actually still be in line with dplyr/tidyr behavior, where a non equi-join dupes the column name/keeps both columns in the comparisons.

> left_join(sales, promos, join_by(id == id))
# A tibble: 8 × 3
     id sale_date  promo_date
  <int> <date>     <date>    
1     1 2018-12-31 2019-01-01
2     1 2018-12-31 2019-01-05
3     1 2019-01-02 2019-01-01
4     1 2019-01-02 2019-01-05
5     1 2019-01-05 2019-01-01
6     1 2019-01-05 2019-01-05
7     2 2019-01-04 2019-01-01
8     2 2019-01-01 2019-01-01

> left_join(sales, promos, join_by( id > id))
# A tibble: 7 × 4
   id.x sale_date   id.y promo_date
  <int> <date>     <int> <date>    
1     1 2018-12-31    NA NA        
2     1 2019-01-02    NA NA        
3     1 2019-01-05    NA NA        
4     2 2019-01-04     1 2019-01-01
5     2 2019-01-04     1 2019-01-05
6     2 2019-01-01     1 2019-01-01
7     2 2019-01-01     1 2019-01-05

> left_join(sales, promos, join_by( sale_date >=  promo_date))
# A tibble: 10 × 4
    id.x sale_date   id.y promo_date
   <int> <date>     <int> <date>    
 1     1 2018-12-31    NA NA        
 2     1 2019-01-02     1 2019-01-01
 3     1 2019-01-02     2 2019-01-01
 4     1 2019-01-05     1 2019-01-01
 5     1 2019-01-05     1 2019-01-05
 6     1 2019-01-05     2 2019-01-01
 7     2 2019-01-04     1 2019-01-01
 8     2 2019-01-04     2 2019-01-01
 9     2 2019-01-01     1 2019-01-01
10     2 2019-01-01     2 2019-01-01

@kdpsingh
Copy link
Member

Ah, thanks for that. I think the one weird thing is that FlexiJoins duplicates even when you do id == id, though I guess this can be dealt with.

@drizk1 drizk1 marked this pull request as draft December 29, 2024 03:20
@drizk1
Copy link
Member Author

drizk1 commented Dec 29, 2024

no problem. the duped column names is currently my thorn in tidierdb.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants