adds `join_by` #128

drizk1 · 2024-12-18T01:24:18Z

I am not sure if this was the best way to do it, or if i should have removed the old syntax from the docstring, but this is one possible implementation (if join_by is not the move based on what happens with #126 i can close pr )

allows for all joins:

@inner_join(a, b, join_by(City==Location, Job==Work))
@inner_join(a, b, join_by("City"=="Location", "Job"=="Work"))
@inner_join(a, b, join_by(:City=>:Location))

kdpsingh · 2024-12-27T18:57:56Z

Thanks @drizk1 for your continued work on this. I'm going to try to merge a bunch of PRs today.

* fixes `@summary` * Added support for non-numeric columns, minor tweaks to column names. --------- Co-authored-by: Karandeep Singh <[email protected]>

* fix sep_rows for all string types * fixes `@summary` (#124) * fixes `@summary` * Added support for non-numeric columns, minor tweaks to column names. --------- Co-authored-by: Karandeep Singh <[email protected]> --------- Co-authored-by: Karandeep Singh <[email protected]>

kdpsingh · 2024-12-28T04:19:44Z

@drizk1, thanks for working on this. In R, the reason they introduced join_by() was to handle inequality joins because they couldn't support both the original syntax and the new syntax without breaking functionality. As an example of an inequality join, join_by() allowed you to do @inner_join(a, b, join_by(col_a > col_b)), which would join only those columns where col_a in data frame a is greater than col_b in data frame b.

In Julia, I don't think we have this same limitation. I think they we can support either syntax without needing to reach for join_by(). We can certainly add a join_by() to have similar syntax, but I think we should only do it if we can't achieve the same result without join_by.

Until we can support inequality joins (either using join_by or directly within the join macros), I don't see a good reason to merge this just yet. It is largely cosmetic and introduces a second way of doing things without expanding the current join capabilities. We should also be careful about where we use strings. I don't believe R supports "column_a" == "column_b" within join_by(), instead opting to stick with non-standard eval.

My ideal future is one that supports any of the following options, with the caveat in the below paragraph.

@inner_join(a, b, col_a = col_b)
@inner_join(a, b, col_a == col_b)
@inner_join(a, b, col_a > col_b)

In terms of how to implement inequality joins, FlexiJoins has a nice implementation that we could wrap. FlexiJoins has a tendency to duplicate column names, so another idea would be to use join_by() as a way to signal that we will be relying on the FlexiJoins implementation, which may behave slightly differently than pure DataFrames.jl.

In that case, our implementation could look like this:

@inner_join(a, b, col_a = col_b) # using DataFrames.jl
@inner_join(a, b, join_by(col_a == col_b)) # using FlexiJoins.jl
@inner_join(a, b, join_by(col_a > col_b)) # using FlexiJoins.jl

Open to thoughts.

drizk1 · 2024-12-29T00:53:51Z

Ahh that makes sense then why they switched to join_by(). I dont have strong feelings either way. Generally speaking, I think the join_by syntax is a bit verbose and its definitely cleaner if we can avoid it, which I suspect we could.

Using join_by to delineate the use of flexijoins is interesting, but possibly cosmetic as well because i think the duplication of columns would actually still be in line with dplyr/tidyr behavior, where a non equi-join dupes the column name/keeps both columns in the comparisons.

> left_join(sales, promos, join_by(id == id))
# A tibble: 8 × 3
     id sale_date  promo_date
  <int> <date>     <date>    
1     1 2018-12-31 2019-01-01
2     1 2018-12-31 2019-01-05
3     1 2019-01-02 2019-01-01
4     1 2019-01-02 2019-01-05
5     1 2019-01-05 2019-01-01
6     1 2019-01-05 2019-01-05
7     2 2019-01-04 2019-01-01
8     2 2019-01-01 2019-01-01

> left_join(sales, promos, join_by( id > id))
# A tibble: 7 × 4
   id.x sale_date   id.y promo_date
  <int> <date>     <int> <date>    
1     1 2018-12-31    NA NA        
2     1 2019-01-02    NA NA        
3     1 2019-01-05    NA NA        
4     2 2019-01-04     1 2019-01-01
5     2 2019-01-04     1 2019-01-05
6     2 2019-01-01     1 2019-01-01
7     2 2019-01-01     1 2019-01-05

> left_join(sales, promos, join_by( sale_date >=  promo_date))
# A tibble: 10 × 4
    id.x sale_date   id.y promo_date
   <int> <date>     <int> <date>    
 1     1 2018-12-31    NA NA        
 2     1 2019-01-02     1 2019-01-01
 3     1 2019-01-02     2 2019-01-01
 4     1 2019-01-05     1 2019-01-01
 5     1 2019-01-05     1 2019-01-05
 6     1 2019-01-05     2 2019-01-01
 7     2 2019-01-04     1 2019-01-01
 8     2 2019-01-04     2 2019-01-01
 9     2 2019-01-01     1 2019-01-01
10     2 2019-01-01     2 2019-01-01

kdpsingh · 2024-12-29T02:10:01Z

Ah, thanks for that. I think the one weird thing is that FlexiJoins duplicates even when you do id == id, though I guess this can be dealt with.

drizk1 · 2024-12-29T03:21:45Z

no problem. the duped column names is currently my thorn in tidierdb.

drizk1 added 3 commits December 17, 2024 20:03

adds join_by

226c7da

updates join_by to docstring desc

70f4595

add join_by(col) support when names same

746fcbe

drizk1 and others added 2 commits December 27, 2024 20:01

fixes @summary (#124)

cbc4394

* fixes `@summary` * Added support for non-numeric columns, minor tweaks to column names. --------- Co-authored-by: Karandeep Singh <[email protected]>

drizk1 marked this pull request as draft December 29, 2024 03:20

drizk1 mentioned this pull request Jan 7, 2025

allow for joins on mult columns, inequality joins, fix edge case with arrays TidierOrg/TidierDB.jl#91

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

adds `join_by` #128

adds `join_by` #128

drizk1 commented Dec 18, 2024 •

edited

Loading

kdpsingh commented Dec 27, 2024

kdpsingh commented Dec 28, 2024

drizk1 commented Dec 29, 2024 •

edited

Loading

kdpsingh commented Dec 29, 2024

drizk1 commented Dec 29, 2024

adds join_by #128

Are you sure you want to change the base?

adds join_by #128

Conversation

drizk1 commented Dec 18, 2024 • edited Loading

kdpsingh commented Dec 27, 2024

kdpsingh commented Dec 28, 2024

drizk1 commented Dec 29, 2024 • edited Loading

kdpsingh commented Dec 29, 2024

drizk1 commented Dec 29, 2024

adds `join_by` #128

adds `join_by` #128

drizk1 commented Dec 18, 2024 •

edited

Loading

drizk1 commented Dec 29, 2024 •

edited

Loading