[R] Convert arrow dictionary to R factor via as.data.frame.nanoarrow_array_stream()?
#513
Comments
Thanks for bringing this up! One of the tricky things about dictionaries in Arrow is that the "levels"/"dictionary" live at the array level, not at the type level. This means that two arrays can be a […] You should be able to specify that you want a […]
library(nanoarrow)
#> Warning: package 'nanoarrow' was built under R version 4.3.3
df1 <- data.frame(
  x = as.factor(letters[1:5]),
  y = as.factor(1:5)
)
df2 <- data.frame(
  x = as.factor(letters[6:10]),
  y = as.factor(1:5)
)
# Safest/most type stable/makes the fewest assumptions to just return
# the dictionary value type
basic_array_stream(list(df1, df2)) |>
  convert_array_stream() |>
  tibble::as_tibble()
#> # A tibble: 10 × 2
#>    x     y
#>    <chr> <chr>
#>  1 a     1
#>  2 b     2
#>  3 c     3
#>  4 d     4
#>  5 e     5
#>  6 f     1
#>  7 g     2
#>  8 h     3
#>  9 i     4
#> 10 j     5
# You can specify a factor() target type if you know the levels
basic_array_stream(list(df1, df2)) |>
  convert_array_stream(
    data.frame(x = factor(levels = letters), y = factor(levels = as.character(1:5)))
  ) |>
  tibble::as_tibble()
#> # A tibble: 10 × 2
#>    x     y
#>    <fct> <fct>
#>  1 a     1
#>  2 b     2
#>  3 c     3
#>  4 d     4
#>  5 e     5
#>  6 f     1
#>  7 g     2
#>  8 h     3
#>  9 i     4
#> 10 j     5
# If you have only one batch, factor() should work as a target (but doesn't currently)
basic_array_stream(list(df1)) |>
  convert_array_stream(
    data.frame(x = factor(), y = factor())
  ) |>
  tibble::as_tibble()
#> # A tibble: 5 × 2
#>   x     y
#>   <fct> <fct>
#> 1 a     1
#> 2 b     2
#> 3 c     3
#> 4 d     4
#> 5 e     5
Created on 2024-06-09 with reprex v2.1.0
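Because each batch can carry its own dictionary, a factor target for a multi-batch stream needs levels that cover every batch. The sketch below shows one way to build such a target when the batches are already available as data frames. `union_factor_ptype()` is a hypothetical helper written for this illustration, not part of nanoarrow, and it assumes every column in every batch is a factor. The final conversion call mirrors the second example above and only runs if nanoarrow is installed.

```r
# Hypothetical helper (not a nanoarrow function): build a zero-row data.frame
# whose factor columns carry the union of the levels seen in all batches.
# Assumes every batch has the same columns and every column is a factor.
union_factor_ptype <- function(batches) {
  cols <- names(batches[[1]])
  ptype <- lapply(cols, function(col) {
    # Collect levels batch by batch, keeping first-seen order
    lvls <- unique(unlist(lapply(batches, function(b) levels(b[[col]]))))
    factor(character(), levels = lvls)
  })
  names(ptype) <- cols
  as.data.frame(ptype)
}

df1 <- data.frame(x = as.factor(letters[1:5]), y = as.factor(1:5))
df2 <- data.frame(x = as.factor(letters[6:10]), y = as.factor(1:5))

ptype <- union_factor_ptype(list(df1, df2))
levels(ptype$x)

# Pass the prototype as the conversion target, as in the example above
if (requireNamespace("nanoarrow", quietly = TRUE)) {
  result <- nanoarrow::convert_array_stream(
    nanoarrow::basic_array_stream(list(df1, df2)),
    ptype
  )
}
```

This only helps when the levels can be collected before conversion. For a stream whose batches have not been read yet, the full set of levels is not known up front, which is why a string result is the type-stable default.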
Thanks for the detailed explanation. I see, this is indeed a complicated process. Perhaps the statistics on the C interface that are currently being discussed could provide some sort of dictionary for the entire column...?
I think that […] There is also a PR open to refactor the conversion process to make it easier to add these features: #392
Maybe related to #220
I noticed that if we convert a nanoarrow_array_stream to a data.frame, dictionary columns become character vectors rather than factors.
Created on 2024-06-09 with reprex v2.1.0