Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement access to complex columns (ARRAY, MAP, STRUCT) #20

Open
henningsway opened this issue Feb 22, 2018 · 13 comments
Open

Implement access to complex columns (ARRAY, MAP, STRUCT) #20

henningsway opened this issue Feb 22, 2018 · 13 comments
Assignees

Comments

@henningsway
Copy link

henningsway commented Feb 22, 2018

As a beginner it is not immediatly clear to me, how to best use implyr to access Impala-complex types (especially maps, e.g. pull out a couple of columns and join them with the existing data-frame).

The link in the Readme is helpful (to create dbGetQuery()-requests), but a short example - possibly showing dplyr-logic - would be really cool as well. :)

@ianmcook
Copy link
Owner

Agreed, this would be useful, and I hope to do this soon. In most aspects the Impala SQL syntax is compatible with the HiveQL syntax used by Hive and Spark SQL, but Impala's syntax diverges when working with maps and arrays—Impala uses join syntax instead of lateral views. I think this does warrant an explanation and some examples in the readme, or possibly in a separate vignette about complex types.

@ianmcook ianmcook self-assigned this Feb 22, 2018
@henningsway
Copy link
Author

henningsway commented Feb 22, 2018

Thank you for your answer and this cool package, which definately eases my transition towards bigger data sets. :)

Do I understand correctly, that the currently suggested way to access complex types would be to execute SQL-queries (via dbGetQuery()) and then joining the result from this query (which extracted the complex type) with the result from a executed dplyr statement?

(I was hoping, that there would be some pipeable dplyr-frontend to extract some parts of a complex type, similar to map(df, "column") or tidyr::unnest or so.)

@ianmcook
Copy link
Owner

Currently for array and map columns, yes, I think that's right. I'm doing some experimental work right now to implement access to array and map columns in a dplyr-friendly, tidyr-style way. I'll let you know when I have something ready for you to try.

@henningsway
Copy link
Author

Sounds interesting and I will surely take that idea for a testdrive. Would be nice to stay within one system, even though the Impala SQL Syntax seems surprisingly accessible as well.

@ianmcook
Copy link
Owner

I implemented support for complex columns in the development version of implyr on GitHub. This is currently an experimental feature; the implementation is somewhat hacky and there are some limitations as described below.

@henningsway can you try this out on your data? First install the latest implyr from GitHub:
devtools::install_github("ianmcook/implyr")
Then use the impala_unnest() function, like this:
my_tbl_impala %>% impala_unnest(complex_column)

Current limitations:

  • You can use dplyr verbs after impala_unnest() but not before.
  • ARRAY, MAP, and STRUCT columns are all supported, but currently it only works with one complex column per table.
  • It currently only works with complex columns that contain scalar types, not complex types nested within complex types (like an ARRAY of STRUCT).
  • There are no examples showing how to use this function in the README yet.

@henningsway
Copy link
Author

henningsway commented Feb 28, 2018

This sounds exciting. I will try it soon and let you know.
Unfortunately, I already know, that I will encounter recurring nested columns (usually two levels) regularly. Let me see, what's already possible with the current tooling.
Out of curiosity, what kind of complexity is introduced by multiple levels of complex columns?

@ianmcook
Copy link
Owner

ianmcook commented Mar 2, 2018

I plan to implement support for arbitrarily many nested levels of complex types, but I wanted to solve the simpler case first. There are some complications and design choices I need to consider more; for example, should a single call to the impala_unnest() function perform recursive unnesting, or should the user need to call the function multiple times?

@henningsway
Copy link
Author

This is an interesting question. First I thought, that repeated calling of imapala_unnest() would be my preference, as it seemed clearer to me. But being able to extract a repeatedly nested "layer" in one line of code also seems attractive!

@henningsway
Copy link
Author

henningsway commented Mar 15, 2018

I've tried to get the unnesting working for me, but I haven't yet been successful.

This is the kind of code I tried

sometable_tbl <- tbl(impala, in_schema("my_schema", "sometable"))
sometable_tbl %>% 
  impala_unnest(nameofcomplexcolumn)

This is the kindo of error I got

Error in new_result(connection@ptr, statement) : nanodbc/nanodbc.cpp:1344: HY000:
[Cloudera][ImpalaODBC] (110) Error while executing a query in Impala: [HY000] :
AnalysisException: Could not resolve table reference: 'my_schema.sometable'

I noticed, that the complex column is not represented in sometable_tbl, when I preview it in Rstudio or collect the first 1000 lines or so.

PS: I think it may have to do with the complex column having a complex column nested within itsself. I will try to find another example to work on soon and get back to you.

@ianmcook
Copy link
Owner

Thanks for testing this! There was a bug in how column names were quoted. I just fixed this in 21a8557. This should resolve the error you observed. But I have not yet implemented support for multiple levels of nesting (complex columns within complex columns). Hoping to do that soon.

@henningsway
Copy link
Author

henningsway commented May 15, 2018

Just started using sparklyr a little more. The functionality from https://github.com/mitre/sparklyr.nested seem quite interesting and possibly relevant for implyr as well?

@ianmcook
Copy link
Owner

TBD: Look at tidyverse/dbplyr#158 and test with Impala

@ianmcook ianmcook changed the title Include implyr-example for complex types (e.g. map) in Readme Implement access to complex columns (ARRAY, MAP, STRUCT) Jul 12, 2019
@ianmcook
Copy link
Owner

TBD: Investigate if this can be redesigned for consistency with tidyr's new unnest_wider() and unnest_longer() functions

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants