fix: Always use `get_src_tbl_names()` to fetch tbl `Id`s within `dm_from_con()` #1934

owenjonesuob · 2023-07-24T10:52:13Z

According to the docs of dm_from_con(), the table_names parameter allows the user to include only specific tables in the returned dm object:

#' @param table_names
#'   A character vector of the names of the tables to include.

Currently, we use two different methods to fetch tbls:

If table_names was not provided, use dm:::get_src_tbl_names() to fetch tbl Ids, which are later passed to tbl()
If table_names was provided, pass the character-type names directly to tbl()

But the second method does not work with non-default schema names, or with multiple schemas; and we also create additional problems when we try to set names for the list of tbls (#1933).

I think we can actually use the first method in all cases, i.e. use get_src_tbl_names() to return a named list of tbl Ids.

Then we can use table_names to take a subset from that list, before passing to tbl().

Fixes #1933.

TSchiefer

This looks very good to me, thank you for your contribution!

I tried using this for SQL Server, trying to access tables in schemas, making use of the schema argument passed on in the ellipsis. This did not work, therefore I rewrote some of your code slightly, cf. comment.
Also, I think the function should fail in case specifically requested tables cannot be found, which is how it was supposed to behave before. I also changed that in the suggested code.
Does this make sense to you, could you please adapt?

TSchiefer · 2023-08-03T14:54:45Z

R/dm_from_con.R

-    set_names(src_tbl_names, nms) %>%
-    quote_ids(con) %>%
-    map(possibly(tbl, NULL), src = src)
+  tbls <- map(tbl_ids, possibly(tbl, NULL), src = src)


When using schemas this code fails as well, since table_names normally does not include the schema, yet names(tbl_ids) does.
Therefore, I would suggest to replace lines 96-109 with the following:

tbl_ids <- get_src_tbl_names(src, ..., names = .names) # Fetch only the tbls which were specifically requested if (!is.null(table_names)) { nms <- purrr::map_chr(tbl_ids, ~ .x@name[["table"]]) bad_table_names <- table_names[!(table_names %in% nms)] tbl_ids <- tbl_ids[which(nms %in% table_names)] } tbls <- map(tbl_ids, possibly(tbl, NULL), src = src) bad <- map_lgl(tbls, is_null) if (any(bad) || (exists("bad_table_names", inherits = FALSE) && length(bad_table_names) > 0)) { if (is_null(table_names)) { warn_tbl_access(names(tbls)[bad]) tbls <- tbls[!bad] } else { abort_tbl_access(union(names(tbls)[bad], bad_table_names)) } }

Many thanks for this!! I hadn't spotted the *_tbl_access() functions, those look super handy 😁 I'll certainly adjust to include those.

Regarding table_names, I've just been thinking about the following situation:

Suppose there is a table name which exists in multiple schemas, e.g. suppose schema1.table1 and schema2.table1 both exist.

Now if a user asks for:

dm_from_con(con, table_names = "table1", schemas = c("schema1", "schema2"))

What should we return in such a case?

One option would be for the user to provide "schema.table"-format names in table_names. The documentation for table_names would perhaps need some adjustment - but:

If a user has requested multiple schemas, then we can assume the user is already aware of the schema.table syntax - see discussion in comment here feat: dm_from_con() gains .names argument for pattern-based construction of table names in the dm object #1790 (comment)

There is zero ambiguity if names are specified like this

I think it works with the current names(tbl_ids) implementation?

Or maybe we should include both schema1.table1 and schema2.table1 in the returned dm object; which I think is what your suggested changes will result in? I'm conflicted about this, because I feel like once we start dealing with multiple schemas, we should always be explicit about those schemas - but maybe that's just me being too strict with SQL ideals 😛

To offer one more alternative, we could include only the first table found with that name, and then raising a warning, as described here? #1789 (comment)

I'll happily proceed with whichever you think is best!

Thanks for your reply, good catch about the same table name in different schemas.

In my opinion, if the user provides 1 table name and more than 1 schema name, as in your example, this should lead to an error: either schema is not provided, has length 1 or needs to have a length matching the length of table_names.

names(tbl_ids) is a vector of the resulting names in the dm according to the specification in the .names argument. In case schema is given with multiple schema names, the default for this argument -- as you know -- is "{.schema}.{.table}", which leads to dm-tables just like you suggested (schema1.table1 and schema2.table1), if the user gave table_names = c("table1", "table1") and schema = c("schema1", "schema2").
In general names(tbl_ids) could be anything though and for this reason it cannot be used in any conditions.

I think the error messages need to be adapted to include the schema names of the tables that could not be accessed.

Okidoki! So based on these comments so far, I think we might need to modify how table_names works, such that it can accept a list in the multiple-schemas case - something like:

# No `schema` arg => use default schema, i.e. same as one-schema case (next example) dm_from_con(con, table_names = c("table1", "table2")) # One specified schema => `table_names` is a single vector of names # Returned dm object contains: schema1.table1, schema1.table2 dm_from_con(con, schema = "schema1", table_names = c("table1", "table2")) # Multiple schemas => `table_names` is a list of vectors (must it be named?) # Returned dm object contains: schema1.table1, schema1.table2, schema2.table1, schema2.table3 dm_from_con( con, schema = c("schema1", "schema2"), table_names = list( "schema1" = c("table1", "table2"), "schema2" = c("table1", "table3") ) )

I'll make one more case for leaving things as they are currently implemented, i.e. based on names(tbl_ids), because it avoids adding complexity to table_names.

In general names(tbl_ids) could be anything though

True - but names are always constructed using the .names pattern, which is either the default (which multi-schema users must be aware of already), or specified by the user!

So the user does always know how elements of the returned dm object will be named; and therefore they know how the strings in table_names should look.

# Default schema / one schema dm_from_con(con, table_names = c("table1", "table2")) dm_from_con(con, schema = "schema1", table_names = c("table1", "table2")) # Multiple schemas, default `.names` dm_from_con( con, schemas = c("schema1", "schema2"), table_names = c("schema1.table1", "schema1.table2", "schema2.table1", "schema2.table3") ) # Multiple schemas, custom `.names` - user knows that `table_names` should match `.names` dm_from_con( con, schemas = c("schema1", "schema2"), # The user chooses `.names`... .names = "{.schema}__{.table}", # ... so, they know what `table_names` should look like! table_names = c("schema1__table1", "schema1__table2", "schema2__table1", "schema2__table3") )

Thinking back to before multiple schema functionality was added, I think one interpretation of table_names was:

"I only want the returned dm object to contain elements with names like I've specified in table_names."

And I think that with this implementation, that interpretation is still valid now in the multi-schema case.

What do you think? For either of those approaches, I reckon I now have a solid specification to go away and implement 😇 many thanks for your patience and guidance with this one!

krlmlr · 2023-08-14T13:24:13Z

Thanks for your contribution! What's the next action here? Can we add a test for this change?

krlmlr · 2023-08-14T15:36:08Z

I think we want the following:

Try dm_meta(simple = FALSE), works for select databases and allows learning keys
If this fails, try dm_meta(simple = TRUE), this should work for all databases (we need to do some work to support SQLite)

This gets rid of a lot of legacy code and simplifies things. Let me tinker with that a bit, no action needed here for now.

owenjonesuob added 2 commits July 24, 2023 10:42

Use get_src_tbl_names to fetch Ids in all cases

9baf213

Warn if specified table is not available

315ecc2

jesse-smith mentioned this pull request Aug 2, 2023

Error when supplying table_names in dm_from_con() #1936

Open

TSchiefer reviewed Aug 3, 2023

View reviewed changes

krlmlr marked this pull request as draft August 15, 2023 04:36

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: Always use `get_src_tbl_names()` to fetch tbl `Id`s within `dm_from_con()` #1934

fix: Always use `get_src_tbl_names()` to fetch tbl `Id`s within `dm_from_con()` #1934

owenjonesuob commented Jul 24, 2023

TSchiefer left a comment

TSchiefer Aug 3, 2023

owenjonesuob Aug 7, 2023

TSchiefer Aug 8, 2023 •

edited

Loading

owenjonesuob Aug 11, 2023

krlmlr commented Aug 14, 2023

krlmlr commented Aug 14, 2023

fix: Always use get_src_tbl_names() to fetch tbl Ids within dm_from_con() #1934

Are you sure you want to change the base?

fix: Always use get_src_tbl_names() to fetch tbl Ids within dm_from_con() #1934

Conversation

owenjonesuob commented Jul 24, 2023

TSchiefer left a comment

Choose a reason for hiding this comment

TSchiefer Aug 3, 2023

Choose a reason for hiding this comment

owenjonesuob Aug 7, 2023

Choose a reason for hiding this comment

TSchiefer Aug 8, 2023 • edited Loading

Choose a reason for hiding this comment

owenjonesuob Aug 11, 2023

Choose a reason for hiding this comment

krlmlr commented Aug 14, 2023

krlmlr commented Aug 14, 2023

fix: Always use `get_src_tbl_names()` to fetch tbl `Id`s within `dm_from_con()` #1934

fix: Always use `get_src_tbl_names()` to fetch tbl `Id`s within `dm_from_con()` #1934

TSchiefer Aug 8, 2023 •

edited

Loading