feat: support new metrics firehose api with get_usage() #404


Open: wants to merge 17 commits into main from toph/metrics-firehose

Conversation

toph-allen
Collaborator

@toph-allen toph-allen commented May 2, 2025

Intent

Adds support for the metrics "firehose" API.

Fixes #390

Approach

Mostly straightforward, following existing patterns in connectapi, with a few strongly opinionated choices that I hold loosely and would be open to pushback on.

  • The endpoint accepts from and to parameters. If these are timestamps, they are used as-is. If dates (with no specified times) are passed, from is treated as "start of day" and to is treated as "end of day". I think this will match user expectations — imagine selecting a start and an end date in a date range picker.
  • The data returned from the endpoint includes path and user_agent fields nested under a data field. Without special treatment these are returned as a list-column, which is awkward. I initially experimented with tidyr::unnest(), but that was slow on the larger datasets returned by this endpoint, so I wrote a custom fast_unnest_character() function which runs in about 5% (!) of the time that tidyr::unnest() takes.
    • I think the existing parse_connectapi() function could see drastic performance improvements this way too, which would directly speed up things like the megadashboard. I have been meaning to look into this for a while (investigate parse_connectapi_typed performance #383).
  • I made another change related to fix: minor improvements to time zone parsing #400. I found that the way connectapi constructed NA_datetime_, combined with its vctrs-based parsing approach, was forcibly converting all timestamps to UTC regardless of what the inner parsing functions did.
  • Removed the cyclocomp linter. R6 classes cause weirdness with it, and to work around that, its complexity limit had been set so high that the Connect class only just inched over it. At that point the linter isn't worth having enabled at all.
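The day-boundary rule described in the first bullet can be sketched as follows. This is a hypothetical helper for illustration only, not the PR's actual code (a later revision defers this handling to make_timestamp() instead): bare Dates expand to the start or end of the day, while timestamps pass through unchanged.

```r
# Hypothetical sketch of the described behavior; expand_date_bound() is
# not part of connectapi.
expand_date_bound <- function(x, side = c("from", "to")) {
  side <- match.arg(side)
  if (inherits(x, "Date")) {
    # Bare dates become day boundaries in the caller's time zone.
    time <- if (side == "from") "00:00:00" else "23:59:59"
    as.POSIXct(paste(format(x), time))
  } else {
    # Timestamps are used as-is.
    x
  }
}

expand_date_bound(as.Date("2025-05-01"), "from") # 2025-05-01 00:00:00
expand_date_bound(as.Date("2025-05-02"), "to")   # 2025-05-02 23:59:59
```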

Checklist

  • Does this change update NEWS.md (referencing the connected issue if necessary)?
  • Does this change need documentation? Have you run devtools::document()?

@toph-allen toph-allen force-pushed the toph/metrics-firehose branch from 1609a9b to 24f9dc3 Compare May 2, 2025 22:47
@toph-allen toph-allen marked this pull request as ready for review May 2, 2025 23:24
R/connect.R Outdated
Comment on lines 821 to 825
#' @description Get content usage data.
#' @param from Optional `Date` or `POSIXt`; start of the time window. If a
#' `Date`, coerced to `YYYY-MM-DDT00:00:00` in the caller's time zone.
#' @param to Optional `Date` or `POSIXt`; end of the time window. If a
#' `Date`, coerced to `YYYY-MM-DDT23:59:59` in the caller's time zone.
Contributor

I'm not sure this is doing what we expect with timezones. And actually, I'm not even totally sure what is intended. So maybe we should work that out here and then adapt the code to fit? When we send timestamps to Connect with this function do we want them to be transformed to UTC from the caller's local timezone before being sent? Or some other behavior?

One thing to note: if I'm reading this correctly, make_timestamp() behaves slightly differently for character input than for non-character input: a character string is not parsed and is not transformed into UTC. So this function will do the same.

connectapi/R/parse.R

Lines 16 to 28 in e8c8075

make_timestamp <- function(input) {
  if (is.character(input)) {
    # TODO: make sure this is the right timestamp format
    return(input)
  }
  # In the call to `safe_format`:
  # - The format specifier adds a literal "Z" to the end of the timestamp, which
  #   tells Connect "This is UTC".
  # - The `tz` argument tells R to produce times in the UTC time zone.
  # - The `usetz` argument says "Don't concatenate ' UTC' to the end of the string".
  safe_format(input, "%Y-%m-%dT%H:%M:%SZ", tz = "UTC", usetz = FALSE)
}
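The two code paths the reviewer describes can be seen in a self-contained sketch, with base `format()` standing in for the package's `safe_format()` helper (an assumption; `safe_format()` is not shown here):

```r
# Sketch only: base format() stands in for connectapi's safe_format().
make_timestamp_sketch <- function(input) {
  if (is.character(input)) {
    return(input) # passed through untouched: no parsing, no UTC conversion
  }
  format(input, "%Y-%m-%dT%H:%M:%SZ", tz = "UTC", usetz = FALSE)
}

# A POSIXct in another time zone is rendered in UTC...
make_timestamp_sketch(
  as.POSIXct("2025-05-02 18:00:00", tz = "America/New_York")
)
#> "2025-05-02T22:00:00Z"

# ...but a character string comes back as-is, even without a "Z":
make_timestamp_sketch("2025-05-02 18:00:00")
#> "2025-05-02 18:00:00"
```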

Collaborator Author

We discussed this in the Libraries Guild, weighing the tradeoffs between (1) having time-less dates be processed in the time zone of the server, (2) processing them in UTC, or (3) requiring that the caller specify a time.

We could see arguments for all sides, but I think as a group we leaned toward (2) or (3): if we allow bare dates, process them in UTC; or perhaps just require timestamps and make the user write the date-selection part of the code themselves. That last option might make more sense, since converting a date to a timestamp should happen closer to the UI.

@karawoo / @tdstein / @marcosnav can correct me if I got anything wrong here.

Collaborator Author
@toph-allen toph-allen Jul 15, 2025

I removed special handling from this function and deferred to make_timestamp(), like all the other from and to arguments in connectapi. That's not to say that that function handles things correctly, just that this approach avoids trying to shoehorn too much into this one pull request.

@@ -365,3 +374,86 @@ test_that("get_content only requests vanity URLs for Connect 2024.06.0 and up",
)
})
})

test_that("get_usage() returns usage data in the expected shape", {
Contributor

Could we add a test where we pass in a character that is a formatted timestamp? (if we are intending to support that, of course)

Collaborator Author

Sure!

Collaborator Author

I removed any custom time zone behavior from this pull request — now the code leans on make_timestamp(), similar to other functions in the package. Accordingly, I don't think we need to test this behavior now. :)

R/parse.R Outdated
Comment on lines 135 to 165
fast_unnest_character <- function(df, col_name) {
  if (!is.character(col_name)) {
    stop("col_name must be a character vector")
  }
  if (!col_name %in% names(df)) {
    stop("col_name is not present in df")
  }

  list_col <- df[[col_name]]

  new_cols <- names(list_col[[1]])

  df2 <- df
  for (col in new_cols) {
    df2[[col]] <- vapply(
      list_col,
      function(row) {
        if (is.null(row[[col]])) {
          NA_character_
        } else {
          row[[col]]
        }
      },
      "1",
      USE.NAMES = FALSE
    )
  }

  df2[[col_name]] <- NULL
  df2
}
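A usage sketch of the function above, with an input shaped like the hits endpoint's data list-column (the column values here are illustrative, not real API output):

```r
# Illustrative input: each element of the list-column is a named list
# that may contain NULLs, as with the hits endpoint's `data` field.
usage <- data.frame(content_guid = c("abc", "def"))
usage$data <- list(
  list(path = "/report", user_agent = "curl/8.0"),
  list(path = NULL, user_agent = "Mozilla/5.0")
)

# NULLs become NA_character_, and the list-column is dropped.
fast_unnest_character(usage, "data")
#>   content_guid    path  user_agent
#> 1          abc /report    curl/8.0
#> 2          def    <NA> Mozilla/5.0
```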
Contributor

The data returned from the endpoint includes path and user_agent fields nested under a data field. Without special treatment these are returned as a list-column, which is awkward. I initially experimented with tidyr::unnest(), but that was slow on the larger datasets returned by this endpoint, so I wrote a custom fast_unnest_character() function which runs in about 5% (!) of the time that tidyr::unnest() takes.

Thinking about this a bit more: this isn't a huge chunk of code, of course, but it is another chunk we take on maintaining if we go this route. It's another example of how having all data interchange within connectapi be data frames means we have to worry about the performance of reshaping JSON-parsed list responses into data frames, and about making sure those data frames have a natural structure for folks to use. If we instead used the parsed list data as our interchange and gave folks as.data.frame() methods, we could defer the (sometimes expensive) reshaping until late in the process, and eking out performance gains like this would matter much less, so we could rely on more off-the-shelf tools.

Collaborator Author

I can see what you mean, and this does align with what you've been saying about other server objects.

I hear what you're saying about parsing to data frames for data interchange and I think that approach would be great to use for, say, the integrations endpoints that I just added stories for.

For the data from the hits endpoint, presenting it as anything other than a data frame feels not very R-idiomatic.

The main task, and maybe the only major task, I can imagine for this data is to treat it as a data frame and filter, plot, etc. But another thing you might want to do is get the content item associated with a hit, and in that case you might just want to pass the hit, or hit$content_guid, to content_item().

I still think we might want to keep code like this around in an as.data.frame() method — for making an unnested data frame out of nested data from the Connect API, tidyr::unnest() was 20x slower (which I was surprised by! it seems like a wild differential).

Open to a variety of options — let's discuss what the best approach would be to finalize and merge this PR.

Collaborator Author

Discussed this in the Libraries Guild meeting. Some takeaways:

  • Whether we parse the endpoint's response to a data frame in the get_usage() function or in an as.data.frame() method for that function's response, we want that code to be performant.
  • I could profile tidyr::unnest() to understand why it's slow, and explore options (e.g. its ptype argument) that might speed it up, which would remove the need to maintain this chunk of code.

Collaborator Author
@toph-allen toph-allen Jul 15, 2025

Calling tidyr::unnest_wider(usage, data, ptype = list(path = character(0), user_agent = character(0))) is much more performant — that would be reasonable. But I don't really want to add a tidyr dependency to connectapi, I realize — we've been trying to remove dependencies. I have conflicting feelings on how to proceed here and I'd appreciate anyone's input.

I think these are the approaches open to us, in order of… least to most code in the package:

  • Don't convert to a tibble. I know we want to move in this direction with other connectapi classes, but I do feel like usage data will almost always be addressed in tabular form, so I think it's nicer to convert to a data frame in this function.
  • Don't unnest the data column. This is reasonable, but similar to above, I feel like… it's probably nicer to unnest it, given that structurally it just… contains additional data fields for each hit.
  • Unnest using tidyr::unnest(). I feel like we should discount this one because we don't want to add that as a dependency.
  • Unnest using custom function. The problem with this is that it relies on us adding a custom unnest function to the package, which is more surface area to maintain / test.
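For reference, the unnest_wider() call with a ptype hint mentioned above looks roughly like this. The data is illustrative, and this assumes the tidyr and tibble packages are available:

```r
library(tidyr)

# Illustrative data shaped like the hits endpoint's response.
usage <- tibble::tibble(
  content_guid = c("abc", "def"),
  data = list(
    list(path = "/report", user_agent = "curl/8.0"),
    list(path = NULL, user_agent = "Mozilla/5.0")
  )
)

# Supplying ptype tells unnest_wider() the output column types up front,
# skipping the per-element type guessing that dominates the default call's
# run time on large inputs.
unnest_wider(
  usage,
  data,
  ptype = list(path = character(0), user_agent = character(0))
)
```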

R/connect.R Outdated
#' `Date`, coerced to `YYYY-MM-DDT00:00:00` in the caller's time zone.
#' @param to Optional `Date` or `POSIXt`; end of the time window. If a
#' `Date`, coerced to `YYYY-MM-DDT23:59:59` in the caller's time zone.
inst_content_hits = function(from = NULL, to = NULL) {
Collaborator

I don't think this method name benefits from having the inst_ prefix on it.

More of a philosophical musing, but I also am not sure it's worth packing more things into methods on the Connect R6 object. get_usage() is a pretty thin wrapper around this. We could just put this inside of get_usage(), not sure there's much benefit to separating things like this. I know this is a pattern in parts of the package, but that doesn't mean we have to keep doing it.

Collaborator Author
@toph-allen toph-allen May 19, 2025

Agreed that I don't think the method name at all benefits from inst_ prefix; to the reader, it isn't at all clear what it means.

I would also be completely fine with beginning to move everything into the functions; I agree that we don't get much benefit from having the logic in the R6 object. In fact, I think it adds confusion: it was hard for me to decide where different bits of logic should live, i.e. within the method or the outer function. It's kind of arbitrary.

I guess you could argue that the goal of the methods is to enumerate the available endpoints that operate on different objects, but it feels like a pretty verbose way to do that.

Collaborator Author

I merged all of the implementation from the Connect$inst_content_hits method into the get_usage() function, and deleted the method.

Collaborator
@nealrichardson nealrichardson left a comment

A couple of minor suggestions but otherwise :shipit: 🚀

toph-allen and others added 4 commits July 17, 2025 18:01

Successfully merging this pull request may close these issues.

Support upcoming metrics "firehose" v1 endpoint
3 participants