Performance of streaming requests #704

Open
Aariq opened this issue Mar 13, 2025 · 17 comments · May be fixed by #714

Comments

@Aariq

Aariq commented Mar 13, 2025

While refactoring the rnpn package to use httr2, I've discovered that streaming ndjson with req_perform_connection() and resp_stream_lines() takes more time and significantly more memory than curl + jsonlite::stream_in(), so much so that I'm going to have to revert the change; users are running up against memory limits. I'm not sure whether this is just unavoidable overhead from httr2's extra features or something that can be addressed (or possibly I'm doing something wrong!)

For example, a request that uses ~17MB of RAM with curl + jsonlite::stream_in() uses ~1GB of RAM with httr2.

Full benchmark code:

library(httr2)
library(curl)
#> Using libcurl 8.11.1 with OpenSSL/3.3.2
library(jsonlite)
library(bench)

url <- "https://services.usanpn.org/npn_portal//observations/getSummarizedData.ndjson?"
query <- list(request_src = "benchmarking", climate_data = "0", start_date = "2025-01-01", 
              end_date = "2025-12-31")

bench::mark(
  httr2 = {
    req <- httr2::request(url) %>%
      httr2::req_method("POST") %>%
      httr2::req_body_form(!!!query)
    
    con <- httr2::req_perform_connection(req)
    out_httr2 <- tibble::tibble()
    
    while(!httr2::resp_stream_is_complete(con)) {
      resp <- httr2::resp_stream_lines(con, lines = 5000)
      df <- resp %>% 
        textConnection() %>% 
        jsonlite::stream_in(verbose = FALSE, pagesize = 5000)
      out_httr2 <- dplyr::bind_rows(out_httr2, df)
    }
    close(con)
    out_httr2
  },
  
  curl = {
    query2 <- c(query, customrequest = "POST")
    h <- new_handle() %>% handle_setform(.list = query2)
    
    con <- curl(url, handle = h)
    out_curl <- tibble::tibble()
    
    jsonlite::stream_in(con, function(df) {
      #I know this isn't necessary, but in the real code data wrangling happens
      #in the callback function
      out_curl <<- dplyr::bind_rows(out_curl, df) 
    }, verbose = FALSE, pagesize = 5000)
    out_curl
  }
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 2 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 httr2         14.6s    14.6s    0.0683    1.04GB   1.37  
#> 2 curl          14.6s    14.6s    0.0687   17.79MB   0.0687

Created on 2025-03-13 with reprex v2.1.1

Session info

sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.4.3 (2025-02-28)
#>  os       macOS Sequoia 15.3.2
#>  system   x86_64, darwin20
#>  ui       X11
#>  language (EN)
#>  collate  en_US.UTF-8
#>  ctype    en_US.UTF-8
#>  tz       America/Phoenix
#>  date     2025-03-13
#>  pandoc   3.6.2 @ /usr/local/bin/ (via rmarkdown)
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package     * version date (UTC) lib source
#>  bench       * 1.1.4   2025-01-16 [1] CRAN (R 4.4.1)
#>  cli           3.6.4   2025-02-13 [1] CRAN (R 4.4.1)
#>  curl        * 6.2.1   2025-02-19 [1] CRAN (R 4.4.1)
#>  digest        0.6.37  2024-08-19 [1] CRAN (R 4.4.1)
#>  dplyr         1.1.4   2023-11-17 [1] CRAN (R 4.4.0)
#>  evaluate      1.0.3   2025-01-10 [1] CRAN (R 4.4.1)
#>  fastmap       1.2.0   2024-05-15 [1] CRAN (R 4.4.0)
#>  fs            1.6.5   2024-10-30 [1] CRAN (R 4.4.1)
#>  generics      0.1.3   2022-07-05 [1] CRAN (R 4.4.0)
#>  glue          1.8.0   2024-09-30 [1] CRAN (R 4.4.1)
#>  htmltools     0.5.8.1 2024-04-04 [1] CRAN (R 4.4.0)
#>  httr2       * 1.1.1   2025-03-08 [1] CRAN (R 4.4.1)
#>  jsonlite    * 1.8.9   2024-09-20 [1] CRAN (R 4.4.1)
#>  knitr         1.49    2024-11-08 [1] CRAN (R 4.4.1)
#>  lifecycle     1.0.4   2023-11-07 [1] CRAN (R 4.4.0)
#>  magrittr      2.0.3   2022-03-30 [1] CRAN (R 4.4.0)
#>  pillar        1.10.1  2025-01-07 [1] CRAN (R 4.4.1)
#>  pkgconfig     2.0.3   2019-09-22 [1] CRAN (R 4.4.0)
#>  profmem       0.6.0   2020-12-13 [1] CRAN (R 4.4.0)
#>  R6            2.6.1   2025-02-15 [1] CRAN (R 4.4.1)
#>  rappdirs      0.3.3   2021-01-31 [1] CRAN (R 4.4.0)
#>  reprex        2.1.1   2024-07-06 [1] CRAN (R 4.4.0)
#>  rlang         1.1.5   2025-01-17 [1] CRAN (R 4.4.1)
#>  rmarkdown     2.29    2024-11-04 [1] CRAN (R 4.4.1)
#>  rstudioapi    0.17.1  2024-10-22 [1] CRAN (R 4.4.1)
#>  sessioninfo   1.2.2   2021-12-06 [1] CRAN (R 4.4.0)
#>  tibble        3.2.1   2023-03-20 [1] CRAN (R 4.4.0)
#>  tidyselect    1.2.1   2024-03-11 [1] CRAN (R 4.4.0)
#>  utf8          1.2.4   2023-10-22 [1] CRAN (R 4.4.0)
#>  vctrs         0.6.5   2023-12-01 [1] CRAN (R 4.4.0)
#>  withr         3.0.2   2024-10-28 [1] CRAN (R 4.4.1)
#>  xfun          0.50    2025-01-07 [1] CRAN (R 4.4.1)
#>  yaml          2.3.10  2024-07-26 [1] CRAN (R 4.4.0)
#> 
#>  [1] /Users/ericscott/Library/R/x86_64/4.4/library
#>  [2] /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/library
#> 
#> ──────────────────────────────────────────────────────────────────────────────
@hadley
Member

hadley commented Mar 13, 2025

Can you give me a bit more of a realistic use case? It doesn't seem like you get any benefit from streaming here, since you download and process every single line.

@hadley
Member

hadley commented Mar 13, 2025

Hmmmm, maybe the different meanings of streaming in httr2 and curl are confusing here. I don't think your use case benefits from httr2 streaming.

It is a bit weird that this allocates so much memory though:

library(httr2)
library(curl)
#> Using libcurl 8.11.1 with OpenSSL/3.3.2
library(jsonlite)
library(bench)

url <- "https://services.usanpn.org/npn_portal//observations/getSummarizedData.ndjson?"
query <- list(request_src = "benchmarking", climate_data = "0", start_date = "2025-01-01", 
              end_date = "2025-12-31")

req <- httr2::request(url) %>%
  httr2::req_method("POST") %>%
  httr2::req_body_form(!!!query)

bench::mark(
  httr2 = {
    con <- httr2::req_perform_connection(req)

    while(!httr2::resp_stream_is_complete(con)) {
      resp <- httr2::resp_stream_lines(con, lines = 5000)
    }
    close(con)    
  }
)

@Aariq
Author

Aariq commented Mar 14, 2025

The function I'm refactoring downloads potentially 300,000+ rows, does quite a bit of data wrangling on each chunk, and optionally writes each chunk to a file rather than rowbinding it to an in-memory data frame. Now all but the smallest queries seem to be having issues.

Users are now running into errors that seem to be due to running out of memory with requests that previously worked without writing to a file.

I didn't realize "streaming" had a different meaning here. I'm just hoping to get the ndjson a chunk at a time so I can wrangle it and optionally write it to a file. There might be a better approach though—I think I could only use streaming if an output file is specified and otherwise just read it all in one go, but I'm not sure that would solve the memory issue.
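
For what it's worth, the shape of it is roughly the sketch below (`wrangle` stands in for the real cleaning code, and the file branch uses readr::write_csv just as an example):

fetch_chunked <- function(url, handle, path = NULL, wrangle = identity) {
  chunks <- list()
  con <- curl::curl(url, handle = handle)
  # wrangle each page inside the stream_in() callback, then either append
  # it to a file on disk or collect it for a single bind at the end
  jsonlite::stream_in(con, function(df) {
    df <- wrangle(df)
    if (is.null(path)) {
      chunks[[length(chunks) + 1]] <<- df
    } else {
      # append after the first chunk so the header is only written once
      readr::write_csv(df, path, append = file.exists(path))
    }
  }, verbose = FALSE, pagesize = 5000)

  if (is.null(path)) dplyr::bind_rows(chunks) else invisible(path)
}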

@hadley
Member

hadley commented Mar 14, 2025

@Aariq thanks for the context, I'll take a look when I'm back from vacation. FWIW I'd highly recommend that you don't do iterative rowbinding as this is likely to be slow and cause a lot of memory allocations.
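
Concretely, something like this (a minimal sketch, assuming `con` is an open req_perform_connection() stream):

# Collect the parsed chunks in a list, then bind exactly once; growing a
# data frame with bind_rows() inside the loop copies all previous rows on
# every iteration.
chunks <- list()
while (!httr2::resp_stream_is_complete(con)) {
  lines <- httr2::resp_stream_lines(con, lines = 5000)
  chunks[[length(chunks) + 1]] <- jsonlite::stream_in(
    textConnection(lines),
    verbose = FALSE
  )
}
out <- dplyr::bind_rows(chunks)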

@hadley hadley added this to the v1.1.2 milestone Mar 24, 2025
@hadley
Member

hadley commented Mar 24, 2025

Some more code to help me understand what's going on:

library(httr2)

url <- "https://services.usanpn.org/npn_portal/observations/getSummarizedData.ndjson"
req <- request(url) %>%
  req_body_form(
    request_src = "benchmarking",
    climate_data = "0",
    start_date = "2025-01-01",
    end_date = "2025-12-01",
    state = "TX"
  )
system.time(resp <- req_perform(req))
length(strsplit(resp_body_string(resp), "\n")[[1]])

stream_data <- function(req, lines) {
  con <- req_perform_connection(req)
  on.exit(close(con))

  while(!resp_stream_is_complete(con)) {
    resp <- resp_stream_lines(con, lines = lines)
  }

  invisible()
}
batch_data <- function(req) {
  resp <- req_perform(req)
  resp_body_string(resp)
  invisible()
}

bench::mark(
  stream_data(req, 10),
  stream_data(req, 100),
  stream_data(req, 1000),
  batch_data(req),
  iterations = 1,
  filter_gc = FALSE,
  check = FALSE
)[1:5]
#> # A tibble: 4 × 5
#>   expression                  min   median `itr/sec` mem_alloc
#>   <bch:expr>             <bch:tm> <bch:tm>     <dbl> <bch:byt>
#> 1 stream_data(req, 10)       2.8s     2.8s     0.357      60MB
#> 2 stream_data(req, 100)     2.86s    2.86s     0.349    59.8MB
#> 3 stream_data(req, 1000)    2.82s    2.82s     0.354    61.1MB
#> 4 batch_data(req)           2.56s    2.56s     0.390   423.6KB

So even with a smaller example, I'm seeing a lot more memory churn. The churn doesn't seem to affect the overall speed much, and it appears to be independent of the chunk size.

If I do some memory profiling with profvis:

profvis::profvis(stream_data(req, 100))

All of the allocation seems to be happening in readBin(), which frankly surprises me; I wouldn't have thought it would allocate in R at all.
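
For reference, readBin() does return a newly allocated raw vector on every call, so some churn proportional to the bytes read is expected; a plain local file connection shows that baseline:

# Reading 10 MB in 10 KB chunks allocates ~10 MB across ~1000 calls even
# though nothing is retained; the allocations well beyond this baseline
# are what the profile is pointing at.
tmp <- tempfile()
writeBin(raw(10 * 1024 * 1024), tmp)  # 10 MB of zero bytes
con <- file(tmp, open = "rb")
bench::mark(
  while (length(readBin(con, raw(), 10 * 1024)) > 0) NULL,
  iterations = 1
)[1:5]
close(con)
unlink(tmp)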

@hadley
Member

hadley commented Mar 24, 2025

Ok, if I rewrite this in pure curl, I see the same memory allocation:

library(curl)

stream_data <- function() {
  url <- "https://services.usanpn.org/npn_portal/observations/getSummarizedData.ndjson"
  body_fields <- c(
    request_src = "benchmarking",
    climate_data = "0",
    start_date = "2025-01-01",
    end_date = "2025-12-01",
    state = "TX"
  )
  body <- charToRaw(paste0(paste0(names(body_fields), "=", body_fields), collapse = "&"))

  h <- new_handle()
  handle_setopt(h, post = TRUE, postfieldsize = length(body), postfields = body)
  
  con <- curl(url, handle = h)
  open(con, "rbf", blocking = FALSE)
  on.exit(close(con))
  
  while(isIncomplete(con)) {
    readBin(con, raw(), 10 * 1024)
  }
  
  invisible()
}

bench::mark(stream_data(), iterations = 1, filter_gc = FALSE)[1:5]
#> # A tibble: 1 × 5
#>   expression         min   median `itr/sec` mem_alloc
#>   <bch:expr>    <bch:tm> <bch:tm>     <dbl> <bch:byt>
#> 1 stream_data()    2.82s    2.82s     0.354     227MB

@hadley
Member

hadley commented Mar 24, 2025

I've forwarded this to @jeroen to take a look at, and since it doesn't appear to be a httr2 issue, I'm going to remove this from the milestone.

@hadley hadley removed this from the v1.1.2 milestone Mar 24, 2025
@hadley
Member

hadley commented Mar 26, 2025

And closing here since it's now tracked in curl.

@hadley hadley closed this as completed Mar 26, 2025
@jeroen
Member

jeroen commented Apr 1, 2025

@Aariq this should be fixed in curl 6.2.3. You can install the dev version from r-universe:

install.packages("curl", repos = "https://jeroen.r-universe.dev")

@hadley
Member

hadley commented Apr 1, 2025

This is no longer a problem in curl, but it looks like we still have some work to do in httr2:

library(httr2)

url <- "https://services.usanpn.org/npn_portal/observations/getSummarizedData.ndjson"
req <- request(url) %>%
  req_body_form(
    request_src = "benchmarking",
    climate_data = "0",
    start_date = "2025-01-01",
    end_date = "2025-12-01",
    state = "TX"
  )
# system.time(resp <- req_perform(req))
# length(strsplit(resp_body_string(resp), "\n")[[1]])

stream_data <- function(req, lines = 100) {
  con <- req_perform_connection(req)
  on.exit(close(con))

  while(!resp_stream_is_complete(con)) {
    resp <- resp_stream_lines(con, lines = lines)
  }

  invisible()
}

stream_data_raw <- function(req) {
  con <- req_perform_connection(req)
  on.exit(close(con))

  while(!resp_stream_is_complete(con)) {
    resp <- resp_stream_raw(con, kb = 1)
  }

  invisible()
}

bench::mark(
  stream_data(req),
  stream_data_raw(req),
  iterations = 1,
  filter_gc = FALSE,
  check = FALSE
)[1:5]
#> # A tibble: 2 × 5
#>   expression                min   median `itr/sec` mem_alloc
#>   <bch:expr>           <bch:tm> <bch:tm>     <dbl> <bch:byt>
#> 1 stream_data(req)        5.17s    5.17s     0.194     139MB
#> 2 stream_data_raw(req)    5.51s    5.51s     0.181     975KB

Created on 2025-04-01 with reprex v2.1.1
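
In the meantime, pulling raw chunks and splitting lines yourself sidesteps most of the churn. A minimal workaround sketch (the `callback` argument is hypothetical):

stream_lines_via_raw <- function(req, callback, kb = 64) {
  con <- req_perform_connection(req)
  on.exit(close(con))

  leftover <- ""
  while (!resp_stream_is_complete(con)) {
    chunk <- resp_stream_raw(con, kb = kb)
    if (length(chunk) == 0) next
    text <- paste0(leftover, rawToChar(chunk))
    parts <- strsplit(text, "\n", fixed = TRUE)[[1]]
    if (endsWith(text, "\n")) {
      leftover <- ""                      # chunk ended on a line boundary
    } else {
      leftover <- parts[length(parts)]    # carry the partial line forward
      parts <- parts[-length(parts)]
    }
    if (length(parts) > 0) callback(parts)
  }
  if (nzchar(leftover)) callback(leftover)
  invisible()
}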

@hadley hadley reopened this Apr 1, 2025
@hadley
Member

hadley commented Apr 1, 2025

Fixing the memory allocations is going to require a couple of hours of work. First I'll need to create a ring buffer implementation so that we can retrieve and use raw bytes from the connection without allocating memory. Then I'll need to rewrite the event-boundary functions to work with a callback on the ring buffer.
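
Roughly this shape; a structural sketch with hypothetical names, not the final implementation:

ring_buffer <- function(size) {
  # One raw buffer allocated up front; bytes from the connection are
  # written at the head, and boundary scans consume from the tail.
  # NB: a real version must also ensure the subassignment below happens
  # in place; R will copy `buf` whenever its reference count is above one.
  size <- as.integer(size)
  buf <- raw(size)
  head <- 0L   # next write position (0-based)
  count <- 0L  # bytes currently stored

  push <- function(bytes) {
    n <- length(bytes)
    stopifnot(count + n <= size)
    idx <- ((head + seq_len(n) - 1L) %% size) + 1L
    buf[idx] <<- bytes
    head <<- (head + n) %% size
    count <<- count + n
  }
  pop <- function(n) {
    n <- min(n, count)
    tail <- (head - count + size) %% size
    idx <- ((tail + seq_len(n) - 1L) %% size) + 1L
    out <- buf[idx]
    count <<- count - n
    out
  }
  list(push = push, pop = pop, bytes = function() count)
}

rb <- ring_buffer(16)
rb$push(charToRaw("hello\nworld"))
rawToChar(rb$pop(6))
#> [1] "hello\n"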

@hadley
Member

hadley commented Apr 1, 2025

With ~4 hours of work:

  expression                min   median `itr/sec` mem_alloc
  <bch:expr>           <bch:tm> <bch:tm>     <dbl> <bch:byt>
1 stream_data(req)        6.83s    6.83s     0.147   43.03MB
2 stream_data_raw(req)    5.38s    5.38s     0.186    1.28MB

I thought it would be a bigger difference 😬. It's also way slower than the previous code 😞.

hadley added a commit that referenced this issue Apr 1, 2025
Fixes #704

Initial benchmark indicates that there's more work to be done:

```
  expression                min   median `itr/sec` mem_alloc
  <bch:expr>           <bch:tm> <bch:tm>     <dbl> <bch:byt>
1 stream_data(req)        6.83s    6.83s     0.147   43.03MB
2 stream_data_raw(req)    5.38s    5.38s     0.186    1.28MB
```
@hadley hadley linked a pull request Apr 1, 2025 that will close this issue
@Seb-FS-Axpo

Hi. Any update on this one, maybe?

@jeroen
Member

jeroen commented Apr 22, 2025

@Seb-FS-Axpo did you try with the new curl, as suggested in #704 (comment)?

@Seb-FS-Axpo

Hi @jeroen, thanks for the quick feedback.
I was hoping to keep using the httr2 implementation, and was waiting for more feedback based on
#704 (comment)

@jeroen
Member

jeroen commented Apr 22, 2025

@Seb-FS-Axpo httr2 is based on curl. If you upgrade curl, the problem will be fixed in httr2 too.

@hadley
Member

hadley commented Apr 22, 2025

@jeroen I think there's still work to do in httr2, since we also do buffering that seems to be creating a bunch of copies.
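
For illustration (not the actual httr2 internals), this is the failure mode naive buffering runs into:

# Growing a buffer by concatenation: each c(buf, chunk) allocates a new
# vector and copies everything accumulated so far, so k chunks of size s
# copy roughly s * k^2 / 2 bytes in total (~500 MB here for 1 MB of data).
buf <- raw(0)
for (i in 1:1000) {
  chunk <- raw(1024)
  buf <- c(buf, chunk)  # reallocates and copies every iteration
}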
