Performance of streaming requests #704

Open
Aariq opened this issue Mar 13, 2025 · 17 comments · May be fixed by #714

Comments

@Aariq

Aariq commented Mar 13, 2025

While refactoring the rnpn package to use httr2, I've discovered that streaming ndjson with req_perform_connection() and resp_stream_lines() takes more time and significantly more memory than curl + jsonlite::stream_in(), so much so that I'm going to have to revert the change; users are running up against memory limits. I'm not sure whether this is just unavoidable overhead from httr2's extra features or something that can be addressed (or possibly I'm doing something wrong!)

For example, a request that uses ~17MB of RAM with curl + jsonlite::stream_in() uses ~1GB of RAM with httr2.

Full benchmark code:

library(httr2)
library(curl)
#> Using libcurl 8.11.1 with OpenSSL/3.3.2
library(jsonlite)
library(bench)

url <- "https://services.usanpn.org/npn_portal//observations/getSummarizedData.ndjson?"
query <- list(request_src = "benchmarking", climate_data = "0", start_date = "2025-01-01", 
              end_date = "2025-12-31")

bench::mark(
  httr2 = {
    req <- httr2::request(url) %>%
      httr2::req_method("POST") %>%
      httr2::req_body_form(!!!query)
    
    con <- httr2::req_perform_connection(req)
    out_httr2 <- tibble::tibble()
    
    while(!httr2::resp_stream_is_complete(con)) {
      resp <- httr2::resp_stream_lines(con, lines = 5000)
      df <- resp %>% 
        textConnection() %>% 
        jsonlite::stream_in(verbose = FALSE, pagesize = 5000)
      out_httr2 <- dplyr::bind_rows(out_httr2, df)
    }
    close(con)
    out_httr2
  },
  
  curl = {
    query2 <- c(query, customrequest = "POST")
    h <- new_handle() %>% handle_setform(.list = query2)
    
    con <- curl(url, handle = h)
    out_curl <- tibble::tibble()
    
    jsonlite::stream_in(con, function(df) {
      #I know this isn't necessary, but in the real code data wrangling happens
      #in the callback function
      out_curl <<- dplyr::bind_rows(out_curl, df) 
    }, verbose = FALSE, pagesize = 5000)
    out_curl
  }
)
#> Warning: Some expressions had a GC in every iteration; so filtering is
#> disabled.
#> # A tibble: 2 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 httr2         14.6s    14.6s    0.0683    1.04GB   1.37  
#> 2 curl          14.6s    14.6s    0.0687   17.79MB   0.0687

Created on 2025-03-13 with reprex v2.1.1

Session info

sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.4.3 (2025-02-28)
#>  os       macOS Sequoia 15.3.2
#>  system   x86_64, darwin20
#>  ui       X11
#>  language (EN)
#>  collate  en_US.UTF-8
#>  ctype    en_US.UTF-8
#>  tz       America/Phoenix
#>  date     2025-03-13
#>  pandoc   3.6.2 @ /usr/local/bin/ (via rmarkdown)
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package     * version date (UTC) lib source
#>  bench       * 1.1.4   2025-01-16 [1] CRAN (R 4.4.1)
#>  cli           3.6.4   2025-02-13 [1] CRAN (R 4.4.1)
#>  curl        * 6.2.1   2025-02-19 [1] CRAN (R 4.4.1)
#>  digest        0.6.37  2024-08-19 [1] CRAN (R 4.4.1)
#>  dplyr         1.1.4   2023-11-17 [1] CRAN (R 4.4.0)
#>  evaluate      1.0.3   2025-01-10 [1] CRAN (R 4.4.1)
#>  fastmap       1.2.0   2024-05-15 [1] CRAN (R 4.4.0)
#>  fs            1.6.5   2024-10-30 [1] CRAN (R 4.4.1)
#>  generics      0.1.3   2022-07-05 [1] CRAN (R 4.4.0)
#>  glue          1.8.0   2024-09-30 [1] CRAN (R 4.4.1)
#>  htmltools     0.5.8.1 2024-04-04 [1] CRAN (R 4.4.0)
#>  httr2       * 1.1.1   2025-03-08 [1] CRAN (R 4.4.1)
#>  jsonlite    * 1.8.9   2024-09-20 [1] CRAN (R 4.4.1)
#>  knitr         1.49    2024-11-08 [1] CRAN (R 4.4.1)
#>  lifecycle     1.0.4   2023-11-07 [1] CRAN (R 4.4.0)
#>  magrittr      2.0.3   2022-03-30 [1] CRAN (R 4.4.0)
#>  pillar        1.10.1  2025-01-07 [1] CRAN (R 4.4.1)
#>  pkgconfig     2.0.3   2019-09-22 [1] CRAN (R 4.4.0)
#>  profmem       0.6.0   2020-12-13 [1] CRAN (R 4.4.0)
#>  R6            2.6.1   2025-02-15 [1] CRAN (R 4.4.1)
#>  rappdirs      0.3.3   2021-01-31 [1] CRAN (R 4.4.0)
#>  reprex        2.1.1   2024-07-06 [1] CRAN (R 4.4.0)
#>  rlang         1.1.5   2025-01-17 [1] CRAN (R 4.4.1)
#>  rmarkdown     2.29    2024-11-04 [1] CRAN (R 4.4.1)
#>  rstudioapi    0.17.1  2024-10-22 [1] CRAN (R 4.4.1)
#>  sessioninfo   1.2.2   2021-12-06 [1] CRAN (R 4.4.0)
#>  tibble        3.2.1   2023-03-20 [1] CRAN (R 4.4.0)
#>  tidyselect    1.2.1   2024-03-11 [1] CRAN (R 4.4.0)
#>  utf8          1.2.4   2023-10-22 [1] CRAN (R 4.4.0)
#>  vctrs         0.6.5   2023-12-01 [1] CRAN (R 4.4.0)
#>  withr         3.0.2   2024-10-28 [1] CRAN (R 4.4.1)
#>  xfun          0.50    2025-01-07 [1] CRAN (R 4.4.1)
#>  yaml          2.3.10  2024-07-26 [1] CRAN (R 4.4.0)
#> 
#>  [1] /Users/ericscott/Library/R/x86_64/4.4/library
#>  [2] /Library/Frameworks/R.framework/Versions/4.4-x86_64/Resources/library
#> 
#> ──────────────────────────────────────────────────────────────────────────────
@hadley
Member

hadley commented Mar 13, 2025

Can you give me a bit more of a realistic use case? It doesn't seem like you get any benefit from streaming here, since you download and process every single line.

@hadley
Member

hadley commented Mar 13, 2025

Hmmmm, maybe the different meanings of streaming in httr2 and curl are confusing here. I don't think your use case benefits from httr2 streaming.

It is a bit weird that this allocates so much memory though:

library(httr2)
library(curl)
#> Using libcurl 8.11.1 with OpenSSL/3.3.2
library(jsonlite)
library(bench)

url <- "https://services.usanpn.org/npn_portal//observations/getSummarizedData.ndjson?"
query <- list(request_src = "benchmarking", climate_data = "0", start_date = "2025-01-01", 
              end_date = "2025-12-31")

req <- httr2::request(url) %>%
  httr2::req_method("POST") %>%
  httr2::req_body_form(!!!query)

bench::mark(
  httr2 = {
    con <- httr2::req_perform_connection(req)

    while(!httr2::resp_stream_is_complete(con)) {
      resp <- httr2::resp_stream_lines(con, lines = 5000)
    }
    close(con)    
  }
)

@Aariq
Author

Aariq commented Mar 14, 2025

The function I'm refactoring downloads potentially 300,000+ rows, does quite a bit of data wrangling on each chunk, and optionally writes each chunk to a file rather than rowbinding it to an in-memory data frame. Now all but the smallest queries seem to be having issues.

Users are now running into errors that seem to be due to running out of memory with requests that previously worked without writing to a file.

I didn't realize "streaming" had a different meaning here. I'm just hoping to get the ndjson a chunk at a time so I can wrangle it and optionally write it to a file. There might be a better approach though—I think I could only use streaming if an output file is specified and otherwise just read it all in one go, but I'm not sure that would solve the memory issue.
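
For what it's worth, the shape of it is roughly the sketch below (`wrangle` stands in for the real cleaning code, and the file branch uses readr::write_csv just as an example):

fetch_chunked <- function(url, handle, path = NULL, wrangle = identity) {
  chunks <- list()
  con <- curl::curl(url, handle = handle)
  # wrangle each page inside the stream_in() callback, then either append
  # it to a file on disk or collect it for a single bind at the end
  jsonlite::stream_in(con, function(df) {
    df <- wrangle(df)
    if (is.null(path)) {
      chunks[[length(chunks) + 1]] <<- df
    } else {
      # append after the first chunk so the header is only written once
      readr::write_csv(df, path, append = file.exists(path))
    }
  }, verbose = FALSE, pagesize = 5000)

  if (is.null(path)) dplyr::bind_rows(chunks) else invisible(path)
}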

@hadley
Member

hadley commented Mar 14, 2025

@Aariq thanks for the context, I'll take a look when I'm back from vacation. FWIW I'd highly recommend that you don't do iterative rowbinding as this is likely to be slow and cause a lot of memory allocations.
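
Concretely, something like this (a minimal sketch, assuming `con` is an open req_perform_connection() stream):

# Collect the parsed chunks in a list, then bind exactly once; growing a
# data frame with bind_rows() inside the loop copies all previous rows on
# every iteration.
chunks <- list()
while (!httr2::resp_stream_is_complete(con)) {
  lines <- httr2::resp_stream_lines(con, lines = 5000)
  chunks[[length(chunks) + 1]] <- jsonlite::stream_in(
    textConnection(lines),
    verbose = FALSE
  )
}
out <- dplyr::bind_rows(chunks)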

@hadley hadley added this to the v1.1.2 milestone Mar 24, 2025
@hadley
Member

hadley commented Mar 24, 2025

Some more code to help me understand what's going on:

library(httr2)

url <- "https://services.usanpn.org/npn_portal/observations/getSummarizedData.ndjson"
req <- request(url) %>%
  req_body_form(
    request_src = "benchmarking",
    climate_data = "0",
    start_date = "2025-01-01",
    end_date = "2025-12-01",
    state = "TX"
  )
system.time(resp <- req_perform(req))
length(strsplit(resp_body_string(resp), "\n")[[1]])

stream_data <- function(req, lines) {
  con <- req_perform_connection(req)
  on.exit(close(con))

  while(!resp_stream_is_complete(con)) {
    resp <- resp_stream_lines(con, lines = lines)
  }

  invisible()
}
batch_data <- function(req) {
  resp <- req_perform(req)
  resp_body_string(resp)
  invisible()
}

bench::mark(
  stream_data(req, 10),
  stream_data(req, 100),
  stream_data(req, 1000),
  batch_data(req),
  iterations = 1,
  filter_gc = FALSE,
  check = FALSE
)[1:5]
#> # A tibble: 4 × 5
#>   expression                  min   median `itr/sec` mem_alloc
#>   <bch:expr>             <bch:tm> <bch:tm>     <dbl> <bch:byt>
#> 1 stream_data(req, 10)       2.8s     2.8s     0.357      60MB
#> 2 stream_data(req, 100)     2.86s    2.86s     0.349    59.8MB
#> 3 stream_data(req, 1000)    2.82s    2.82s     0.354    61.1MB
#> 4 batch_data(req)           2.56s    2.56s     0.390   423.6KB

So even with a smaller example, I'm seeing a lot more memory churn. The churn doesn't seem to affect the overall speed much, and it appears to be independent of the chunk size.

If I do some memory profiling with profvis:

profvis::profvis(stream_data(req, 100))

All of the allocation seems to be happening in readBin(), which frankly surprises me; I wouldn't have thought it would allocate in R at all.
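
For reference, readBin() does return a newly allocated raw vector on every call, so some churn proportional to the bytes read is expected; a plain local file connection shows that baseline:

# Reading 10 MB in 10 KB chunks allocates ~10 MB across ~1000 calls even
# though nothing is retained; the allocations well beyond this baseline
# are what the profile is pointing at.
tmp <- tempfile()
writeBin(raw(10 * 1024 * 1024), tmp)  # 10 MB of zero bytes
con <- file(tmp, open = "rb")
bench::mark(
  while (length(readBin(con, raw(), 10 * 1024)) > 0) NULL,
  iterations = 1
)[1:5]
close(con)
unlink(tmp)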

@hadley
Member

hadley commented Mar 24, 2025

Ok, if I rewrite this in pure curl, I see the same memory allocation:

library(curl)

stream_data <- function() {
  url <- "https://services.usanpn.org/npn_portal/observations/getSummarizedData.ndjson"
  body_fields <- c(
    request_src = "benchmarking",
    climate_data = "0",
    start_date = "2025-01-01",
    end_date = "2025-12-01",
    state = "TX"
  )
  body <- charToRaw(paste0(paste0(names(body_fields), "=", body_fields), collapse = "&"))

  h <- new_handle()
  handle_setopt(h, post = TRUE, postfieldsize = length(body), postfields = body)
  
  con <- curl(url, handle = h)
  open(con, "rbf", blocking = FALSE)
  on.exit(close(con))
  
  while(isIncomplete(con)) {
    readBin(con, raw(), 10 * 1024)
  }
  
  invisible()
}

bench::mark(stream_data(), iterations = 1, filter_gc = FALSE)[1:5]
#> # A tibble: 1 × 5
#>   expression         min   median `itr/sec` mem_alloc
#>   <bch:expr>    <bch:tm> <bch:tm>     <dbl> <bch:byt>
#> 1 stream_data()    2.82s    2.82s     0.354     227MB

@hadley
Member

hadley commented Mar 24, 2025

I've forwarded this to @jeroen to take a look at, and since it doesn't appear to be a httr2 issue, I'm going to remove this from the milestone.

@hadley hadley removed this from the v1.1.2 milestone Mar 24, 2025
@hadley
Member

hadley commented Mar 26, 2025

And closing here since it's now tracked in curl.

@hadley hadley closed this as completed Mar 26, 2025
@jeroen
Member

jeroen commented Apr 1, 2025

@Aariq this should be fixed in curl 6.2.3. You can install the dev version from r-universe:

install.packages("curl", repos = "https://jeroen.r-universe.dev")

@hadley
Member

hadley commented Apr 1, 2025

This is no longer a problem in curl, but it looks like we still have some work to do in httr2:

library(httr2)

url <- "https://services.usanpn.org/npn_portal/observations/getSummarizedData.ndjson"
req <- request(url) %>%
  req_body_form(
    request_src = "benchmarking",
    climate_data = "0",
    start_date = "2025-01-01",
    end_date = "2025-12-01",
    state = "TX"
  )
# system.time(resp <- req_perform(req))
# length(strsplit(resp_body_string(resp), "\n")[[1]])

stream_data <- function(req, lines = 100) {
  con <- req_perform_connection(req)
  on.exit(close(con))

  while(!resp_stream_is_complete(con)) {
    resp <- resp_stream_lines(con, lines = lines)
  }

  invisible()
}

stream_data_raw <- function(req) {
  con <- req_perform_connection(req)
  on.exit(close(con))

  while(!resp_stream_is_complete(con)) {
    resp <- resp_stream_raw(con, kb = 1)
  }

  invisible()
}

bench::mark(
  stream_data(req),
  stream_data_raw(req),
  iterations = 1,
  filter_gc = FALSE,
  check = FALSE
)[1:5]
#> # A tibble: 2 × 5
#>   expression                min   median `itr/sec` mem_alloc
#>   <bch:expr>           <bch:tm> <bch:tm>     <dbl> <bch:byt>
#> 1 stream_data(req)        5.17s    5.17s     0.194     139MB
#> 2 stream_data_raw(req)    5.51s    5.51s     0.181     975KB

Created on 2025-04-01 with reprex v2.1.1
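
In the meantime, pulling raw chunks and splitting lines yourself sidesteps most of the churn. A minimal workaround sketch (the `callback` argument is hypothetical):

stream_lines_via_raw <- function(req, callback, kb = 64) {
  con <- req_perform_connection(req)
  on.exit(close(con))

  leftover <- ""
  while (!resp_stream_is_complete(con)) {
    chunk <- resp_stream_raw(con, kb = kb)
    if (length(chunk) == 0) next
    text <- paste0(leftover, rawToChar(chunk))
    parts <- strsplit(text, "\n", fixed = TRUE)[[1]]
    if (endsWith(text, "\n")) {
      leftover <- ""                      # chunk ended on a line boundary
    } else {
      leftover <- parts[length(parts)]    # carry the partial line forward
      parts <- parts[-length(parts)]
    }
    if (length(parts) > 0) callback(parts)
  }
  if (nzchar(leftover)) callback(leftover)
  invisible()
}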

@hadley hadley reopened this Apr 1, 2025
@hadley
Member

hadley commented Apr 1, 2025

Fixing the memory allocations is going to require a couple of hours of work. First I'll need to create a ring buffer implementation so that we can retrieve and use raw bytes from the connection without allocating memory. Then I'll need to rewrite the event-boundary functions to work with a callback on the ring buffer.
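
Roughly this shape; a structural sketch with hypothetical names, not the final implementation:

ring_buffer <- function(size) {
  # One raw buffer allocated up front; bytes from the connection are
  # written at the head, and boundary scans consume from the tail.
  # NB: a real version must also ensure the subassignment below happens
  # in place; R will copy `buf` whenever its reference count is above one.
  size <- as.integer(size)
  buf <- raw(size)
  head <- 0L   # next write position (0-based)
  count <- 0L  # bytes currently stored

  push <- function(bytes) {
    n <- length(bytes)
    stopifnot(count + n <= size)
    idx <- ((head + seq_len(n) - 1L) %% size) + 1L
    buf[idx] <<- bytes
    head <<- (head + n) %% size
    count <<- count + n
  }
  pop <- function(n) {
    n <- min(n, count)
    tail <- (head - count + size) %% size
    idx <- ((tail + seq_len(n) - 1L) %% size) + 1L
    out <- buf[idx]
    count <<- count - n
    out
  }
  list(push = push, pop = pop, bytes = function() count)
}

rb <- ring_buffer(16)
rb$push(charToRaw("hello\nworld"))
rawToChar(rb$pop(6))
#> [1] "hello\n"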

@hadley
Member

hadley commented Apr 1, 2025

With ~4 hours of work:

  expression                min   median `itr/sec` mem_alloc
  <bch:expr>           <bch:tm> <bch:tm>     <dbl> <bch:byt>
1 stream_data(req)        6.83s    6.83s     0.147   43.03MB
2 stream_data_raw(req)    5.38s    5.38s     0.186    1.28MB

I thought it would be a bigger difference 😬. It's also way slower than the previous code 😞.

hadley added a commit that referenced this issue Apr 1, 2025
Fixes #704

Initial benchmark indicates that there's more work to be done:

```
  expression                min   median `itr/sec` mem_alloc
  <bch:expr>           <bch:tm> <bch:tm>     <dbl> <bch:byt>
1 stream_data(req)        6.83s    6.83s     0.147   43.03MB
2 stream_data_raw(req)    5.38s    5.38s     0.186    1.28MB
```
@hadley hadley linked a pull request Apr 1, 2025 that will close this issue
@Seb-FS-Axpo

Hi. Any update on this one, maybe?

@jeroen
Member

jeroen commented Apr 22, 2025

@Seb-FS-Axpo did you try with the new curl, as suggested in #704 (comment)?

@Seb-FS-Axpo

Hi @jeroen, thanks for the quick feedback.
I was hoping to keep using the httr2 implementation, and was waiting for more feedback based on
#704 (comment)

@jeroen
Member

jeroen commented Apr 22, 2025

@Seb-FS-Axpo httr2 is based on curl. If you upgrade curl, the problem will be fixed in httr2 too.

@hadley
Member

hadley commented Apr 22, 2025

@jeroen I think there's still work to do in httr2, since we also do buffering that seems to be creating a bunch of copies.
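
For illustration (not the actual httr2 internals), this is the failure mode naive buffering runs into:

# Growing a buffer by concatenation: each c(buf, chunk) allocates a new
# vector and copies everything accumulated so far, so k chunks of size s
# copy roughly s * k^2 / 2 bytes in total (~500 MB here for 1 MB of data).
buf <- raw(0)
for (i in 1:1000) {
  chunk <- raw(1024)
  buf <- c(buf, chunk)  # reallocates and copies every iteration
}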
