Performance of streaming requests #704
Can you give me a bit more of a realistic use case? It doesn't seem like you get any benefit from streaming here, since you download and process every single line.
Hmmmm, maybe the different meanings of "streaming" in httr2 and curl are confusing here. I don't think your use case benefits from httr2 streaming. It is a bit weird that this allocates so much memory, though:

```r
library(httr2)
library(curl)
#> Using libcurl 8.11.1 with OpenSSL/3.3.2
library(jsonlite)
library(bench)

url <- "https://services.usanpn.org/npn_portal//observations/getSummarizedData.ndjson?"
query <- list(
  request_src = "benchmarking",
  climate_data = "0",
  start_date = "2025-01-01",
  end_date = "2025-12-31"
)

req <- httr2::request(url) %>%
  httr2::req_method("POST") %>%
  httr2::req_body_form(!!!query)

bench::mark(
  httr2 = {
    con <- httr2::req_perform_connection(req)
    while (!httr2::resp_stream_is_complete(con)) {
      resp <- httr2::resp_stream_lines(con, lines = 5000)
    }
    close(con)
  }
)
```
The function I'm refactoring downloads potentially 300,000+ rows, does quite a bit of data wrangling on each chunk, and optionally writes each chunk to a file rather than rowbinding into an in-memory data frame. Now all but the smallest queries seem to be having issues: users are running into errors that appear to be caused by running out of memory, for requests that previously worked without writing to a file. I didn't realize "streaming" had a different meaning here; I'm just hoping to get the ndjson a chunk at a time so I can wrangle it and optionally write it to a file. There might be a better approach, though: I could use streaming only when an output file is specified and otherwise read everything in one go, but I'm not sure that would solve the memory issue.
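For concreteness, here is a minimal sketch of the chunk-at-a-time pattern described above; `process_stream()`, `wrangle_chunk`, and the CSV sink are illustrative placeholders, not rnpn's actual code:

```r
# A sketch only: wrangle_chunk stands in for the real per-chunk wrangling,
# and the CSV output format is just for illustration.
library(httr2)
library(jsonlite)

process_stream <- function(req, path = NULL, lines = 5000,
                           wrangle_chunk = identity) {
  con <- req_perform_connection(req)
  on.exit(close(con))

  chunks <- list()
  while (!resp_stream_is_complete(con)) {
    raw_lines <- resp_stream_lines(con, lines = lines)
    if (length(raw_lines) == 0) next

    # parse this batch of ndjson lines, then wrangle it
    tc <- textConnection(raw_lines)
    df <- wrangle_chunk(stream_in(tc, verbose = FALSE))
    close(tc)

    if (is.null(path)) {
      # keep in memory: collect chunks in a list, bind once at the end
      chunks[[length(chunks) + 1]] <- df
    } else {
      # or append to disk so the full result never has to live in memory
      write.table(df, path, sep = ",", row.names = FALSE,
                  col.names = !file.exists(path), append = file.exists(path))
    }
  }

  if (is.null(path)) do.call(rbind, chunks) else invisible(path)
}
```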
@Aariq thanks for the context, I'll take a look when I'm back from vacation. FWIW I'd highly recommend that you don't do iterative rowbinding, as this is likely to be slow and cause a lot of memory allocations.
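To illustrate why (synthetic data, not from the original thread): growing a data frame with `rbind()` inside a loop copies everything accumulated so far on each iteration, whereas collecting chunks in a list and binding once allocates the final result a single time.

```r
# Illustrative comparison of iterative rbind() vs. a single bind at the end.
chunk <- data.frame(x = 1:1000, y = runif(1000))

bench::mark(
  iterative_rbind = {
    out <- chunk[0, ]
    for (i in 1:200) out <- rbind(out, chunk)  # copies `out` every pass
    nrow(out)
  },
  bind_once = {
    chunks <- lapply(1:200, function(i) chunk)
    nrow(do.call(rbind, chunks))               # one final bind
  }
)
```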
Some more code to help me understand what's going on:

```r
library(httr2)

url <- "https://services.usanpn.org/npn_portal/observations/getSummarizedData.ndjson"
req <- request(url) %>%
  req_body_form(
    request_src = "benchmarking",
    climate_data = "0",
    start_date = "2025-01-01",
    end_date = "2025-12-01",
    state = "TX"
  )

system.time(resp <- req_perform(req))
length(strsplit(resp_body_string(resp), "\n")[[1]])

stream_data <- function(req, lines) {
  con <- req_perform_connection(req)
  on.exit(close(con))
  while (!resp_stream_is_complete(con)) {
    resp <- resp_stream_lines(con, lines = lines)
  }
  invisible()
}

batch_data <- function(req) {
  resp <- req_perform(req)
  resp_body_string(resp)
  invisible()
}

bench::mark(
  stream_data(req, 10),
  stream_data(req, 100),
  stream_data(req, 1000),
  batch_data(req),
  iterations = 1,
  filter_gc = FALSE,
  check = FALSE
)[1:5]
#> # A tibble: 4 × 5
#>   expression                  min   median `itr/sec` mem_alloc
#>   <bch:expr>             <bch:tm> <bch:tm>     <dbl> <bch:byt>
#> 1 stream_data(req, 10)       2.8s     2.8s     0.357      60MB
#> 2 stream_data(req, 100)     2.86s    2.86s     0.349    59.8MB
#> 3 stream_data(req, 1000)    2.82s    2.82s     0.354    61.1MB
#> 4 batch_data(req)           2.56s    2.56s     0.390   423.6KB
```

So even with a smaller example, there is a lot more memory churn going on. The churn doesn't seem to affect the overall speed that much and seems independent of the chunk size. If I do some memory profiling with profvis:

```r
profvis::profvis(stream_data(req, 100))
```

All of the allocation seems to be happening in
Ok, if I rewrite this in pure curl, I see the same memory allocation:

```r
library(curl)

stream_data <- function() {
  url <- "https://services.usanpn.org/npn_portal/observations/getSummarizedData.ndjson"
  body_fields <- c(
    request_src = "benchmarking",
    climate_data = "0",
    start_date = "2025-01-01",
    end_date = "2025-12-01",
    state = "TX"
  )
  body <- charToRaw(paste0(paste0(names(body_fields), "=", body_fields), collapse = "&"))

  h <- new_handle()
  handle_setopt(h, post = TRUE, postfieldsize = length(body), postfields = body)

  con <- curl(url, handle = h)
  open(con, "rbf", blocking = FALSE)
  on.exit(close(con))
  while (isIncomplete(con)) {
    readBin(con, raw(), 10 * 1024)
  }
  invisible()
}

bench::mark(stream_data(), iterations = 1, filter_gc = FALSE)[1:5]
#> # A tibble: 1 × 5
#>   expression         min   median `itr/sec` mem_alloc
#>   <bch:expr>    <bch:tm> <bch:tm>     <dbl> <bch:byt>
#> 1 stream_data()    2.82s    2.82s     0.354     227MB
```
I've forwarded this to @jeroen to take a look at, and since it doesn't appear to be an httr2 issue, I'm going to remove this from the milestone.
And closing here since it's now tracked in curl.
@Aariq this should be fixed in curl 6.2.3. You can install the dev version from r-universe: `install.packages("curl", repos = "https://jeroen.r-universe.dev")`
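As a quick check after reinstalling (not part of the original comment), the installed R package version and the bundled libcurl version can be confirmed with:

```r
packageVersion("curl")        # R package version (should be >= 6.2.3 here)
curl::curl_version()$version  # libcurl version in use
```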
This is no longer a problem in curl, but it looks like we still have some work to do in httr2:

```r
library(httr2)

url <- "https://services.usanpn.org/npn_portal/observations/getSummarizedData.ndjson"
req <- request(url) %>%
  req_body_form(
    request_src = "benchmarking",
    climate_data = "0",
    start_date = "2025-01-01",
    end_date = "2025-12-01",
    state = "TX"
  )

# system.time(resp <- req_perform(req))
# length(strsplit(resp_body_string(resp), "\n")[[1]])

stream_data <- function(req) {
  con <- req_perform_connection(req)
  on.exit(close(con))
  while (!resp_stream_is_complete(con)) {
    resp <- resp_stream_lines(con, lines = 100)
  }
  invisible()
}

stream_data_raw <- function(req) {
  con <- req_perform_connection(req)
  on.exit(close(con))
  while (!resp_stream_is_complete(con)) {
    resp <- resp_stream_raw(con, kb = 1)
  }
  invisible()
}

bench::mark(
  stream_data(req),
  stream_data_raw(req),
  iterations = 1,
  filter_gc = FALSE,
  check = FALSE
)[1:5]
#> # A tibble: 2 × 5
#>   expression               min   median `itr/sec` mem_alloc
#>   <bch:expr>          <bch:tm> <bch:tm>     <dbl> <bch:byt>
#> 1 stream_data(req)       5.17s    5.17s     0.194     139MB
#> 2 stream_data_raw(req)   5.51s    5.51s     0.181     975KB
```

Created on 2025-04-01 with reprex v2.1.1
Fixing the memory allocations is going to require a couple of hours of work. I'll first need to create a ring buffer implementation so that we can retrieve and use raw bytes from the connection without allocating memory. Then I'll need to rewrite the event boundary functions to work with some sort of callback function on the ring buffer.
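For illustration, a minimal sketch of the ring-buffer idea (not httr2's implementation): a fixed-capacity raw vector plus read/write positions, so incoming bytes can be appended and consumed without growing a buffer. Note that in plain R the subassignment below may still copy the backing vector, so this only illustrates the bookkeeping.

```r
# Illustrative only: a fixed-capacity ring buffer over a raw vector.
new_ring_buffer <- function(capacity = 64 * 1024) {
  buf  <- raw(capacity)
  head <- 0L  # next write position (0-based)
  tail <- 0L  # next read position (0-based)
  size <- 0L  # bytes currently stored

  push <- function(bytes) {
    n <- length(bytes)
    stopifnot(size + n <= capacity)
    idx <- ((head + seq_len(n) - 1L) %% capacity) + 1L  # wrap around the end
    buf[idx] <<- bytes
    head <<- (head + n) %% capacity
    size <<- size + n
    invisible(n)
  }

  pop <- function(n = size) {
    n <- min(n, size)
    idx <- ((tail + seq_len(n) - 1L) %% capacity) + 1L
    out <- buf[idx]
    tail <<- (tail + n) %% capacity
    size <<- size - n
    out
  }

  list(push = push, pop = pop, bytes = function() size)
}

# usage
rb <- new_ring_buffer()
rb$push(charToRaw("hello world"))
rawToChar(rb$pop(5))
#> [1] "hello"
```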
With ~4 hours of work:
I thought it would be a bigger difference 😬 Also it's way slower than the previous code 😞
Fixes #704. Initial benchmark indicates that there's more work to be done:

```
  expression               min   median `itr/sec` mem_alloc
  <bch:expr>          <bch:tm> <bch:tm>     <dbl> <bch:byt>
1 stream_data(req)       6.83s    6.83s     0.147   43.03MB
2 stream_data_raw(req)   5.38s    5.38s     0.186    1.28MB
```
Hi, any update on this one?
@Seb-FS-Axpo did you try with the new curl, as suggested in #704 (comment)?
Hi @jeroen, thanks for the quick feedback.
@Seb-FS-Axpo httr2 is based on curl. If you upgrade curl, the problem will be fixed in httr2 too.
@jeroen I think there's still work to do in httr2, since we also do buffering that seems to be creating a bunch of copies.
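To illustrate the kind of copying that buffering-by-concatenation causes (a generic illustration, not httr2's actual buffering code): growing a raw buffer with `c()` reallocates and copies the whole accumulated buffer on every chunk, so total allocations grow quadratically with the response size.

```r
# 1000 chunks of 1 KB each, as might arrive from a streamed response.
chunks <- replicate(1000, as.raw(sample(0:255, 1024, replace = TRUE)),
                    simplify = FALSE)

bench::mark(
  grow_with_c = {
    buf <- raw()
    for (chunk in chunks) buf <- c(buf, chunk)  # copies `buf` every time
    length(buf)
  },
  collect_then_join = {
    length(unlist(chunks, use.names = FALSE))   # single concatenation
  }
)
```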
While refactoring the rnpn package to use httr2, I've discovered that streaming ndjson with `req_perform_connection()` and `resp_stream_lines()` takes more time and significantly more memory than using curl + `jsonlite::stream_in()`, so much so that I'm going to have to revert the change, as users are running up against memory limitations. I'm not sure if this is just additional overhead due to features of httr2, or if it is something that can be addressed (or possibly I'm doing things wrong!).

For example, a request that uses ~17MB of RAM with curl + `jsonlite::stream_in()` uses ~1GB of RAM with httr2.

Full benchmark code:

Created on 2025-03-13 with reprex v2.1.1