S3 metadata and file size checks #165
Ah, the new functionality can be enabled for v1 data with an argument:
metadata <- spod_available_data(ver = 1, use_s3 = TRUE)
Data from S3 is cached; the cache can be refreshed with:
metadata <- spod_available_data(ver = 1, use_s3 = TRUE, s3_force_update = TRUE)
If all goes well, maybe I will even change use_s3 to TRUE by default, as it is much better to know the file sizes in advance, and they are unfortunately not in the XML...
This is great news indeed!
All is well. Except some files on Amazon S3 are stored with an incorrect file type, and the reported size in the metadata is incorrect for literally hundreds of files. It differs from the size we can get by HEADing the file URL... UPDATE
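For illustration, the HEAD-based size check can be done with a minimal httr2 sketch like this (the URL is a placeholder, not a real file from the bucket):
# placeholder URL; substitute a real file URL from the metadata
url <- "https://example-bucket.s3.amazonaws.com/some/file.csv.gz"
head_size <- httr2::request(url) |>
  httr2::req_method("HEAD") |>
  httr2::req_perform() |>
  httr2::resp_header("content-length") |>
  as.numeric()
# compare head_size against the size reported in the S3 listing metadata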
This pull request introduces several enhancements, new features, and bug fixes to improve the functionality and reliability of the package:
- New Features
- Dependency Updates
- Bug Fixes and Improvements
- Developer Utilities
Back to draft temporarily: v2 data also has some ETag/checksum mismatches.
Good luck with the final issues. I'll keep my eyes out, but do tag me when ready to review. On a different topic, I've started using air format . and it is so quick and effective, and it doesn't override my unconventional...
@Robinlovelace 👍 I am using Air within Positron and also enjoying it, that's why some R files changed a lot in this PR.
Ok, I would actually like to merge this now (perhaps adding an extra warning that some checks may fail for v2 data). @Robinlovelace would you be able to review the PR and test the new/fixed functions?
Will take a quick look now.
Sounds like a good plan.
Good stuff, will read with interest how that's been replaced. As an aside, I've frequently hit issues with installing {curl}.
Is it still experimental and, if so, how is that signalled to users? I should probably just look at the code, but I'm making high-level comments first.
All experimental functions are marked in the docs (https://ropenspain.github.io/spanishoddata/reference/index.html) with an experimental label (also visible in RStudio/Positron), and they also have a brief description of what is experimental about them (e.g. https://ropenspain.github.io/spanishoddata/reference/spod_quick_get_od.html clearly says that the API may change and the function can break at any time).
Apparently the built-in
Yes, #127 and partially #126. #126 is essentially solved by the file size checks. An easy way to test is:
spod_get("nt", zones = "distr", dates = c("2022-01-01", "2022-01-02"))
Then manually copy the file of "2022-01-01" in place of "2022-01-02" and run the code above again; it should re-download "2022-01-02".
This is a huge PR. Not tested locally, but the checks are passing and the benefits of having download sizes and using the AWS interface directly are clear to me, so 👍 to merging. Any regressions picked up by this can be fixed post merge. Suggestion: merge, then put out a message asking users to install the GitHub version with
remotes::install_dev("spanishoddata")
And to report any issues.
etag = gsub('\\"', '', .data$ETag)
) |>
dplyr::select(
  .data$target_url,
Why not
target_url,
instead of
.data$target_url,
out of interest?
I recall R CMD check complaining about undefined variables in functions and pointing to those tidy-eval column names unless they are prefixed with .data from rlang.
For example, I forgot to fix the function below:
spod_store_etags <- function() {
  # list v1 files and check which ones are present locally
  available_data <- spod_available_data(1, check_local_files = TRUE)
  available_data <- available_data |>
    dplyr::filter(downloaded == TRUE)
  # compute the S3-style ETag for every downloaded file (slow for big files)
  local_etags <- available_data$local_path |>
    purrr::map_chr(~ spod_compute_s3_etag(.x), .progress = TRUE)
  available_data <- available_data |>
    dplyr::mutate(local_etag = local_etags) |>
    dplyr::as_tibble()
  return(available_data)
}
I get:
❯ checking R code for possible problems ... NOTE
spod_store_etags: no visible binding for global variable ‘downloaded’
Undefined global functions or variables:
downloaded
Then if I prefix downloaded with bangs (is that what they are called?)/exclamation marks:
dplyr::filter(!!downloaded == TRUE)
Same NOTE:
❯ checking R code for possible problems ... NOTE
spod_store_etags: no visible binding for global variable ‘downloaded’
Undefined global functions or variables:
downloaded
Then replacing the problematic line with:
dplyr::filter(.data$downloaded == TRUE)
No notes anymore 🤷♂️
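Side note on the .data route: the pronoun itself also needs to be imported from rlang, otherwise R CMD check flags .data as an undefined global. A minimal way to do that with roxygen2:
# in any R file of the package: import the .data pronoun from rlang
#' @importFrom rlang .data
NULL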
Yeah, that's a fair reason for using the .data syntax.
I think messages like this:
spod_store_etags: no visible binding for global variable ‘downloaded’
can be resolved with utils::globalVariables("downloaded") somewhere in the package code, as outlined here: https://forum.posit.co/t/how-to-solve-no-visible-binding-for-global-variable-note/28887/2, but in the very next post the .data syntax is recommended. Was just trying to understand the reasoning.
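For reference, a minimal sketch of that globalVariables alternative (the file name zzz.R is only a convention, and the variable list here is illustrative):
# in a file such as R/zzz.R: declare NSE column names up front so
# R CMD check does not emit the "no visible binding" NOTE for them
utils::globalVariables(c("downloaded", "target_url", "ETag"))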
Great news. We now have experimental metadata fetching (for v1 data only for now) from the Amazon S3 bucket where MITMS stores their CSVs (so it is NOT a re-upload by me).
I also implemented an optional file size check and tested how it performs on the full v1 dataset of 400+ files for districts. It seems like the cost of getting the local file size is basically identical to just checking whether the file exists, so I would say we replace the file existence check entirely with the file size check.
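To illustrate the idea (a sketch only, not the package's actual code): base R's file.size() returns NA for a missing file, so a single comparison against the expected size from the S3 metadata covers both the existence check and the integrity check:
needs_download <- function(local_path, expected_size) {
  actual_size <- file.size(local_path) # NA if the file does not exist
  is.na(actual_size) || actual_size != expected_size
}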
Key takeaways so far: the metadata fetching currently relies on the aws.s3 package (but perhaps I can replace it later with simpler httr2 code that does not need a whole new import). Therefore we will soon be able to fix #126 and #127.
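For context, a rough sketch of what that httr2 replacement could look like, using S3's public ListObjectsV2 API (the bucket URL below is a placeholder, not the real MITMS bucket, and pagination via continuation tokens is omitted):
# placeholder bucket URL; the real MITMS bucket name and region differ
bucket_url <- "https://example-bucket.s3.eu-west-1.amazonaws.com/"
resp <- httr2::request(bucket_url) |>
  httr2::req_url_query(`list-type` = 2, `max-keys` = 1000) |>
  httr2::req_perform()
# S3 answers with XML; strip the namespace for simpler XPath queries
xml <- xml2::read_xml(httr2::resp_body_string(resp))
xml2::xml_ns_strip(xml)
contents <- xml2::xml_find_all(xml, ".//Contents")
files <- tibble::tibble(
  key = xml2::xml_text(xml2::xml_find_first(contents, "Key")),
  size = as.numeric(xml2::xml_text(xml2::xml_find_first(contents, "Size"))),
  etag = gsub('"', "", xml2::xml_text(xml2::xml_find_first(contents, "ETag")))
)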
S3 metadata also has an ETag, which is a fancy MD5 checksum. I already know how to calculate it for local files, but it takes a lot of time. So it is possible to create another small helper function that a user could use to verify the integrity of their data in the not-so-likely event that the file sizes match but the data is corrupted.
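A minimal sketch of such a helper (not necessarily how spod_compute_s3_etag works internally; the 8 MB part size is an assumption and must match the part size used when the files were uploaded):
compute_s3_etag <- function(path, part_size = 8 * 1024^2) {
  if (file.size(path) <= part_size) {
    # single-part upload: the ETag is simply the file's MD5 in hex
    return(digest::digest(path, algo = "md5", file = TRUE))
  }
  con <- file(path, open = "rb")
  on.exit(close(con))
  # raw MD5 digest of each part, read sequentially
  part_md5s <- list()
  repeat {
    chunk <- readBin(con, what = "raw", n = part_size)
    if (length(chunk) == 0) break
    part_md5s[[length(part_md5s) + 1]] <-
      digest::digest(chunk, algo = "md5", serialize = FALSE, raw = TRUE)
  }
  # multipart upload: MD5 of the concatenated raw part digests, plus "-<n parts>"
  combined <- digest::digest(
    do.call(c, part_md5s), algo = "md5", serialize = FALSE
  )
  paste0(combined, "-", length(part_md5s))
}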
This draft pull request will stay here while I am working on implementing the S3 metadata for both data versions, the file size checks, and download resuming for when curl multi-download fails silently.