
S3 metadata and file size checks #165


Merged: 52 commits merged from s3-metadata into main on Jun 13, 2025

Conversation

@e-kotov (Member) commented May 16, 2025

Great news. We now have experimental metadata fetching (for v1 data only for now) from the Amazon S3 bucket where MITMS stores their CSVs (so it is NOT a re-upload by me).

I also implemented an optional file size check and tested how it performs. I tested this on the full v1 dataset of 400+ files for districts:

[Screenshot 2025-05-16: benchmark comparing the time to check file existence vs. fetching local file sizes for the 400+ v1 district files]

It seems like the cost of getting the file size is basically identical to just checking whether the file is there. So I would say we replace the file existence check entirely with the local file size check.

Key takeaways so far:

  1. We can reliably and quickly get the file sizes of all files (currently with the aws.s3 package, though perhaps I can later replace it with simpler httr2 code that does not need a whole new import); see the listing sketch after this list.
  2. We can quickly check the file sizes on disk and therefore identify if any file is damaged.
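For reference, a minimal sketch of the kind of listing call this relies on, using the aws.s3 package; the bucket name and prefix below are placeholders (not the actual MITMS bucket), and anonymous access to a public bucket may need extra configuration:

library(aws.s3)

# Hedged sketch: list the objects in a (placeholder) public S3 bucket and keep
# the columns relevant for size checks. Bucket and prefix are illustrative only.
listing <- aws.s3::get_bucket_df(
  bucket = "example-mitms-open-data", # placeholder, not the real bucket name
  prefix = "v1/districts/",           # placeholder prefix
  max = Inf                           # fetch the full listing, not just 1000 keys
)

# Size gives remote file sizes in bytes; ETag is an MD5-like identifier
listing[, c("Key", "Size", "ETag", "LastModified")]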

Therefore we will soon be able to fix #126, #127.

S3 metadata also includes an ETag, which is essentially an MD5-based checksum. I already know how to calculate it for local files, but it takes a lot of time. So it is possible to create another small helper function that a user could run to verify the integrity of their data in the not-so-likely event that the file sizes match but the data is corrupted.
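For context, here is a minimal sketch of how such an S3-style ETag could be computed for a local file. This is not the package's implementation; the helper name and the 8 MB part size are assumptions, and the real part size depends on how the files were uploaded to S3.

library(digest)

# Hypothetical helper: for multipart uploads, the S3 ETag is the MD5 of the
# concatenated binary MD5s of each part, followed by "-<number of parts>".
compute_s3_etag_sketch <- function(path, part_size = 8 * 1024^2) {
  con <- file(path, "rb")
  on.exit(close(con))
  part_md5s <- list()
  repeat {
    chunk <- readBin(con, what = "raw", n = part_size)
    if (length(chunk) == 0) break
    part_md5s[[length(part_md5s) + 1]] <-
      digest::digest(chunk, algo = "md5", serialize = FALSE, raw = TRUE)
  }
  if (length(part_md5s) <= 1) {
    # single-part upload: the ETag is simply the MD5 of the whole file
    return(digest::digest(path, algo = "md5", file = TRUE))
  }
  combined_md5 <- digest::digest(
    do.call(c, part_md5s), algo = "md5", serialize = FALSE
  )
  paste0(combined_md5, "-", length(part_md5s))
}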

This draft pull request will stay here while I work on implementing the S3 metadata for both data versions, file size checks, and download resuming for the cases where curl multi download fails silently.

@e-kotov (Member, Author) commented May 16, 2025

Ah, the new functionality can be enabled for v1 data with an argument:

metadata <- spod_available_data(ver = 1, use_s3 = TRUE)

Data from S3 is cached with memoise (perhaps it is a good idea to cache the metadata we read from the XML too, for speed). The cache can be overridden with:

metadata <- spod_available_data(ver = 1, use_s3 = TRUE, s3_force_update = TRUE)

I will probably rename s3_force_update to something else if I cache the XML file reads too.

If all goes well, maybe I will even change use_s3 to TRUE by default, as it is much better to know the file sizes in advance, and those are unfortunately not in the XML...
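For illustration, a minimal sketch of how this kind of memoise-based caching can be wired up; the function names here are hypothetical, not the package's internals:

library(memoise)
library(cachem)

# Hypothetical fetcher for the S3 listing; the real package function differs.
fetch_s3_listing <- function(ver = 1) {
  # ... call the S3 API here; returning an empty placeholder in this sketch ...
  data.frame(key = character(), size = numeric(), etag = character())
}

# Cache results in memory for an hour; an argument like s3_force_update = TRUE
# could then map to memoise::forget() or to bypassing the memoised wrapper.
fetch_s3_listing_cached <- memoise::memoise(
  fetch_s3_listing,
  cache = cachem::cache_mem(max_age = 60 * 60)
)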

@Robinlovelace (Collaborator) commented:

This is great news indeed!

@e-kotov (Member, Author) commented May 17, 2025

[Screenshot 2025-05-17: benchmark of local file size checks on 1359 v2 data files]

Here are some more results, this time testing on 1359 files of v2 data actually present on a network-attached SSD on an HPC system.

Overall, the time cost of the file size check seems reasonable enough to run it by default in every spod_download() call to catch potentially broken files.

For internal use, spod_available_data_v1() and spod_available_data_v2() can be enhanced to filter for the currently requested type of data, both in the call to Amazon S3 for file sizes and in the call to the file system for local file checks.

Further plan in this branch and PR:

  • completely replace the check for the existence of files on disk with a file size check, as the latter clearly has more value while taking the same time (see the sketch after this list)
  • turn on the local file size check in spod_available_data() by default when it is used internally by spod_download()
  • for each download request, include the files with mismatching sizes in the calls to curl::multi_download() so that it can resume the downloads. UPDATE: I could not make curl::multi_download() behave and reliably download files in batch; on some connections it just fails. Switching to base R downloads in sequential mode.
  • maybe make fetching metadata from S3 the default in spod_available_data() instead of the XML file. As the files are hosted on S3, the official RSS.xml file with download links does not offer as much value (it has neither file sizes nor ETags, which could potentially be used for MD5 checksum verification of local files) and is secondary to what the S3 API returns. I will do a few speed tests to check which is faster. If S3 becomes the default, the RSS.xml download can always be used as a fallback.
  • maybe cache metadata inside the R package as RDS (or better, rehost it on GitHub in the package repo: the RDS download may be much faster than fetching the metadata from S3 via the API, and with a hosted RDS we do not have to update the package every once in a while just to ship updated RDS metadata) for known files (for the last 5 years) to save on requests to S3 for even better speed. This way, we can create a filter for S3 file listing requests and not request the data for old files whose file sizes we already know. We can also set up a simple script in GitHub Actions that will keep this RDS cache updated on a daily basis. I have not noticed old files being updated; my worry is that it CAN happen at some point if some error in past data is identified and the data is updated. UPDATE: I cached true sizes for v1 data. For v2 data it is OK to get file sizes from S3; they seem correct.
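On the first point above, a minimal sketch of why the size check can subsume the existence check; the metadata columns here are illustrative, not the package's actual schema:

# Illustrative metadata; in the package this would come from the S3 listing.
metadata <- data.frame(
  local_path = c("data/file_a.csv.gz", "data/file_b.csv.gz"),
  remote_size_bytes = c(123456, 789012)
)

# file.size() returns NA for files that do not exist, so one vectorised call
# covers both "is the file there?" and "does its size match the S3 metadata?".
local_sizes <- file.size(metadata$local_path)
needs_download <- is.na(local_sizes) | local_sizes != metadata$remote_size_bytes
metadata[needs_download, ]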

@e-kotov (Member, Author) commented May 18, 2025

All is well, except that some files on Amazon S3 are stored with an incorrect file type, and the size reported in the metadata is incorrect for literally hundreds of files. It is different from the size we can get by HEADing the file URL...

UPDATE
Thankfully, this is only for v1 data. So I'm just going to cache the v1 metadata.
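For reference, a hedged sketch of the HEAD-based size check mentioned above, using httr2; the URL in the commented example is a placeholder:

library(httr2)

# Issue a HEAD request and read the Content-Length header; for some v1 files
# this value disagrees with the size reported in the S3 listing metadata.
head_size_bytes <- function(url) {
  resp <- httr2::request(url) |>
    httr2::req_method("HEAD") |>
    httr2::req_perform()
  as.numeric(httr2::resp_header(resp, "content-length"))
}

# head_size_bytes("https://example.com/some_file.csv.gz")  # placeholder URL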

@e-kotov e-kotov requested review from Copilot and Robinlovelace May 21, 2025 12:39
@Copilot Copilot AI (Contributor) left a comment

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@e-kotov (Member, Author) commented May 21, 2025

This pull request introduces several enhancements, new features, and bug fixes to improve the functionality and reliability of the spanishoddata package. Key changes include the addition of new functions for S3 metadata handling and file consistency checks, removal of the curl dependency, and improvements to data directory management. Below is a categorized summary of the most important changes:

New Features

  • Added the spod_check_files() function to verify the consistency of downloaded files against Amazon S3 checksums. This function is experimental and allows users to validate file integrity. It works similarly to the other functions and requires type, zones and dates arguments. It also has an n_threads option to run the verification in parallel using the cutting-edge {future.mirai} backend, installable on demand. Using n_threads makes sense when checking more than about 50 large files and if you have 6-8 cores to do the job (see the usage sketch after this list).
  • Introduced the spod_available_data_s3() function internally for fetching metadata about available data files directly from Amazon S3, with caching and memoization for efficiency. It falls back to the previously used XML file download. spod_available_data() is still the user-facing function to use.
  • For v1 data, file sizes and ETags are recalculated from downloaded files, as the S3-reported values were inaccurate; the results are cached inside the package. This is not an issue for v2 data.
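A hedged usage sketch of the new spod_check_files(); the argument values are illustrative, and exact argument names may differ slightly in the merged version:

# Verify previously downloaded data against the S3 checksums; n_threads runs
# the check in parallel via the {future.mirai} backend.
check_results <- spod_check_files(
  type = "nt",    # number of trips, as in spod_get() / spod_download()
  zones = "distr",
  dates = c("2022-01-01", "2022-01-02"),
  n_threads = 4   # only worth it for ~50+ large files and 6-8 available cores
)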

Dependency Updates

  • Removed the curl dependency due to issues with curl::multi_download() on some connections. File downloads now use base R's utils::download.file() with the libcurl backend, which can also pull several files simultaneously (see the sketch after this list).
  • Added paws.storage (>= 0.4.0) as a new dependency for interacting with Amazon S3.
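A minimal sketch of the base R replacement mentioned in the first item; the URLs are placeholders. With method = "libcurl", utils::download.file() accepts vectors of URLs and destination paths and fetches them together:

# Placeholder URLs; in the package these come from the S3/XML metadata.
urls <- c(
  "https://example.com/data/file_1.csv.gz",
  "https://example.com/data/file_2.csv.gz"
)
destfiles <- file.path(tempdir(), basename(urls))

# One call downloads all files; the return value is an overall status code,
# so per-file success is re-checked afterwards (e.g. via the file size check).
status <- utils::download.file(
  url = urls,
  destfile = destfiles,
  method = "libcurl",
  mode = "wb",
  quiet = TRUE
)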

Bug Fixes and Improvements

  • Improved metadata fetching from Amazon S3, enabling validation of downloaded files using both size and checksum.
  • Enhanced spod_convert() to support overwrite = 'update' with save_format = 'parquet' (see the usage sketch after this list).
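A hedged usage sketch of the overwrite = 'update' behaviour; argument values are illustrative, and the package reference remains the authoritative source for the signature:

# Convert downloaded CSVs to parquet; with overwrite = "update", only dates
# missing from the existing parquet store should be (re)converted.
converted_path <- spod_convert(
  type = "nt",
  zones = "distr",
  dates = c("2022-01-01", "2022-01-02"),
  save_format = "parquet",
  overwrite = "update"
)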

Developer Utilities

  • Added spod_files_sizes() to compute and save file sizes and ETags for v1 data, so that they can be cached.

@e-kotov e-kotov marked this pull request as draft May 21, 2025 15:45
@e-kotov (Member, Author) commented May 21, 2025

Back to draft temporarily: v2 data also has some ETag/checksum mismatches.

@Robinlovelace (Collaborator) commented:

Good luck with the final issues. I'll keep my eyes out, but do tag me when it's ready to review. On a different topic, I've started using

air format .

and it is so quick and effective, and it doesn't override my unconventional = assignment operators in my packages, which is nice, but I will stick with convention and use <- here.

@e-kotov (Member, Author) commented May 21, 2025

@Robinlovelace 👍 I am using Air within Positron and also enjoying it; that's why some R files changed a lot in this PR.

@e-kotov e-kotov marked this pull request as ready for review June 10, 2025 09:40
@e-kotov (Member, Author) commented Jun 10, 2025

OK, I would actually like to merge this now (perhaps adding an extra warning that some checks with spod_check_files() may fail, with known cases for May 2022 and some dates in 2025), as this update is important for the upcoming summer workshops: it provides more reliable data downloads than the current CRAN version. I will get back to implementing #126 later this summer. The file size check is already reliable enough, so checksum checks are more of a nice-to-have at this moment.

@Robinlovelace would you be able to review the PR and test the new/fixed functions?

@Robinlovelace (Collaborator) commented:

Will take a quick look now.

@Robinlovelace (Collaborator) commented:

It seems like the cost of getting the file size is basically identical to just checking whether the file is there. So I would say we replace the file existence check entirely with the local file size check.

Sounds like a good plan.

@Robinlovelace (Collaborator) commented:

This PR fixes these as far as I can tell:

Therefore we will soon be able to fix #126, #127.

Let me know if not @e-kotov

@Robinlovelace (Collaborator) commented:

removal of the curl dependency

Good stuff, I will read with interest how that's been replaced. As an aside, I've frequently hit issues with installing {curl}.

@Robinlovelace (Collaborator) commented:

Added spod_check_files() function to verify the consistency of downloaded files against Amazon S3 checksums. This function is experimental

Is it still experimental and, if so, how is that signalled to users? I should probably just look at the code, but I'm making high-level comments first.

@e-kotov (Member, Author) commented Jun 10, 2025

Is it still experimental and, if so, how is that signalled to users?

All experimental functions are marked in the docs (https://ropenspain.github.io/spanishoddata/reference/index.html) with an experimental label (also visible in RStudio/Positron), and they also have a brief description of what is experimental about them (e.g. https://ropenspain.github.io/spanishoddata/reference/spod_quick_get_od.html clearly says that the API may change and the function can break at any time).
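For context, a common way to signal this in R package docs is a {lifecycle} badge in the roxygen header; this is a generic sketch, not necessarily how spanishoddata implements the label:

#' Check downloaded files against S3 checksums
#'
#' @description
#' `r lifecycle::badge("experimental")`
#' This function is experimental: the metadata it relies on may change,
#' and the function may break or change behaviour at any time.
spod_check_files_stub <- function(type, zones, dates, n_threads = 1) {
  # placeholder body for the sketch
}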

how that's been replaced.

Apparently the built-in download.file() is very good at parallel downloads. I had to do some hacking to make the progress bar work. It is not as informative and near-real-time as it was with curl::multi_download(), but at least it does not fail, and I can control the re-downloading more easily than with curl::multi_download().

This PR fixes these as far as I can tell:

Therefore we will soon be able to fix #126, #127.

Let me know if not @e-kotov

Yes, #127 and partially #126. #126 is essentially solved by the file size checks.

An easy way to test is:

  1. Get some data:
spod_get("nt", zones = "distr", dates = c("2022-01-01", "2022-01-02"))

  2. Manually copy the file for "2022-01-01" in place of "2022-01-02", then run the code above again; it should re-download "2022-01-02" (see the sketch below).
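A hedged sketch of that manual test in code; the paths are placeholders and depend on your configured data directory:

# 1. Download two days of data
spod_get("nt", zones = "distr", dates = c("2022-01-01", "2022-01-02"))

# 2. Overwrite the 2022-01-02 file with the 2022-01-01 file to simulate a
#    broken download (placeholder paths; adjust to your data directory)
path_ok     <- "<data_dir>/v1/.../2022-01-01_distritos.csv.gz"  # placeholder
path_broken <- "<data_dir>/v1/.../2022-01-02_distritos.csv.gz"  # placeholder
file.copy(path_ok, path_broken, overwrite = TRUE)

# 3. Re-run the same call; the size mismatch should trigger a re-download
spod_get("nt", zones = "distr", dates = c("2022-01-01", "2022-01-02"))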

@Robinlovelace (Collaborator) left a comment

This is a huge PR. I have not tested it locally, but the checks are passing and the benefits of having the download size and using the AWS interface directly are clear to me, so 👍 to merging. Any regressions picked up by this can be fixed post-merge. Suggestion: merge, then put out a message asking users to install the GitHub version with

remotes::install_dev("spanishoddata")

and to report any issues.

etag = gsub('\\"', '', .data$ETag)
) |>
dplyr::select(
.data$target_url,
@Robinlovelace (Collaborator) commented:

Why not

Suggested change
.data$target_url,
target_url,

out of interest?


@Robinlovelace (Collaborator) commented:

This use case seems different; in our case target_url is the hardcoded column name, not a text string, right?

library(dplyr)

for (var in names(mtcars)) {
  mtcars %>% count(.data[[var]]) %>% print()
}

@e-kotov (Member, Author) commented Jun 13, 2025

I recall R CMD check complaining about undefined variables in functions and pointing to those tidy-eval column names, unless they are prefixed with .data$ from rlang.

@e-kotov (Member, Author) commented:

@Robinlovelace

For example, I forgot to fix the function below:

spod_store_etags <- function() {
  available_data <- spod_available_data(1, check_local_files = TRUE)
  available_data <- available_data |>
    dplyr::filter(downloaded == TRUE) # bare `downloaded` triggers the NOTE below
  local_etags <- available_data$local_path |>
    purrr::map_chr(~ spod_compute_s3_etag(.x), .progress = TRUE)
  available_data <- available_data |>
    dplyr::mutate(local_etag = local_etags) |>
    dplyr::as_tibble()
  return(available_data)
}

I get:

❯ checking R code for possible problems ... NOTE
  spod_store_etags: no visible binding for global variable ‘downloaded’
  Undefined global functions or variables:
    downloaded

Then, if I prefix downloaded with bangs (is that what they're called...?)/exclamation marks:

dplyr::filter(!!downloaded == TRUE)

Same NOTE:

❯ checking R code for possible problems ... NOTE
  spod_store_etags: no visible binding for global variable ‘downloaded’
  Undefined global functions or variables:
    downloaded

Then replacing the problematic line with:

dplyr::filter(.data$downloaded == TRUE)

No notes anymore 🤷‍♂️

@Robinlovelace (Collaborator) commented:

Yeah, that's a fair reason for using the .data syntax.

I think messages like this

  spod_store_etags: no visible binding for global variable ‘downloaded’

can be resolved with utils::globalVariables("downloaded") somewhere in the package code, as outlined here: https://forum.posit.co/t/how-to-solve-no-visible-binding-for-global-variable-note/28887/2, but in the very next post the .data syntax is recommended. I was just trying to understand the reasoning.
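For completeness, a minimal sketch of that alternative, typically kept in a file such as R/globals.R; whether it is preferable to the .data pronoun is exactly the question discussed above:

# Declare non-standard-evaluation column names so that R CMD check does not
# flag them as undefined global variables.
utils::globalVariables(c("downloaded", "target_url"))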

@e-kotov e-kotov merged commit 9d0ac9d into main Jun 13, 2025
5 checks passed
@e-kotov e-kotov deleted the s3-metadata branch June 13, 2025 17:07