
S3 metadata and file size checks #165


Merged: 52 commits merged from s3-metadata into main on Jun 13, 2025

Conversation

@e-kotov (Member) commented May 16, 2025

Great news. We now have experimental metadata fetching (for v1 data only for now) from the Amazon S3 bucket where MITMS stores their CSVs (so it is NOT a re-upload by me).

I also implemented an optional file size check and tested how it performs. I tested this on the full v1 dataset of 400+ files for districts:

[Screenshot 2025-05-16: benchmark comparing the time to check file existence vs. fetching local file sizes for the 400+ v1 district files]

It seems like the cost of getting the file size is basically identical to just checking whether the file is there. So I would say we replace the file existence check entirely with the local file size check.

Key takeaways so far:

  1. We can reliably and quickly get the file sizes of all files (currently with the aws.s3 package, though perhaps I can later replace it with simpler httr2 code that does not need a whole new import); see the listing sketch after this list.
  2. We can quickly check the file sizes on disk and therefore identify if any file is damaged.
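For reference, a minimal sketch of the kind of listing call this relies on, using the aws.s3 package; the bucket name and prefix below are placeholders (not the actual MITMS bucket), and anonymous access to a public bucket may need extra configuration:

library(aws.s3)

# Hedged sketch: list the objects in a (placeholder) public S3 bucket and keep
# the columns relevant for size checks. Bucket and prefix are illustrative only.
listing <- aws.s3::get_bucket_df(
  bucket = "example-mitms-open-data", # placeholder, not the real bucket name
  prefix = "v1/districts/",           # placeholder prefix
  max = Inf                           # fetch the full listing, not just 1000 keys
)

# Size gives remote file sizes in bytes; ETag is an MD5-like identifier
listing[, c("Key", "Size", "ETag", "LastModified")]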

Therefore we will soon be able to fix #126, #127.

S3 metadata also includes an ETag, which is essentially an MD5-based checksum. I already know how to calculate it for local files, but it takes a lot of time. So it is possible to create another small helper function that a user could run to verify the integrity of their data in the not-so-likely event that the file sizes match but the data is corrupted.
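For context, here is a minimal sketch of how such an S3-style ETag could be computed for a local file. This is not the package's implementation; the helper name and the 8 MB part size are assumptions, and the real part size depends on how the files were uploaded to S3.

library(digest)

# Hypothetical helper: for multipart uploads, the S3 ETag is the MD5 of the
# concatenated binary MD5s of each part, followed by "-<number of parts>".
compute_s3_etag_sketch <- function(path, part_size = 8 * 1024^2) {
  con <- file(path, "rb")
  on.exit(close(con))
  part_md5s <- list()
  repeat {
    chunk <- readBin(con, what = "raw", n = part_size)
    if (length(chunk) == 0) break
    part_md5s[[length(part_md5s) + 1]] <-
      digest::digest(chunk, algo = "md5", serialize = FALSE, raw = TRUE)
  }
  if (length(part_md5s) <= 1) {
    # single-part upload: the ETag is simply the MD5 of the whole file
    return(digest::digest(path, algo = "md5", file = TRUE))
  }
  combined_md5 <- digest::digest(
    do.call(c, part_md5s), algo = "md5", serialize = FALSE
  )
  paste0(combined_md5, "-", length(part_md5s))
}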

This draft pull request will stay here while I work on implementing the S3 metadata for both data versions, file size checks, and download resuming for the cases where curl multi download fails silently.

@e-kotov (Member, Author) commented May 16, 2025

Ah, the new functionality can be enabled for v1 data with an argument:

metadata <- spod_available_data(ver = 1, use_s3 = TRUE)

Data from S3 is cached with memoise (perhaps it is a good idea to cache the metadata we read from the XML too, for speed). The cache can be overridden with:

metadata <- spod_available_data(ver = 1, use_s3 = TRUE, s3_force_update = TRUE)

I will probably rename s3_force_update to something else if I cache the XML file reads too.

If all goes well, maybe I will even change use_s3 to TRUE by default, as it is much better to know the file sizes in advance, and those are unfortunately not in the XML...
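For illustration, a minimal sketch of how this kind of memoise-based caching can be wired up; the function names here are hypothetical, not the package's internals:

library(memoise)
library(cachem)

# Hypothetical fetcher for the S3 listing; the real package function differs.
fetch_s3_listing <- function(ver = 1) {
  # ... call the S3 API here; returning an empty placeholder in this sketch ...
  data.frame(key = character(), size = numeric(), etag = character())
}

# Cache results in memory for an hour; an argument like s3_force_update = TRUE
# could then map to memoise::forget() or to bypassing the memoised wrapper.
fetch_s3_listing_cached <- memoise::memoise(
  fetch_s3_listing,
  cache = cachem::cache_mem(max_age = 60 * 60)
)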

@Robinlovelace (Collaborator) commented:

This is great news indeed!

@e-kotov (Member, Author) commented May 17, 2025

[Screenshot 2025-05-17: benchmark of local file size checks on 1359 v2 data files]

Here are some more results, this time testing on 1359 files of v2 data actually present on a network-attached SSD on an HPC system.

Overall, the time cost of the file size check seems reasonable enough to run it by default in every spod_download() call to catch potentially broken files.

For internal use, spod_available_data_v1() and spod_available_data_v2() can be enhanced to filter for the currently requested type of data, both in the call to Amazon S3 for file sizes and in the call to the file system for local file checks.

Further plan in this branch and PR:

  • completely replace the check for the existence of files on disk with a file size check, as the latter clearly has more value while taking the same time (see the sketch after this list)
  • turn on the local file size check in spod_available_data() by default when it is used internally by spod_download()
  • for each download request, include the files with mismatching sizes in the calls to curl::multi_download() so that it can resume the downloads. UPDATE: I could not make curl::multi_download() behave and reliably download files in batch; on some connections it just fails. Switching to base R downloads in sequential mode.
  • maybe make fetching metadata from S3 the default in spod_available_data() instead of the XML file. As the files are hosted on S3, the official RSS.xml file with download links does not offer as much value (it has neither file sizes nor ETags, which could potentially be used for MD5 checksum verification of local files) and is secondary to what the S3 API returns. I will do a few speed tests to check which is faster. If S3 becomes the default, the RSS.xml download can always be used as a fallback.
  • maybe cache metadata inside the R package as RDS (or better, rehost it on GitHub in the package repo: the RDS download may be much faster than fetching the metadata from S3 via the API, and with a hosted RDS we do not have to update the package every once in a while just to ship updated RDS metadata) for known files (for the last 5 years) to save on requests to S3 for even better speed. This way, we can create a filter for S3 file listing requests and not request the data for old files whose file sizes we already know. We can also set up a simple script in GitHub Actions that will keep this RDS cache updated on a daily basis. I have not noticed old files being updated; my worry is that it CAN happen at some point if some error in past data is identified and the data is updated. UPDATE: I cached true sizes for v1 data. For v2 data it is OK to get file sizes from S3; they seem correct.
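On the first point above, a minimal sketch of why the size check can subsume the existence check; the metadata columns here are illustrative, not the package's actual schema:

# Illustrative metadata; in the package this would come from the S3 listing.
metadata <- data.frame(
  local_path = c("data/file_a.csv.gz", "data/file_b.csv.gz"),
  remote_size_bytes = c(123456, 789012)
)

# file.size() returns NA for files that do not exist, so one vectorised call
# covers both "is the file there?" and "does its size match the S3 metadata?".
local_sizes <- file.size(metadata$local_path)
needs_download <- is.na(local_sizes) | local_sizes != metadata$remote_size_bytes
metadata[needs_download, ]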

@e-kotov (Member, Author) commented May 18, 2025

All is well, except that some files on Amazon S3 are stored with an incorrect file type, and the size reported in the metadata is incorrect for literally hundreds of files. It is different from the size we can get by HEADing the file URL...

UPDATE
Thankfully, this is only for v1 data. So I'm just going to cache the v1 metadata.
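For reference, a hedged sketch of the HEAD-based size check mentioned above, using httr2; the URL in the commented example is a placeholder:

library(httr2)

# Issue a HEAD request and read the Content-Length header; for some v1 files
# this value disagrees with the size reported in the S3 listing metadata.
head_size_bytes <- function(url) {
  resp <- httr2::request(url) |>
    httr2::req_method("HEAD") |>
    httr2::req_perform()
  as.numeric(httr2::resp_header(resp, "content-length"))
}

# head_size_bytes("https://example.com/some_file.csv.gz")  # placeholder URL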

@e-kotov e-kotov requested review from Copilot and Robinlovelace May 21, 2025 12:39
@Copilot Copilot AI (Contributor) left a comment

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@e-kotov (Member, Author) commented May 21, 2025

This pull request introduces several enhancements, new features, and bug fixes to improve the functionality and reliability of the spanishoddata package. Key changes include the addition of new functions for S3 metadata handling and file consistency checks, removal of the curl dependency, and improvements to data directory management. Below is a categorized summary of the most important changes:

New Features

  • Added the spod_check_files() function to verify the consistency of downloaded files against Amazon S3 checksums. This function is experimental and allows users to validate file integrity. It works similarly to the other functions and requires type, zones and dates arguments. It also has an n_threads option to run the verification in parallel using the cutting-edge {future.mirai} backend, installable on demand. Using n_threads makes sense when checking more than about 50 large files and if you have 6-8 cores to do the job (see the usage sketch after this list).
  • Introduced the spod_available_data_s3() function internally for fetching metadata about available data files directly from Amazon S3, with caching and memoization for efficiency. It falls back to the previously used XML file download. spod_available_data() is still the user-facing function to use.
  • For v1 data, file sizes and ETags are recalculated from downloaded files, as the S3-reported values were inaccurate; the results are cached inside the package. This is not an issue for v2 data.
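A hedged usage sketch of the new spod_check_files(); the argument values are illustrative, and exact argument names may differ slightly in the merged version:

# Verify previously downloaded data against the S3 checksums; n_threads runs
# the check in parallel via the {future.mirai} backend.
check_results <- spod_check_files(
  type = "nt",    # number of trips, as in spod_get() / spod_download()
  zones = "distr",
  dates = c("2022-01-01", "2022-01-02"),
  n_threads = 4   # only worth it for ~50+ large files and 6-8 available cores
)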

Dependency Updates

  • Removed the curl dependency due to issues with curl::multi_download() on some connections. File downloads now use base R's utils::download.file() with the libcurl backend, which can also pull several files simultaneously (see the sketch after this list).
  • Added paws.storage (>= 0.4.0) as a new dependency for interacting with Amazon S3.
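A minimal sketch of the base R replacement mentioned in the first item; the URLs are placeholders. With method = "libcurl", utils::download.file() accepts vectors of URLs and destination paths and fetches them together:

# Placeholder URLs; in the package these come from the S3/XML metadata.
urls <- c(
  "https://example.com/data/file_1.csv.gz",
  "https://example.com/data/file_2.csv.gz"
)
destfiles <- file.path(tempdir(), basename(urls))

# One call downloads all files; the return value is an overall status code,
# so per-file success is re-checked afterwards (e.g. via the file size check).
status <- utils::download.file(
  url = urls,
  destfile = destfiles,
  method = "libcurl",
  mode = "wb",
  quiet = TRUE
)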

Bug Fixes and Improvements

  • Improved metadata fetching from Amazon S3, enabling validation of downloaded files using both size and checksum.
  • Enhanced spod_convert() to support overwrite = 'update' with save_format = 'parquet' (see the usage sketch after this list).
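A hedged usage sketch of the overwrite = 'update' behaviour; argument values are illustrative, and the package reference remains the authoritative source for the signature:

# Convert downloaded CSVs to parquet; with overwrite = "update", only dates
# missing from the existing parquet store should be (re)converted.
converted_path <- spod_convert(
  type = "nt",
  zones = "distr",
  dates = c("2022-01-01", "2022-01-02"),
  save_format = "parquet",
  overwrite = "update"
)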

Developer Utilities

  • Added spod_files_sizes() to compute and save file sizes and ETags for v1 data, so that they can be cached.

@e-kotov e-kotov marked this pull request as draft May 21, 2025 15:45
@e-kotov (Member, Author) commented May 21, 2025

Back to draft temporarily: v2 data also has some ETag/checksum mismatches.

@Robinlovelace (Collaborator) commented:

Good luck with the final issues. I'll keep my eyes out, but do tag me when it's ready to review. On a different topic, I've started using

air format .

and it is so quick and effective, and it doesn't override my unconventional = assignment operators in my packages, which is nice, but I will stick with convention and use <- here.

@e-kotov (Member, Author) commented May 21, 2025

@Robinlovelace 👍 I am using Air within Positron and also enjoying it; that's why some R files changed a lot in this PR.

@e-kotov e-kotov marked this pull request as ready for review June 10, 2025 09:40
@e-kotov (Member, Author) commented Jun 10, 2025

OK, I would actually like to merge this now (perhaps adding an extra warning that some checks with spod_check_files() may fail, with known cases for May 2022 and some dates in 2025), as this update is important for the upcoming summer workshops: it provides more reliable data downloads than the current CRAN version. I will get back to implementing #126 later this summer. The file size check is already reliable enough, so checksum checks are more of a nice-to-have at this moment.

@Robinlovelace would you be able to review the PR and test the new/fixed functions?

@Robinlovelace (Collaborator) commented:

Will take a quick look now.

@Robinlovelace (Collaborator) commented:

It seems like the cost of getting the file size is basically identical to just checking whether the file is there. So I would say we replace the file existence check entirely with the local file size check.

Sounds like a good plan.

@Robinlovelace (Collaborator) commented:

This PR fixes these as far as I can tell:

Therefore we will soon be able to fix #126, #127.

Let me know if not @e-kotov

@Robinlovelace (Collaborator) commented:

removal of the curl dependency

Good stuff, I will read with interest how that's been replaced. As an aside, I've frequently hit issues with installing {curl}.

@Robinlovelace (Collaborator) commented:

Added spod_check_files() function to verify the consistency of downloaded files against Amazon S3 checksums. This function is experimental

Is it still experimental and, if so, how is that signalled to users? I should probably just look at the code, but I'm making high-level comments first.

@e-kotov (Member, Author) commented Jun 10, 2025

Is it still experimental and, if so, how is that signalled to users?

All experimental functions are marked in the docs (https://ropenspain.github.io/spanishoddata/reference/index.html) with an experimental label (also visible in RStudio/Positron), and they also have a brief description of what is experimental about them (e.g. https://ropenspain.github.io/spanishoddata/reference/spod_quick_get_od.html clearly says that the API may change and the function can break at any time).
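For context, a common way to signal this in R package docs is a {lifecycle} badge in the roxygen header; this is a generic sketch, not necessarily how spanishoddata implements the label:

#' Check downloaded files against S3 checksums
#'
#' @description
#' `r lifecycle::badge("experimental")`
#' This function is experimental: the metadata it relies on may change,
#' and the function may break or change behaviour at any time.
spod_check_files_stub <- function(type, zones, dates, n_threads = 1) {
  # placeholder body for the sketch
}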

how that's been replaced.

Apparently the built-in download.file() is very good at parallel downloads. I had to do some hacking to make the progress bar work. It is not as informative and near-real-time as it was with curl::multi_download(), but at least it does not fail, and I can control the re-downloading more easily than with curl::multi_download().

This PR fixes these as far as I can tell:

Therefore we will soon be able to fix #126, #127.

Let me know if not @e-kotov

Yes, #127 and partially #126. #126 is essentially solved by the file size checks.

An easy way to test is:

  1. Get some data:
spod_get("nt", zones = "distr", dates = c("2022-01-01", "2022-01-02"))

  2. Manually copy the file for "2022-01-01" in place of "2022-01-02", then run the code above again; it should re-download "2022-01-02" (see the sketch below).
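A hedged sketch of that manual test in code; the paths are placeholders and depend on your configured data directory:

# 1. Download two days of data
spod_get("nt", zones = "distr", dates = c("2022-01-01", "2022-01-02"))

# 2. Overwrite the 2022-01-02 file with the 2022-01-01 file to simulate a
#    broken download (placeholder paths; adjust to your data directory)
path_ok     <- "<data_dir>/v1/.../2022-01-01_distritos.csv.gz"  # placeholder
path_broken <- "<data_dir>/v1/.../2022-01-02_distritos.csv.gz"  # placeholder
file.copy(path_ok, path_broken, overwrite = TRUE)

# 3. Re-run the same call; the size mismatch should trigger a re-download
spod_get("nt", zones = "distr", dates = c("2022-01-01", "2022-01-02"))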

@Robinlovelace (Collaborator) left a comment

This is a huge PR. I have not tested it locally, but the checks are passing and the benefits of having the download size and using the AWS interface directly are clear to me, so 👍 to merging. Any regressions picked up by this can be fixed post-merge. Suggestion: merge, then put out a message asking users to install the GitHub version with

remotes::install_dev("spanishoddata")

and to report any issues.

etag = gsub('\\"', '', .data$ETag)
) |>
dplyr::select(
.data$target_url,
@Robinlovelace (Collaborator) commented:

Why not

Suggested change
.data$target_url,
target_url,

out of interest?


@Robinlovelace (Collaborator) commented:

This use case seems different; in our case target_url is the hardcoded column name, not a text string, right?

library(dplyr)

for (var in names(mtcars)) {
  mtcars %>% count(.data[[var]]) %>% print()
}

@e-kotov (Member, Author) commented Jun 13, 2025

I recall R CMD check complaining about undefined variables in functions and pointing to those tidy-eval column names, unless they are prefixed with .data$ from rlang.

@e-kotov (Member, Author) commented:

@Robinlovelace

For example, I forgot to fix the function below:

spod_store_etags <- function() {
  available_data <- spod_available_data(1, check_local_files = TRUE)
  available_data <- available_data |>
    dplyr::filter(downloaded == TRUE) # bare `downloaded` triggers the NOTE below
  local_etags <- available_data$local_path |>
    purrr::map_chr(~ spod_compute_s3_etag(.x), .progress = TRUE)
  available_data <- available_data |>
    dplyr::mutate(local_etag = local_etags) |>
    dplyr::as_tibble()
  return(available_data)
}

I get:

❯ checking R code for possible problems ... NOTE
  spod_store_etags: no visible binding for global variable ‘downloaded’
  Undefined global functions or variables:
    downloaded

Then, if I prefix downloaded with bangs (is that what they're called...?)/exclamation marks:

dplyr::filter(!!downloaded == TRUE)

Same NOTE:

❯ checking R code for possible problems ... NOTE
  spod_store_etags: no visible binding for global variable ‘downloaded’
  Undefined global functions or variables:
    downloaded

Then replacing the problematic line with:

dplyr::filter(.data$downloaded == TRUE)

No notes anymore 🤷‍♂️

@Robinlovelace (Collaborator) commented:

Yeah, that's a fair reason for using the .data syntax.

I think messages like this

  spod_store_etags: no visible binding for global variable ‘downloaded’

can be resolved with utils::globalVariables("downloaded") somewhere in the package code, as outlined here: https://forum.posit.co/t/how-to-solve-no-visible-binding-for-global-variable-note/28887/2, but in the very next post the .data syntax is recommended. I was just trying to understand the reasoning.
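For completeness, a minimal sketch of that alternative, typically kept in a file such as R/globals.R; whether it is preferable to the .data pronoun is exactly the question discussed above:

# Declare non-standard-evaluation column names so that R CMD check does not
# flag them as undefined global variables.
utils::globalVariables(c("downloaded", "target_url"))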

@e-kotov e-kotov merged commit 9d0ac9d into main Jun 13, 2025
5 checks passed
@e-kotov e-kotov deleted the s3-metadata branch June 13, 2025 17:07