Skip to content

Conversation

cboettig
Copy link
Contributor

Previously, function erroneously assumed that XML response always had <Content> elements listed last, and merely used tail() to get the Key of the last element for use as NextMarker.

This assumption relying on ordering rather than explicitly using element names fails when clients (Redhat's CEPH S3 interface, an open source product widely used in research data centers) return XML that lists metadata fields. Moreover, the S3 response actually provides a field called NextMarker for this very purpose, which ought to be used instead of attempting to find the last key. My pull request simply updates the code to use r$NextMarker to find the NextMarker instead of using tail(r, 1)[["Contents"]][["Key"]])

Please ensure the following before submitting a PR:

  • if suggesting code changes or improvements, open an issue first
  • for all but trivial changes (e.g., typo fixes), add your name to DESCRIPTION (already a contributor)
  • for all but trivial changes (e.g., typo fixes), documentation your change in NEWS.md with a parenthetical reference to the issue number being addressed
  • if changing documentation, edit files in /R not /man and run devtools::document() to update documentation
  • add code or new test files to /tests for any new functionality or bug fix
  • make sure R CMD check runs without error before submitting the PR

Previously, function erroneously assumed that XML response always had <Content> elements listed last, and merely used `tail()` to get the Key of the last element for use as NextMarker.

This assumption relying on ordering rather than explicitly using element names fails when clients (CEPH S3 interface) return XML that lists metadata fields.  Moreover, the S3 response actually provides a field called NextMarker for this very purpose, which ought to be used instead of attempting to find the last key.

Later, the code again used `tail` erroneously depending on order, rather than merely copying over the NextMarker metadata field onto the expanding concatanated list.
@raymondben
Copy link

@s-u any chance this PR can be merged please? The problem still exists as described and @cboettig's solution (this PR) works on the data sets I am using from Copernicus Marine. Without this PR, get_bucket puts itself into an infinite loop (tail(r, 1)[["Contents"]][["Key"]] returns NULL and so rather than paging through results, the loop ends up downloading the first 1000 records over and over again). Thanks!

@s-u
Copy link
Member

s-u commented Aug 2, 2025

@raymondben Sure - but it would help if 1) there was some information on what the problem is (the above talks about the code but not actually how to trigger the problem/where this matters) and how it can be replicated and 2) confirm that the PR works and doesn't break anything else? It seems that you may have the material for both.

@raymondben
Copy link

Thanks @s-u - fair points, and yes I can go some way towards addressing those. Thanks, stand by ...

@raymondben
Copy link

Here is a reproducible example:

lx <- aws.s3::get_bucket_df(bucket = "mdl-native-07", prefix = "native/SEALEVEL_GLO_PHY_L4_MY_008_047/cmems_obs-sl_glo_phy-ssh_my_allsat-l4-duacs-0.125deg_P1D_202411/", region = "", base_url = "s3.waw3-1.cloudferro.com", max = 2000)

all.equal(lx[1:1000, ], lx[1001:2000, ], check.attributes = FALSE)
[1] TRUE

sum(duplicated(lx$Key))
[1] 1000

Instead of listing the first 2000 objects, we are getting two copies of the first 1000 objects. This happens because the current get_bucket code uses the last Key value in a page of results (marker = tail(r, 1)[["Contents"]][["Key"]]) to identify the starting point for the request for the next page of results (https://github.com/snystrom/aws.s3/blob/master/R/get_bucket.R#L56).

But in this example, that Key value is NULL:

r <- aws.s3:::parse_aws_s3_response(
  aws.s3::get_bucket(bucket = "mdl-native-07", prefix = "native/SEALEVEL_GLO_PHY_L4_MY_008_047/cmems_obs-sl_glo_phy-ssh_my_allsat-l4-duacs-0.125deg_P1D_202411/", region = "", base_url = "s3.waw3-1.cloudferro.com", max = 2000, parse_response = FALSE))

tail(r, 1)[["Contents"]][["Key"]]
NULL

Instead we can use the NextMarker key for this purpose, see https://docs.aws.amazon.com/AmazonS3/latest/API/API_ListObjects.html#API_ListObjects_ResponseSyntax:

NextMarker

When the response is truncated (the IsTruncated element value in the response is true), you can use the key name in this field as the marker parameter in the subsequent request to get the next set of objects. Amazon S3 lists objects in alphabetical order.

Note

This element is returned only if you have the delimiter request parameter specified. If the response does not include the NextMarker element and it is truncated, you can use the value of the last Key element in the response as the marker parameter in the subsequent request to get the next set of object keys.

In our above example:

r$NextMarker
[1] "native/SEALEVEL_GLO_PHY_L4_MY_008_047/cmems_obs-sl_glo_phy-ssh_my_allsat-l4-duacs-0.125deg_P1D_202411/1995/09/dt_global_allsat_phy_l4_19950927_20241015.nc"

Which solves the problem in that case. But (and see the note in the doc snippet above) the NextMarker is not always returned. See for example results from a different provider:

r <- aws.s3:::parse_aws_s3_response(
  aws.s3::get_bucket(bucket = "", prefix = "", region = "", base_url = "noaa-dcdb-bathymetry-pds.s3.amazonaws.com", max = 2000, parse_response = FALSE))

r$NextMarker
NULL

But the existing aws.s3::get_bucket code works in this case:

tail(r, 1)[["Contents"]][["Key"]]
[1] "csb/csv/2017/04/26/20170426115438456274_dca9b61b-9864-4da7-a713-760ff8c0658f_pointData.csv"

So I would suggest modifying this PR slightly, so that the existing https://github.com/snystrom/aws.s3/blob/master/R/get_bucket.R#L56 is changed from

marker = tail(r, 1)[["Contents"]][["Key"]]

to

marker = if (!is.null(tail(r, 1)[["Contents"]][["Key"]])) tail(r, 1)[["Contents"]][["Key"]] else r$NextMarker

(i.e. retain the existing behaviour when it works, otherwise use r$NextMarker. That should ensure that nothing that currently works gets broken by this change.)

@s-u
Copy link
Member

s-u commented Aug 18, 2025

@raymondben many thanks for providing the examples! The core issue was that tail(r,1) is a bad idea since we only need the last key so it only worked if the last entry was Contents which is not guaranteed, so the correct way is to use tail(r[names(r) == "Contents"], 1)[["Contents"]][["Key"]] instead. That one works in all cases, but just for consistency I have added the NextMarker support as well if present. FWIW in both your examples the last key approach works, but I guess it doesn't hurt to use NextMarker if it's present - it does match the last key so it is consistent.

@s-u
Copy link
Member

s-u commented Aug 18, 2025

There is one more issue:

new_r <- c(r, tail(extra, -5))
is wrong as well since it assumes exact presence/absence of leading fields - which is not as expected in the NextToken case - will look at it tomorrow

s-u added a commit that referenced this pull request Aug 18, 2025
@s-u
Copy link
Member

s-u commented Aug 18, 2025

That one should be fixed now as well.

@raymondben
Copy link

Champion! Thanks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants