Use NextMarker instead of tail() to find NextMarker #414

cboettig · 2022-03-31T17:27:25Z

Previously, function erroneously assumed that XML response always had <Content> elements listed last, and merely used tail() to get the Key of the last element for use as NextMarker.

This assumption relying on ordering rather than explicitly using element names fails when clients (Redhat's CEPH S3 interface, an open source product widely used in research data centers) return XML that lists metadata fields. Moreover, the S3 response actually provides a field called NextMarker for this very purpose, which ought to be used instead of attempting to find the last key. My pull request simply updates the code to use r$NextMarker to find the NextMarker instead of using tail(r, 1)[["Contents"]][["Key"]])

Please ensure the following before submitting a PR:

if suggesting code changes or improvements, open an issue first
for all but trivial changes (e.g., typo fixes), add your name to DESCRIPTION (already a contributor)
for all but trivial changes (e.g., typo fixes), documentation your change in NEWS.md with a parenthetical reference to the issue number being addressed
if changing documentation, edit files in /R not /man and run devtools::document() to update documentation
add code or new test files to /tests for any new functionality or bug fix
make sure R CMD check runs without error before submitting the PR

Previously, function erroneously assumed that XML response always had <Content> elements listed last, and merely used `tail()` to get the Key of the last element for use as NextMarker. This assumption relying on ordering rather than explicitly using element names fails when clients (CEPH S3 interface) return XML that lists metadata fields. Moreover, the S3 response actually provides a field called NextMarker for this very purpose, which ought to be used instead of attempting to find the last key. Later, the code again used `tail` erroneously depending on order, rather than merely copying over the NextMarker metadata field onto the expanding concatanated list.

raymondben · 2025-08-01T04:23:20Z

@s-u any chance this PR can be merged please? The problem still exists as described and @cboettig's solution (this PR) works on the data sets I am using from Copernicus Marine. Without this PR, get_bucket puts itself into an infinite loop (tail(r, 1)[["Contents"]][["Key"]] returns NULL and so rather than paging through results, the loop ends up downloading the first 1000 records over and over again). Thanks!

s-u · 2025-08-02T15:22:28Z

@raymondben Sure - but it would help if 1) there was some information on what the problem is (the above talks about the code but not actually how to trigger the problem/where this matters) and how it can be replicated and 2) confirm that the PR works and doesn't break anything else? It seems that you may have the material for both.

raymondben · 2025-08-04T03:26:49Z

Thanks @s-u - fair points, and yes I can go some way towards addressing those. Thanks, stand by ...

raymondben · 2025-08-06T00:27:31Z

Here is a reproducible example:

lx <- aws.s3::get_bucket_df(bucket = "mdl-native-07", prefix = "native/SEALEVEL_GLO_PHY_L4_MY_008_047/cmems_obs-sl_glo_phy-ssh_my_allsat-l4-duacs-0.125deg_P1D_202411/", region = "", base_url = "s3.waw3-1.cloudferro.com", max = 2000)

all.equal(lx[1:1000, ], lx[1001:2000, ], check.attributes = FALSE)
[1] TRUE

sum(duplicated(lx$Key))
[1] 1000

Instead of listing the first 2000 objects, we are getting two copies of the first 1000 objects. This happens because the current get_bucket code uses the last Key value in a page of results (marker = tail(r, 1)[["Contents"]][["Key"]]) to identify the starting point for the request for the next page of results (https://github.com/snystrom/aws.s3/blob/master/R/get_bucket.R#L56).

But in this example, that Key value is NULL:

r <- aws.s3:::parse_aws_s3_response(
  aws.s3::get_bucket(bucket = "mdl-native-07", prefix = "native/SEALEVEL_GLO_PHY_L4_MY_008_047/cmems_obs-sl_glo_phy-ssh_my_allsat-l4-duacs-0.125deg_P1D_202411/", region = "", base_url = "s3.waw3-1.cloudferro.com", max = 2000, parse_response = FALSE))

tail(r, 1)[["Contents"]][["Key"]]
NULL

Instead we can use the NextMarker key for this purpose, see https://docs.aws.amazon.com/AmazonS3/latest/API/API_ListObjects.html#API_ListObjects_ResponseSyntax:

NextMarker

When the response is truncated (the IsTruncated element value in the response is true), you can use the key name in this field as the marker parameter in the subsequent request to get the next set of objects. Amazon S3 lists objects in alphabetical order.

Note

This element is returned only if you have the delimiter request parameter specified. If the response does not include the NextMarker element and it is truncated, you can use the value of the last Key element in the response as the marker parameter in the subsequent request to get the next set of object keys.

In our above example:

r$NextMarker
[1] "native/SEALEVEL_GLO_PHY_L4_MY_008_047/cmems_obs-sl_glo_phy-ssh_my_allsat-l4-duacs-0.125deg_P1D_202411/1995/09/dt_global_allsat_phy_l4_19950927_20241015.nc"

Which solves the problem in that case. But (and see the note in the doc snippet above) the NextMarker is not always returned. See for example results from a different provider:

r <- aws.s3:::parse_aws_s3_response(
  aws.s3::get_bucket(bucket = "", prefix = "", region = "", base_url = "noaa-dcdb-bathymetry-pds.s3.amazonaws.com", max = 2000, parse_response = FALSE))

r$NextMarker
NULL

But the existing aws.s3::get_bucket code works in this case:

tail(r, 1)[["Contents"]][["Key"]]
[1] "csb/csv/2017/04/26/20170426115438456274_dca9b61b-9864-4da7-a713-760ff8c0658f_pointData.csv"

So I would suggest modifying this PR slightly, so that the existing https://github.com/snystrom/aws.s3/blob/master/R/get_bucket.R#L56 is changed from

marker = tail(r, 1)[["Contents"]][["Key"]]

to

marker = if (!is.null(tail(r, 1)[["Contents"]][["Key"]])) tail(r, 1)[["Contents"]][["Key"]] else r$NextMarker

(i.e. retain the existing behaviour when it works, otherwise use r$NextMarker. That should ensure that nothing that currently works gets broken by this change.)

s-u · 2025-08-18T03:05:44Z

@raymondben many thanks for providing the examples! The core issue was that tail(r,1) is a bad idea since we only need the last key so it only worked if the last entry was Contents which is not guaranteed, so the correct way is to use tail(r[names(r) == "Contents"], 1)[["Contents"]][["Key"]] instead. That one works in all cases, but just for consistency I have added the NextMarker support as well if present. FWIW in both your examples the last key approach works, but I guess it doesn't hurt to use NextMarker if it's present - it does match the last key so it is consistent.

s-u · 2025-08-18T04:33:48Z

There is one more issue:

aws.s3/R/get_bucket.R

Line 59 in 3e635ab

new_r <- c(r, tail(extra, -5))

is wrong as well since it assumes exact presence/absence of leading fields - which is not as expected in the NextToken case - will look at it tomorrow

…related to #414)

s-u · 2025-08-18T22:59:10Z

That one should be fixed now as well.

raymondben · 2025-08-18T23:50:10Z

Champion! Thanks.

cboettig added 2 commits March 31, 2022 17:12

update NEWS

c710013

raymondben mentioned this pull request Aug 6, 2025

Misplaced parameter in cms_list_native_files? pepijn-devries/CopernicusMarine#65

Closed

s-u closed this in 3e635ab Aug 18, 2025

s-u added a commit that referenced this pull request Aug 18, 2025

fix incremental pulls in get_bucket to not assume exact field order (…

b88b976

…related to #414)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Use NextMarker instead of tail() to find NextMarker #414

Use NextMarker instead of tail() to find NextMarker #414

Uh oh!

cboettig commented Mar 31, 2022

Uh oh!

raymondben commented Aug 1, 2025

Uh oh!

s-u commented Aug 2, 2025

Uh oh!

raymondben commented Aug 4, 2025

Uh oh!

raymondben commented Aug 6, 2025

Uh oh!

s-u commented Aug 18, 2025 •

edited

Loading

Uh oh!

s-u commented Aug 18, 2025

Uh oh!

s-u commented Aug 18, 2025

Uh oh!

raymondben commented Aug 18, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Use NextMarker instead of tail() to find NextMarker #414

Use NextMarker instead of tail() to find NextMarker #414

Uh oh!

Conversation

cboettig commented Mar 31, 2022

Uh oh!

raymondben commented Aug 1, 2025

Uh oh!

s-u commented Aug 2, 2025

Uh oh!

raymondben commented Aug 4, 2025

Uh oh!

raymondben commented Aug 6, 2025

Uh oh!

s-u commented Aug 18, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

s-u commented Aug 18, 2025

Uh oh!

s-u commented Aug 18, 2025

Uh oh!

raymondben commented Aug 18, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

s-u commented Aug 18, 2025 •

edited

Loading