error -> warn on archive access issues #4508

ThomasBrady · 2024-10-18T21:10:12Z

Description

Resolves https://github.com/stellar/stellar-core-internal/issues/295

marta-lokhova · 2024-10-21T20:15:26Z

src/historywork/CheckSingleLedgerHeaderWork.cpp

-        CLOG_ERROR(History,
-                   "Failed to download ledger checkpoint {} from archive {}",
-                   mFt->baseName_gz(), mArchive->getName());
+        CLOG_WARNING(History,


shouldn't this remain an ERROR?

Marking it as ERROR might confuse the load operator, isn't it? Judging by the log here.

this depends on the use case: CheckSingleLedgerHeaderWork is used for health reporting of the archives, and is invoked after checkpoint is supposed to be published. Which means that if something goes wrong, this should catch operator's attention (this is different from the other use case - online catchup - which optimistically queries potentially unpublished files).

We need to understand these failures better, and where it makes sense to warn. When doing archive self-checks, I do not think a warning is appropriate.

Raised a PR to revert this change #4511

Could we try to understand what happened with the errors that prompted this fix though? This will help improve operator experience when it comes to archives. A few ideas:

CheckSingleLedgerHeaderWork is supposed to be called after checkpoint publishing started, but could we confirm that? If it doesn't behave as expected, we should fix it to avoid confusion among operators.

If the error in the linked issue is legit, we could verify that by curling the archive in question directly to verify that the file is indeed missing.

Assuming the archive is actually corrupt, i believe it should be ERROR but we should also indicate in the log message that core will try other archives, so this error does not necessarily merit a configuration change at the node level (but a reasonable action would be, for example, notifying operators running problematic archives that there are gaps)

CheckSingleLedgerHeaderWork is supposed to be called after checkpoint publishing started, but could we confirm that? If it doesn't behave as expected, we should fix it to avoid confusion among operators.

I believe that is correct. Application::scheduleSelfCheck(bool waitUntilNextCheckpoint) calls HistoryArchiveManager::getCheckLedgerHeaderWork which creates a CheckSingleLedgerHeaderWork for each archive. If waitUntilNextCheckpoint is true, this work execution is delayed until a 2nd full checkpoint period after the next checkpoint publication.

scheduleSelfCheck(waitUntilNextCheckpoint = true) is called when startServices executes and the automatic self check period is non-zero. scheduleSelfCheck sets a timer to re-execute itself every AUTOMATIC_SELF_CHECK_PERIOD seconds with waitUntilNextCheckpoint=true.

The only way I can see scheduleSelfCheck(waitUntilNextCheckpoint = false) is via the command with the stellar-core self-check command.

If the error in the linked issue is legit, we could verify that by curling the archive in question directly to verify that the file is indeed missing.

When I try to curl the archive, I get AccessDenied, specifically:

curl -sf https://stellar-archive-3-lobstr.s3.amazonaws.com/ledger/03/35/ae/ledger-0335aeff.xdr.gz -o foo.out
Fails silently (due to -sf).

curl https://stellar-archive-3-lobstr.s3.amazonaws.com/ledger/03/35/ae/ledger-0335aeff.xdr.gz -o foo.out
Writes an XML file with the access denied error:

<?xml version="1.0" encoding="UTF-8"?> <Error><Code>AccessDenied</Code><Message>Access Denied</Message><RequestId>8Q8SJJ6ZFBTF51XV</RequestId><HostId>IFxG4XdLQHnPqLpb9va823UAed1W4EJUWnVIePVIlKuEC9sLUFgJ7PUyHvU93Hd3Ln9JmwZ5o58=</HostId></Error>

I can curl it from SDF's archives without issues.

Assuming the archive is actually corrupt, i believe it should be ERROR but we should also indicate in the log message that core will try other archives, so this error does not necessarily merit a configuration change at the node level (but a reasonable action would be, for example, notifying operators running problematic archives that there are gaps)

Yeah I agree this does seem to be a corrupt archive issue rather than a timing or transient network issue. I'll augment the error message with the explanation and assurance that core will continue to try other archives.

Updated the error message to:

"Failed to download ledger checkpoint {} from archive {}. This may " "be due to transient network issues, or the archive may be missing " "the checkpoint. stellar-core will try to download the file from " "another archive. This error *does not* necessarily require a " "configuration change to remove the archive. If you repeatedly see " "this error, please notify the operator of the archive that you " "are observing gaps."

in the revert PR.

Is it just that the archive URL is wrong? Stellarbeat points to https://archive.v3.stellar.lobstr.co

Ah yeah you are right, I can download that without issues. 🙃 Looks like FT has the wrong archive for lobstr in their config, how can we let them know?

I'll also add a note in the error to double check they have the right URL.

@marta-lokhova can let FT know about the incorrect URL

# Description Reverts downgrading of error to warning in `CheckSingleLedgerHeaderWork` #4508 (comment)  # Checklist - [ ] Reviewed the [contributing](https://github.com/stellar/stellar-core/blob/master/CONTRIBUTING.md#submitting-changes) document - [ ] Rebased on top of master (no merge commits) - [ ] Ran `clang-format` v8.0.0 (via `make format` or the Visual Studio extension) - [ ] Compiles - [ ] Ran all tests - [ ] If change impacts performance, include supporting evidence per the [performance document](https://github.com/stellar/stellar-core/blob/master/performance-eval/performance-eval.md)

error -> Warn on archive access issues

f29b9f1

ThomasBrady force-pushed the improve-history-download-logs-and-docs branch from 09b9c13 to f29b9f1 Compare October 21, 2024 18:49

ThomasBrady requested a review from anupsdf October 21, 2024 18:52

anupsdf approved these changes Oct 21, 2024

View reviewed changes

anupsdf enabled auto-merge October 21, 2024 19:16

anupsdf added this pull request to the merge queue Oct 21, 2024

Merged via the queue into stellar:master with commit 3c15129 Oct 21, 2024
13 checks passed

marta-lokhova reviewed Oct 21, 2024

View reviewed changes

ThomasBrady mentioned this pull request Oct 21, 2024

Revert downgrading of log error->warn->error #4511

Merged

6 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

error -> warn on archive access issues #4508

error -> warn on archive access issues #4508

ThomasBrady commented Oct 18, 2024

marta-lokhova Oct 21, 2024

anupsdf Oct 21, 2024

marta-lokhova Oct 21, 2024

ThomasBrady Oct 21, 2024

marta-lokhova Oct 22, 2024

ThomasBrady Oct 22, 2024

ThomasBrady Oct 22, 2024

marta-lokhova Oct 22, 2024 •

edited

Loading

ThomasBrady Oct 22, 2024

anupsdf Oct 23, 2024

error -> warn on archive access issues #4508

error -> warn on archive access issues #4508

Conversation

ThomasBrady commented Oct 18, 2024

Description

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

marta-lokhova Oct 22, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

marta-lokhova Oct 22, 2024 •

edited

Loading