Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

error -> warn on archive access issues #4508

Merged

Conversation

ThomasBrady
Copy link
Contributor

@ThomasBrady ThomasBrady force-pushed the improve-history-download-logs-and-docs branch from 09b9c13 to f29b9f1 Compare October 21, 2024 18:49
@anupsdf anupsdf added this pull request to the merge queue Oct 21, 2024
Merged via the queue into stellar:master with commit 3c15129 Oct 21, 2024
13 checks passed
CLOG_ERROR(History,
"Failed to download ledger checkpoint {} from archive {}",
mFt->baseName_gz(), mArchive->getName());
CLOG_WARNING(History,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't this remain an ERROR?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Marking it as ERROR might confuse the load operator, isn't it? Judging by the log here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this depends on the use case: CheckSingleLedgerHeaderWork is used for health reporting of the archives, and is invoked after checkpoint is supposed to be published. Which means that if something goes wrong, this should catch operator's attention (this is different from the other use case - online catchup - which optimistically queries potentially unpublished files).

We need to understand these failures better, and where it makes sense to warn. When doing archive self-checks, I do not think a warning is appropriate.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Raised a PR to revert this change #4511

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we try to understand what happened with the errors that prompted this fix though? This will help improve operator experience when it comes to archives. A few ideas:

  • CheckSingleLedgerHeaderWork is supposed to be called after checkpoint publishing started, but could we confirm that? If it doesn't behave as expected, we should fix it to avoid confusion among operators.
  • If the error in the linked issue is legit, we could verify that by curling the archive in question directly to verify that the file is indeed missing.
  • Assuming the archive is actually corrupt, i believe it should be ERROR but we should also indicate in the log message that core will try other archives, so this error does not necessarily merit a configuration change at the node level (but a reasonable action would be, for example, notifying operators running problematic archives that there are gaps)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CheckSingleLedgerHeaderWork is supposed to be called after checkpoint publishing started, but could we confirm that? If it doesn't behave as expected, we should fix it to avoid confusion among operators.

I believe that is correct. Application::scheduleSelfCheck(bool waitUntilNextCheckpoint) calls HistoryArchiveManager::getCheckLedgerHeaderWork which creates a CheckSingleLedgerHeaderWork for each archive. If waitUntilNextCheckpoint is true, this work execution is delayed until a 2nd full checkpoint period after the next checkpoint publication.

scheduleSelfCheck(waitUntilNextCheckpoint = true) is called when startServices executes and the automatic self check period is non-zero. scheduleSelfCheck sets a timer to re-execute itself every AUTOMATIC_SELF_CHECK_PERIOD seconds with waitUntilNextCheckpoint=true.

The only way I can see scheduleSelfCheck(waitUntilNextCheckpoint = false) is via the command with the stellar-core self-check command.

If the error in the linked issue is legit, we could verify that by curling the archive in question directly to verify that the file is indeed missing.

When I try to curl the archive, I get AccessDenied, specifically:

curl -sf https://stellar-archive-3-lobstr.s3.amazonaws.com/ledger/03/35/ae/ledger-0335aeff.xdr.gz -o foo.out
Fails silently (due to -sf).

curl https://stellar-archive-3-lobstr.s3.amazonaws.com/ledger/03/35/ae/ledger-0335aeff.xdr.gz -o foo.out
Writes an XML file with the access denied error:

<?xml version="1.0" encoding="UTF-8"?>
<Error><Code>AccessDenied</Code><Message>Access Denied</Message><RequestId>8Q8SJJ6ZFBTF51XV</RequestId><HostId>IFxG4XdLQHnPqLpb9va823UAed1W4EJUWnVIePVIlKuEC9sLUFgJ7PUyHvU93Hd3Ln9JmwZ5o58=</HostId></Error>

I can curl it from SDF's archives without issues.

Assuming the archive is actually corrupt, i believe it should be ERROR but we should also indicate in the log message that core will try other archives, so this error does not necessarily merit a configuration change at the node level (but a reasonable action would be, for example, notifying operators running problematic archives that there are gaps)

Yeah I agree this does seem to be a corrupt archive issue rather than a timing or transient network issue. I'll augment the error message with the explanation and assurance that core will continue to try other archives.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the error message to:

"Failed to download ledger checkpoint {} from archive {}. This may "
            "be due to transient network issues, or the archive may be missing "
            "the checkpoint. stellar-core will try to download the file from "
            "another archive. This error *does not* necessarily require a "
            "configuration change to remove the archive. If you repeatedly see "
            "this error, please notify the operator of the archive that you "
            "are observing gaps."

in the revert PR.

Copy link
Contributor

@marta-lokhova marta-lokhova Oct 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it just that the archive URL is wrong? Stellarbeat points to https://archive.v3.stellar.lobstr.co

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah yeah you are right, I can download that without issues. 🙃 Looks like FT has the wrong archive for lobstr in their config, how can we let them know?

I'll also add a note in the error to double check they have the right URL.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@marta-lokhova can let FT know about the incorrect URL

github-merge-queue bot pushed a commit that referenced this pull request Oct 23, 2024
# Description
Reverts downgrading of error to warning in `CheckSingleLedgerHeaderWork`

#4508 (comment)

<!---

Describe what this pull request does, which issue it's resolving
(usually applicable for code changes).

--->

# Checklist
- [ ] Reviewed the
[contributing](https://github.com/stellar/stellar-core/blob/master/CONTRIBUTING.md#submitting-changes)
document
- [ ] Rebased on top of master (no merge commits)
- [ ] Ran `clang-format` v8.0.0 (via `make format` or the Visual Studio
extension)
- [ ] Compiles
- [ ] Ran all tests
- [ ] If change impacts performance, include supporting evidence per the
[performance
document](https://github.com/stellar/stellar-core/blob/master/performance-eval/performance-eval.md)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants