Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rebootstrap support - Phase 1 #5892

Draft
wants to merge 13 commits into
base: main
Choose a base branch
from
Draft

Conversation

kfox1111
Copy link
Contributor

@kfox1111 kfox1111 commented Feb 22, 2025

Pull Request check list

  • Commit conforms to CONTRIBUTING.md?
  • Proper tests/regressions included?
  • Documentation updated?

Description of change
Allows the spire-agent to rebootstrap itself if the server ca goes invalid.

There will be a configurable timeout between the first seen invalid server certificate, to when it triggers a rebootstrap. Recovery of the server cert/ca during this time will prevent unneeded/undesired rebootstrapping of agents.

Which issue this PR fixes
partially fixes: #4624
fixes: #5893

Why still draft

  • configuration checks
    • doesnt work with non reattestable attestors?
  • new flag to do just rebootstrap without resetting svid?
  • no tests

Copy link
Member

@amartinezfayo amartinezfayo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, @kfox1111, for opening this PR!

Having three different flags (retry_bootstrap, rebootstrap, and rebootstrap_delay) to control bootstrapping/re-bootstrapping behavior seems confusing. I believe we can avoid having the rebootstrap_delay flag by choosing a safe default value that works well in most scenarios. This flag is particularly problematic because its validity depends on the rebootstrap flag. Instead of exposing it as a user-configurable option, SPIRE should infer an appropriate delay, unless we determine that tweaking this is essential for users.

We discussed making re-bootstrapping with retries the default behavior in SPIRE Agent during our maintainer's sync. One concern raised was that, particularly in Kubernetes environments, it is easier to detect an agent in a crash loop than to monitor logs for failures. However, given that we use the "WithMaxElapsedTime" backoff strategy (with a max elapsed time of one minute), agents would still crash after a minute (which seems like a relatively short duration to me).

I think that ideally, a single flag should enable or disable automatic re-bootstrapping retries, both at startup and when the agent stops recognizing the server’s authority. To move toward that goal, I propose the following plan:

In 1.12.0:

  • Introduce a new flag, that may be called automatic_rebootstrapping. When enabled, the agent retries bootstrapping at startup and re-bootstraps if it no longer recognizes the server’s authority when syncing authorized entries. We could set a maximum elapsed time before giving up (perhaps 5–10 minutes).
  • Deprecate the retry_bootstrap setting in favor of the new automatic_rebootstrapping setting, logging a warning when retry_bootstrap is used.

In 1.13.0:

  • Remove the retry_bootstrap setting.

What do you think?

@sorindumitru, I know you have thoughts on this, and I’d love to hear your perspective.

@sorindumitru
Copy link
Collaborator

@sorindumitru, I know you have thoughts on this, and I’d love to hear your perspective.

I think this all sounds good. We should also think about how to handle non-reattestable agents. There's two situations there:

  • The agent doesn't have an identity yet. We should retry, we don't even know if it's reattestable or not at this point.
  • The agent already has an identity. Retrying isn't likely to help in this case, but maybe we should retry anyway in case the agent gets evicted?

There's also some cases where we decide to remove the cached agent SVID. We should make sure we don't do that for non-reattestable agents.

@kfox1111
Copy link
Contributor Author

Hi @amartinezfayo. Thanks for the review.

Totally agree on minimizing the flags. I was surprised to see retry_bootstrap in there while coding, and filed #5896 to discuss removing it in the longer term.

I tried adding just one option, rebootstrap_delay, that enables rebootstrapping. But the cli command line parser library used can't make duration's optional, so had to add a second flag to gate it. Maybe there is a way to do that, that I missed? or maybe we don't support rebootstrap_delay on the cli? config file only would work for me.

As for picking a duration, I think that may be really hard to do for the user. Some folks may want it as an absolute last resort. more then a weekend so a human is involved?) or some may want it extremely fast. First time you see a failure, immediately reboostrap. And probably a lot in between.

There's levels of safety in the system:

  1. server cert in the established trust bundle - very very unlikely there is a security issue
  2. bootstrapping - happens exceedly infrequently, and often with a sysadmin involved in the process, so there is actively looking for funny business.
  3. rebootstrapping - happens potentially at any time. harder to protect against badness. Kind of up to each organization to decide the tradeoff between system unavailability and risk of recovery from a compromised server I think.

As for k8s support, I totally get that. Doing most things in k8s myself. Though I think the same kind of thing can be handled by a k8s readiness probe. Mark the pods unready if not attested/reattested. Then they can be alerted on via normal k8s means (pods stuck unready for more then n minutes) but still leave control in the hands of spire-agent for when to (re)attest. It may require some changes between liveness and readyness probes too. (not ready for a while doesnt mean liveness should fire and kill it.)

To get this working well, I'm guessing a timeout with WithMaxElapsedTime cant be as short as a minute. It would require multiple calls to the external url for trust bundle fetching in short order, and with lots of agents doing it, could lead to a thundering herd issue. I think the readiness thing could cover making the MaxElapsedTime much longer.

@sorindumitru, thank you too for the review

There's two situations there:

  • The agent doesn't have an identity yet. We should retry, we don't even know if it's reattestable or not at this point.
  • The agent already has an identity. Retrying isn't likely to help in this case, but maybe we should retry anyway in case the agent gets evicted?

For 1, the pr here does that.

For 2, I'm on the fence. If the server trust is broken, the agent is broken. Reboostrapping might not fully work, but still maybe better then just throwing x509 errors? I think there are probably some ways to Rebootstrap, then Reattest with the node cert anyway, with some extra code. Maybe that discussion should wait for a later phase though? A little worried this PR is already complicated enough without that kind of thing. Maybe we just ignore the rebootstrap flag if reattestable == false for now, or throw an error on start in that case and revist later?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

trust_bundle_url contact on agent startup spire-agent: allow automatic re-bootstrap
3 participants