Rebootstrap support - Phase 1 #5892
Conversation
Signed-off-by: Kevin Fox <[email protected]>
Thank you, @kfox1111, for opening this PR!
Having three different flags (retry_bootstrap, rebootstrap, and rebootstrap_delay) to control bootstrapping/re-bootstrapping behavior seems confusing. I believe we can avoid having the rebootstrap_delay flag by choosing a safe default value that works well in most scenarios. This flag is particularly problematic because its validity depends on the rebootstrap flag. Instead of exposing it as a user-configurable option, SPIRE should infer an appropriate delay, unless we determine that tweaking this is essential for users.
We discussed making re-bootstrapping with retries the default behavior in SPIRE Agent during our maintainers' sync. One concern raised was that, particularly in Kubernetes environments, it is easier to detect an agent in a crash loop than to monitor logs for failures. However, given that we use the "WithMaxElapsedTime" backoff strategy (with a max elapsed time of one minute), agents would still crash after a minute (which seems like a relatively short duration to me).
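For context on that one-minute cap, here is a minimal sketch of the behavior, assuming the widely used github.com/cenkalti/backoff/v4 library (SPIRE's internal retry wiring may differ): with MaxElapsedTime set to one minute, the retry loop, and therefore the agent, gives up shortly after the first failure.

```go
package main

import (
	"errors"
	"fmt"
	"time"

	"github.com/cenkalti/backoff/v4"
)

func main() {
	// Hypothetical stand-in for the agent's bootstrap/attestation attempt.
	attest := func() error {
		return errors.New("server not trusted or unreachable")
	}

	b := backoff.NewExponentialBackOff()
	// With a one-minute cap on total elapsed time, Retry stops retrying
	// (and the agent process would exit) roughly a minute after the first
	// failure, which is the crash-loop behavior discussed above.
	b.MaxElapsedTime = time.Minute

	if err := backoff.Retry(attest, b); err != nil {
		fmt.Println("bootstrap failed, giving up:", err)
	}
}
```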
I think that ideally, a single flag should enable or disable automatic re-bootstrapping retries, both at startup and when the agent stops recognizing the server’s authority. To move toward that goal, I propose the following plan:
In 1.12.0:
- Introduce a new flag, possibly called automatic_rebootstrapping. When enabled, the agent retries bootstrapping at startup and re-bootstraps if it no longer recognizes the server’s authority when syncing authorized entries. We could set a maximum elapsed time before giving up (perhaps 5–10 minutes).
- Deprecate the retry_bootstrap setting in favor of the new automatic_rebootstrapping setting, logging a warning when retry_bootstrap is used (a rough sketch of this handling follows the plan).
In 1.13.0:
- Remove the retry_bootstrap setting.
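To make the deprecation step concrete, here is a rough sketch of how the agent's config handling could prefer the new setting and warn on the old one. The struct fields, HCL tags, and function names are assumptions for illustration, not SPIRE's actual config code.

```go
package main

import (
	"fmt"
	"log"
)

// Hypothetical config fields; SPIRE's real agent config differs.
type agentConfig struct {
	RetryBootstrap           *bool `hcl:"retry_bootstrap"`
	AutomaticRebootstrapping *bool `hcl:"automatic_rebootstrapping"`
}

// rebootstrappingEnabled prefers the new setting and logs a deprecation
// warning when only the old one is set.
func rebootstrappingEnabled(c *agentConfig) bool {
	switch {
	case c.AutomaticRebootstrapping != nil:
		return *c.AutomaticRebootstrapping
	case c.RetryBootstrap != nil:
		log.Println("retry_bootstrap is deprecated; use automatic_rebootstrapping instead")
		return *c.RetryBootstrap
	default:
		return false
	}
}

func main() {
	old := true
	fmt.Println(rebootstrappingEnabled(&agentConfig{RetryBootstrap: &old}))
}
```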
What do you think?
@sorindumitru, I know you have thoughts on this, and I’d love to hear your perspective.
I think this all sounds good. We should also think about how to handle non-reattestable agents; there are two situations there.
There are also some cases where we decide to remove the cached agent SVID. We should make sure we don't do that for non-reattestable agents.
Hi @amartinezfayo. Thanks for the review. Totally agree on minimizing the flags. I was surprised to see retry_bootstrap in there while coding, and filed #5896 to discuss removing it in the longer term. I tried adding just one option, rebootstrap_delay, that enables rebootstrapping, but the command-line parser library we use can't make durations optional, so I had to add a second flag to gate it (one possible workaround is sketched below). Maybe there is a way to do that that I missed? Or maybe we don't support rebootstrap_delay on the CLI? Config file only would work for me. As for picking a duration, I think that may be really hard to do for the user. Some folks may want it as an absolute last resort (more than a weekend, so a human is involved?) and some may want it extremely fast: the first time you see a failure, immediately rebootstrap. And probably a lot in between. There are several levels of safety in the system to keep in mind.
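On the optional-duration problem, here is one possible workaround, sketched with the standard library flag package rather than the CLI library SPIRE actually uses: accept the delay as a string whose absence means "disabled" and parse it only when a value was supplied. The flag name is just an example.

```go
package main

import (
	"flag"
	"fmt"
	"time"
)

func main() {
	// Empty string means the flag was not supplied, i.e. rebootstrapping
	// stays disabled, so no separate boolean gate is needed.
	raw := flag.String("rebootstrapDelay", "", "delay before rebootstrapping; empty disables it")
	flag.Parse()

	if *raw == "" {
		fmt.Println("rebootstrapping disabled")
		return
	}
	delay, err := time.ParseDuration(*raw)
	if err != nil {
		fmt.Println("invalid rebootstrapDelay:", err)
		return
	}
	fmt.Println("rebootstrapping enabled, delay:", delay)
}
```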
As for k8s support, I totally get that; I do most things in k8s myself. Though I think the same kind of thing can be handled by a k8s readiness probe: mark the pods unready if not attested/reattested. Then they can be alerted on via normal k8s means (pods stuck unready for more than n minutes) but still leave control in the hands of spire-agent for when to (re)attest (a rough sketch of such a readiness endpoint is below). It may require some changes between liveness and readiness probes too (not ready for a while doesn't mean liveness should fire and kill it). To get this working well, I'm guessing a timeout with WithMaxElapsedTime can't be as short as a minute. It would require multiple calls to the external URL for trust bundle fetching in short order, and with lots of agents doing it, could lead to a thundering herd issue. I think the readiness thing could cover making the MaxElapsedTime much longer. @sorindumitru, thank you too for the review.
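The readiness idea could look roughly like this. The endpoints and variable names are assumptions for illustration, not SPIRE's actual health-check implementation: liveness stays green so Kubernetes doesn't kill the pod, while readiness reports whether the agent is currently attested, letting the usual "pod unready for more than n minutes" alerting take over.

```go
package main

import (
	"net/http"
	"sync/atomic"
)

// attested would be flipped by the agent's attestation/sync logic.
var attested atomic.Bool

func main() {
	// Liveness: the process is running; a long stretch of "not attested"
	// should not cause Kubernetes to restart the pod.
	http.HandleFunc("/live", func(w http.ResponseWriter, _ *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	// Readiness: only report ready once (re)attestation has succeeded.
	http.HandleFunc("/ready", func(w http.ResponseWriter, _ *http.Request) {
		if attested.Load() {
			w.WriteHeader(http.StatusOK)
			return
		}
		w.WriteHeader(http.StatusServiceUnavailable)
	})
	http.ListenAndServe(":8080", nil)
}
```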
For 1, the PR here does that. For 2, I'm on the fence. If the server trust is broken, the agent is broken. Rebootstrapping might not fully work, but it's still maybe better than just throwing x509 errors? I think there are probably some ways to rebootstrap, then reattest with the node cert anyway, with some extra code. Maybe that discussion should wait for a later phase though? I'm a little worried this PR is already complicated enough without that kind of thing. Maybe we just ignore the rebootstrap flag if reattestable == false for now, or throw an error on start in that case and revisit later?
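If we go with the "error on start" option for non-reattestable agents, the check could be as small as the following sketch (the option names are hypothetical, not fields from this PR):

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical options; the real agent configuration is richer than this.
type agentOptions struct {
	Rebootstrap  bool
	Reattestable bool
}

// validate fails fast at startup instead of silently ignoring the
// rebootstrap flag for a non-reattestable agent.
func validate(o agentOptions) error {
	if o.Rebootstrap && !o.Reattestable {
		return errors.New("rebootstrap requires a reattestable node attestor")
	}
	return nil
}

func main() {
	fmt.Println(validate(agentOptions{Rebootstrap: true, Reattestable: false}))
}
```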
Pull Request check list
Description of change
Allows the spire-agent to rebootstrap itself if the server CA becomes invalid.
There is a configurable timeout between when an invalid server certificate is first seen and when a rebootstrap is triggered. Recovery of the server cert/CA during this window prevents unneeded/undesired rebootstrapping of agents.
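As an illustration of that delay (the names and structure below are assumptions, not the code in this PR): record when the server certificate first failed to verify, reset if verification recovers, and only trigger a rebootstrap once the failure has persisted longer than the configured delay.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// rebootstrapGate tracks how long the server has been untrusted.
type rebootstrapGate struct {
	delay       time.Duration
	firstFailed time.Time
}

// observe reports whether a rebootstrap should be triggered given the latest
// certificate verification result.
func (g *rebootstrapGate) observe(verifyErr error, now time.Time) bool {
	if verifyErr == nil {
		// Trust recovered before the delay elapsed: reset, no rebootstrap.
		g.firstFailed = time.Time{}
		return false
	}
	if g.firstFailed.IsZero() {
		g.firstFailed = now
	}
	return now.Sub(g.firstFailed) >= g.delay
}

func main() {
	g := &rebootstrapGate{delay: time.Hour}
	now := time.Now()
	fmt.Println(g.observe(errors.New("x509: unknown authority"), now))                  // false: failure just started
	fmt.Println(g.observe(errors.New("x509: unknown authority"), now.Add(2*time.Hour))) // true: failure persisted past the delay
}
```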
Which issue this PR fixes
partially fixes: #4624
fixes: #5893
Why still draft