Rebootstrap support - Phase 1 #5892

kfox1111 · 2025-02-22T14:19:26Z

Pull Request check list

Commit conforms to CONTRIBUTING.md?
Proper tests/regressions included?
Documentation updated?

Description of change
Allows the spire-agent to rebootstrap itself if the server CA becomes invalid.

A configurable timeout runs from the first time an invalid server certificate is seen until a rebootstrap is triggered. If the server cert/CA recovers during this window, agents are not rebootstrapped unnecessarily.

Which issue this PR fixes
partially fixes: #4624
fixes: #5893

Why still draft

- configuration checks
  - doesn't work with non-reattestable attestors?
- new flag to do just rebootstrap without resetting SVID?
- no tests

Signed-off-by: Kevin Fox <[email protected]>

amartinezfayo

Thank you, @kfox1111, for opening this PR!

Having three different flags (retry_bootstrap, rebootstrap, and rebootstrap_delay) to control bootstrapping/re-bootstrapping behavior seems confusing. I believe we can avoid having the rebootstrap_delay flag by choosing a safe default value that works well in most scenarios. This flag is particularly problematic because its validity depends on the rebootstrap flag. Instead of exposing it as a user-configurable option, SPIRE should infer an appropriate delay, unless we determine that tweaking this is essential for users.

We discussed making re-bootstrapping with retries the default behavior in SPIRE Agent during our maintainers' sync. One concern raised was that, particularly in Kubernetes environments, it is easier to detect an agent in a crash loop than to monitor logs for failures. However, given that we use the "WithMaxElapsedTime" backoff strategy (with a max elapsed time of one minute), agents would still crash after a minute (which seems like a relatively short duration to me).

I think that ideally, a single flag should enable or disable automatic re-bootstrapping retries, both at startup and when the agent stops recognizing the server’s authority. To move toward that goal, I propose the following plan:

In 1.12.0:

Introduce a new flag, that may be called automatic_rebootstrapping. When enabled, the agent retries bootstrapping at startup and re-bootstraps if it no longer recognizes the server’s authority when syncing authorized entries. We could set a maximum elapsed time before giving up (perhaps 5–10 minutes).
Deprecate the retry_bootstrap setting in favor of the new automatic_rebootstrapping setting, logging a warning when retry_bootstrap is used.

In 1.13.0:

Remove the retry_bootstrap setting.
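
A rough sketch of how the 1.12.0 step of the plan above could resolve the two settings. The `agentConfig` struct and `resolveAutoRebootstrap` function are hypothetical, and letting the new setting win when both are present is an assumption, not something decided in this thread.

```go
package main

import "fmt"

// agentConfig mirrors only the two settings discussed above. The field
// names are illustrative and are not SPIRE's real configuration struct.
// A nil pointer means the setting was not present in the config.
type agentConfig struct {
	RetryBootstrap           *bool // deprecated under the proposal
	AutomaticRebootstrapping *bool // proposed replacement
}

// resolveAutoRebootstrap returns the effective setting plus any warnings
// to log. Assumption: the new setting takes precedence when both are set.
func resolveAutoRebootstrap(c agentConfig) (enabled bool, warnings []string) {
	if c.RetryBootstrap != nil {
		warnings = append(warnings,
			"retry_bootstrap is deprecated; use automatic_rebootstrapping")
	}
	switch {
	case c.AutomaticRebootstrapping != nil:
		return *c.AutomaticRebootstrapping, warnings
	case c.RetryBootstrap != nil:
		return *c.RetryBootstrap, warnings
	default:
		return false, warnings
	}
}

func main() {
	on := true
	enabled, warnings := resolveAutoRebootstrap(agentConfig{RetryBootstrap: &on})
	fmt.Println(enabled, len(warnings)) // true 1
}
```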

What do you think?

@sorindumitru, I know you have thoughts on this, and I’d love to hear your perspective.

sorindumitru · 2025-03-10T15:45:50Z

@sorindumitru, I know you have thoughts on this, and I’d love to hear your perspective.

I think this all sounds good. We should also think about how to handle non-reattestable agents. There are two situations there:

The agent doesn't have an identity yet. We should retry, we don't even know if it's reattestable or not at this point.
The agent already has an identity. Retrying isn't likely to help in this case, but maybe we should retry anyway in case the agent gets evicted?

There are also some cases where we decide to remove the cached agent SVID. We should make sure we don't do that for non-reattestable agents.
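
The cases above can be written down as a small decision table. This is only a sketch: `recoveryPlan` and `planRecovery` are hypothetical names, and whether to retry for an agent that already has an identity but cannot reattest is left as a parameter because that question is still open here.

```go
package main

import "fmt"

// recoveryPlan is a hypothetical summary of what the agent may do when it
// no longer recognizes the server's authority.
type recoveryPlan struct {
	Retry             bool // keep retrying (re)bootstrap
	MayDropCachedSVID bool // removing the cached agent SVID is acceptable
}

// planRecovery encodes the situations discussed above.
// retryNonReattestable captures the undecided "retry anyway?" question.
func planRecovery(hasIdentity, reattestable, retryNonReattestable bool) recoveryPlan {
	switch {
	case !hasIdentity:
		// No identity yet: reattestability is unknown, so retry.
		// There is no cached SVID to drop.
		return recoveryPlan{Retry: true}
	case reattestable:
		// The agent can obtain a fresh SVID by attesting again.
		return recoveryPlan{Retry: true, MayDropCachedSVID: true}
	default:
		// Non-reattestable agent with an identity: never drop the SVID.
		return recoveryPlan{Retry: retryNonReattestable}
	}
}

func main() {
	fmt.Printf("%+v\n", planRecovery(false, false, false))
	fmt.Printf("%+v\n", planRecovery(true, true, false))
	fmt.Printf("%+v\n", planRecovery(true, false, true))
}
```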

kfox1111 · 2025-03-10T17:17:44Z

Hi @amartinezfayo. Thanks for the review.

Totally agree on minimizing the flags. I was surprised to see retry_bootstrap in there while coding, and filed #5896 to discuss removing it in the longer term.

I tried adding just one option, rebootstrap_delay, that enables rebootstrapping. But the CLI parser library used can't make durations optional, so I had to add a second flag to gate it. Maybe there is a way to do that that I missed? Or maybe we don't support rebootstrap_delay on the CLI? Config file only would work for me.

As for picking a duration, I think that may be really hard to do for the user. Some folks may want it as an absolute last resort (more than a weekend, so a human is involved?), while some may want it extremely fast: the first time you see a failure, immediately rebootstrap. And probably a lot in between.

There's levels of safety in the system:

server cert in the established trust bundle - very very unlikely there is a security issue
bootstrapping - happens exceedingly infrequently, and often with a sysadmin involved in the process, so someone is actively looking for funny business.
rebootstrapping - happens potentially at any time. harder to protect against badness. Kind of up to each organization to decide the tradeoff between system unavailability and risk of recovery from a compromised server I think.

As for k8s support, I totally get that. I do most things in k8s myself. Though I think the same kind of thing can be handled by a k8s readiness probe: mark the pods unready if not attested/reattested. Then they can be alerted on via normal k8s means (pods stuck unready for more than n minutes) while still leaving control in the hands of spire-agent for when to (re)attest. It may require some changes between liveness and readiness probes too (not being ready for a while doesn't mean liveness should fire and kill it).

To get this working well, I'm guessing a timeout with WithMaxElapsedTime can't be as short as a minute. It would require multiple calls to the external URL for trust bundle fetching in short order, and with lots of agents doing it, could lead to a thundering herd issue. I think the readiness thing could cover making the MaxElapsedTime much longer.

@sorindumitru, thank you too for the review

There's two situations there:

The agent doesn't have an identity yet. We should retry, we don't even know if it's reattestable or not at this point.

The agent already has an identity. Retrying isn't likely to help in this case, but maybe we should retry anyway in case the agent gets evicted?

For 1, the PR here does that.

For 2, I'm on the fence. If the server trust is broken, the agent is broken. Rebootstrapping might not fully work, but it's still maybe better than just throwing x509 errors? I think there are probably some ways to rebootstrap and then reattest with the node cert anyway, with some extra code. Maybe that discussion should wait for a later phase though? I'm a little worried this PR is already complicated enough without that kind of thing. Maybe we just ignore the rebootstrap flag if reattestable == false for now, or throw an error on start in that case and revisit later?

Rebootstrap support - Phase 1

2962b75

Signed-off-by: Kevin Fox <[email protected]>

kfox1111 requested review from evan2645, amartinezfayo, sorindumitru, MarcosDY and rturner3 as code owners February 22, 2025 14:19

kfox1111 marked this pull request as draft February 22, 2025 14:19

kfox1111 added 12 commits February 23, 2025 17:04

Start extracting out the trust bundle fetching code

19dd73f

Signed-off-by: Kevin Fox <[email protected]>

Move bundle load logic to retry loop

4924fba

Signed-off-by: Kevin Fox <[email protected]>

More TrustBundleSources rework. Save state.

f97e87e

Signed-off-by: Kevin Fox <[email protected]>

Merge branch 'main' into rebootstrap

fb49c17

Fix lint

577e13f

Signed-off-by: Kevin Fox <[email protected]>

Fix lint issues

a117831

Signed-off-by: Kevin Fox <[email protected]>

Don't get rid of retry bootstrap for now.

5da19a7

Signed-off-by: Kevin Fox <[email protected]>

Add configuration options

1847fe1

Signed-off-by: Kevin Fox <[email protected]>

Manage insecure bootstrap flag via the trust bundle sources code

af2c4c7

Signed-off-by: Kevin Fox <[email protected]>

Fix some ws issues

88000ee

Signed-off-by: Kevin Fox <[email protected]>

Fix some issues

da6cf37

Signed-off-by: Kevin Fox <[email protected]>

Fix lint issues

fc9b632

Signed-off-by: Kevin Fox <[email protected]>

MarcosDY assigned amartinezfayo Feb 27, 2025

amartinezfayo reviewed Mar 10, 2025
