rfc: distributed coordinator #1078
I think we need to explain the security of the HA part a bit more. If we allow the Mesh components of the Coordinator to be set directly during automatic recovery, we lose protection against the Kubernetes admin/workload owner, because they can redirect the recovery to themselves and "provision" a coordinator with the values above.
The simplest case to think about is one Coordinator needing to be recovered and the workload owner answering the recovery call from this coordinator. The next time the user verifies the deployment, they see a valid chain of manifests and therefore trust the new MeshCACert. This allows the workload owner to man-in-the-middle the TLS connection from the data owner to the application.
While we have excluded this threat model from our current recovery, I think HA and auto-recovery can be implemented while securing against this threat as well, e.g., by only allowing recovery of coordinators that have the same hashes. Of course, this breaks the upgrade process, but I think we have to drop coordinator upgrades anyway in this threat model.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch, thanks!
We should aim to support recovery from heterogeneous coordinators if they are explicitly allowed by the manifest. An upgrading workload owner could first set a manifest including new coordinator policies, then deploy new coordinators, then remove old coordinators, then remove old coordinator policies from the manifest. A data owner would need to verify not only the current manifest, but also the history of allowed coordinators.
I think what we need are the following invariants:
(A) M only sends key material to pods that have the coordinator role in M.
(B) M uses only key material that it generated locally or that was received from a pod with the coordinator role in M.
(A) should be covered sufficiently by the current proposal text, and we could modify it to achieve (B) as follows: the recovering coordinator only accepts key material from pods that have the coordinator role in the temp manifest, delivered as part of the RecoverResponse.
I think that is a nice summary and solution. Though, I think we could omit the roles from the manifest if we need to simplify it, since in the eyes of the data owner all code of all components inside the mesh/deployment must be trusted anyway. I.e., the workload owner must never be able to impersonate any component of the deployment; otherwise, all guarantees shielding the data owner against the workload owner are broken.
This way, the verification of the endpoint that recovers the coordinator could be simplified: the recovering coordinator is first "initialized" by the running coordinator, like any other workload. Then the application logic notices that it was recovered and requests the locally kept keys from the running coordinator. The endpoint that is called here resides behind mTLS client auth.
The big trade-off here, I think, is that due to the asynchronous/split nature, there is more potential for future changes to break the security; on the other hand, the two parts are conceptually quite simple.
I think we should still keep roles, for the following reasons:
Just for the record, right now the initializers don't attest the Coordinator and we don't use a server cert for the meshapi. If we were using the root CA cert for the meshapi server, we'd still only be able to verify that
We still don't know whether (B) holds, though. If we add attestation to the coordinator, we could rely on a variant of (A) - "[...] sends the correct key material for its current manifest [...]" and thus know
But we're still missing the link between the manifest we're recovering and the seed. I'll try to work that part into the proposal.