
Consistency issues due to the use of mount binds #74

raskyld opened this issue Dec 6, 2024 · 5 comments


raskyld commented Dec 6, 2024

Hello,

We ran into a really tricky edge case with our use of https://github.com/cert-manager/csi-driver-spiffe.
Unlike most other Kubernetes CSI drivers, it uses a single tmpfs that contains both the volume metadata and the SPIFFE SVID certificates (x509 certificates issued by cert-manager). It then answers NodePublishVolume CSI RPC requests by creating a read-only bind mount from this tmpfs to the target_path given by the Container Orchestrator (in our case, Kubernetes).
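For readers unfamiliar with that flow, here is a rough sketch of what such a publish step looks like (my own illustration, not the driver's actual code; the package, function, and variable names are made up), using k8s.io/mount-utils:

```go
package driver

import (
	"fmt"

	mount "k8s.io/mount-utils"
)

// publishVolume bind mounts a per-volume directory living inside the driver's
// tmpfs onto the target_path that kubelet passes in the NodePublishVolume
// request, read-only so the workload cannot modify the certificates.
func publishVolume(mounter mount.Interface, tmpfsVolumeDir, targetPath string) error {
	if err := mounter.Mount(tmpfsVolumeDir, targetPath, "", []string{"bind", "ro"}); err != nil {
		return fmt.Errorf("bind mounting %s to %s: %w", tmpfsVolumeDir, targetPath, err)
	}
	return nil
}
```

If the driver pod restarts and recreates its tmpfs, any existing bind mount in a pod's mount namespace keeps pointing at the old, now-orphaned tmpfs, which is exactly the staleness described below.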

The issue is that, after a restart or failure of the driver, the bind mount no longer points to a valid tmpfs directory, so from the point of view of the pods the certificates are never renewed. One could argue this is fine as long as the csi-driver-spiffe instance is never restarted, but the problem is a bit more subtle than that!

Since this driver uses RequiresRepublish, kubelet issues a lot of repeated calls to NodePublishVolume, and any failure leads to an unmount, but this unmount is not propagated to the mount namespace of the running pods! Users are then forced to restart their pods manually.

This is a real problem when used with SPIFFE SVIDs, since they are typically short-lived and rotated often. A failure in this critical rotation mechanism leads to major workload disruptions.

I am revamping parts of this library in a fork to:

  1. Split metadata and data: metadata ends up in a local tmpfs, and data gets its own tmpfs mounted directly at target_path, where we perform atomic updates. No more bind mounts, so it is simpler to manage.
  2. If the target_path is already correctly mounted, avoid unmounting or making any other destructive changes, since running pods would not see them.
  3. Improve the IsMounted detection to ensure the target is not just any mount point but one set up as this library expects (see the sketch after this list).
  4. Add some metrics / stats collection.
  5. Change error reporting in the CSI RPCs, since errors there have side effects; see kubernetes/kubernetes#121271 (comment): "Kubelet deletes CSI mount points if a 'requiresrepublish' call to NodePublishVolume returns an error".
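As a sketch of point 3 (under my fork's assumptions, not csi-lib's current API), the idea is to verify not only that target_path is a mount point, but that it is the kind of mount this library would have created:

```go
package driver

import (
	mount "k8s.io/mount-utils"
)

// isExpectedMount returns true only if targetPath is currently a mount point
// *and* the mount looks like one created by this driver (a tmpfs here, as a
// hypothetical policy; a real check could also compare the mount source).
func isExpectedMount(targetPath string) (bool, error) {
	infos, err := mount.ParseMountInfo("/proc/self/mountinfo")
	if err != nil {
		return false, err
	}
	for _, mi := range infos {
		if mi.MountPoint == targetPath {
			return mi.FsType == "tmpfs", nil
		}
	}
	// Not a mount point at all.
	return false, nil
}
```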

Please tell me if any of these changes seems too risky, or if there is a good reason to use a bind mount instead of two separate mounts, etc.

Also, tell me if you are interested in getting that merged upstream :)

Thanks 🙏

munnerz (Member) commented Dec 6, 2024

Have you looked at how the csi-lib example handles this? https://github.com/cert-manager/csi-lib/blob/main/deploy/cert-manager-csi-driver.yaml#L104

Namely, we use a hostPath volume to pass through a directory from the host, and on startup csi-lib mounts a tmpfs into a directory inside it, with bidirectional mount propagation, allowing the temporary filesystem to persist across restarts of the CSI driver pod 😊

Hope that makes sense - this is the recommended way to configure/run drivers built on csi-lib.
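For reference, that startup step looks roughly like the following (a simplified sketch, not csi-lib's exact code): the data directory sits on a hostPath volume, and the driver mounts a tmpfs over it so secrets stay off the node's disk, while the mount survives driver pod restarts thanks to Bidirectional propagation:

```go
package driver

import (
	mount "k8s.io/mount-utils"
)

// ensureTmpfsDataDir mounts a tmpfs at dataDir unless a previous instance of
// the driver already did so (and the mount was propagated back via the
// Bidirectional hostPath volume), in which case it is simply reused.
func ensureTmpfsDataDir(dataDir string) error {
	mounter := mount.New("")
	notMnt, err := mounter.IsLikelyNotMountPoint(dataDir)
	if err != nil {
		return err
	}
	if !notMnt {
		// Already mounted by a previous instance: reuse it.
		return nil
	}
	return mounter.Mount("tmpfs", dataDir, "tmpfs", []string{"nosuid", "noexec"})
}
```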

munnerz (Member) commented Dec 6, 2024

It seems like you are already using a hostPath as recommended, but have not enabled Bidirectional mount propagation (meaning the tmpfs is not available via the hostPath in the new container, since that mount point only exists within the old one). Making it bidirectional allows the host to see the mount point too, which makes it accessible to any other containers that later start up and reference the same hostPath.
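Expressed with the Kubernetes API types rather than the actual deployment manifest (the volume name and path below are placeholders), the volumeMount in question would carry Bidirectional propagation like this:

```go
package driver

import (
	corev1 "k8s.io/api/core/v1"
)

// dataDirVolumeMount mounts the hostPath-backed data directory with
// Bidirectional propagation, so mounts created inside the driver container
// (such as the tmpfs above) become visible on the host and to containers
// started later that reference the same hostPath.
func dataDirVolumeMount() corev1.VolumeMount {
	bidirectional := corev1.MountPropagationBidirectional
	return corev1.VolumeMount{
		Name:             "csi-data-dir",  // placeholder volume name
		MountPath:        "/csi-data-dir", // placeholder path inside the container
		MountPropagation: &bidirectional,
	}
}
```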

raskyld (Author) commented Dec 6, 2024

Hello!

Thanks a lot for your prompt answer! 🙏
First, a disclaimer: I am not affiliated with the maintainers of the linked CSI driver, so I don't know the motivations behind some implementation details 😬

So, just to be sure I understand: the tmpfs they mount at start time in the CSI data dir is not propagated to the host, which means that when the instance restarts, the tmpfs is lost from the host's point of view, right?

But then, is this bind mount from the tmpfs (mounted inside the hostPath volume with bidirectional propagation) to /var/lib/kubelet/pods/... (i.e., the target_path of the NodePublishVolume call) something that is used in other projects as well?

raskyld (Author) commented Dec 6, 2024

I think what I have a hard time understanding is how this bind mount (made inside the context of the driver pod) then propagates to the host. Will Bidirectional propagation also persist the bind mount from the host's point of view?

EDIT: Just realised it may be propagated via this line instead: https://github.com/cert-manager/csi-lib/blob/main/deploy/cert-manager-csi-driver.yaml#L101

Then I don't even understand how the kernel dealt with the bind mount, since the host was not aware of the "source" of the bind mount 🤯

raskyld (Author) commented Dec 6, 2024

I have a second question, more about RequiresRepublish: is it something battle-tested and/or widely used with csi-lib?
In my understanding, any failure occurring at this level https://github.com/cert-manager/csi-lib/blob/main/driver/nodeserver.go#L51 will lead to an unmount, which should typically be avoided when using RequiresRepublish because it leaves pods with a "stale" mount point.

CSI drivers should only atomically update the contents of the volume. Mount point change will not be seen by a running container.

(Source: https://kubernetes-csi.github.io/docs/csi-driver-object.html under requiresRepublish notes)
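To make the quoted requirement concrete, here is a minimal sketch of the "atomically update the contents" pattern (my own illustration, not csi-lib's actual implementation): write the new data to a temporary file on the same filesystem as the volume, then rename it over the old file, so readers only ever see the old or the new content, never a partial write.

```go
package driver

import (
	"os"
	"path/filepath"
)

// atomicWriteFile replaces volumeDir/name with data without ever exposing a
// partially written file to readers in the pod's mount namespace.
func atomicWriteFile(volumeDir, name string, data []byte) error {
	tmp, err := os.CreateTemp(volumeDir, name+".tmp-*")
	if err != nil {
		return err
	}
	defer os.Remove(tmp.Name()) // best-effort cleanup if we fail before the rename

	if _, err := tmp.Write(data); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	// rename(2) is atomic within a filesystem: the path flips from the old
	// content to the new in one step, even for already-running containers.
	return os.Rename(tmp.Name(), filepath.Join(volumeDir, name))
}
```

For a set of files that must change together, the same idea extends to writing a whole new directory and swapping a symlink, the way kubelet's atomic writer handles Secret and ConfigMap volumes.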
