Description
Hello,
We ended-up in a really edgy problem with our https://github.com/cert-manager/csi-driver-spiffe.
Basically, unlike most other Kubernetes CSI drivers, it uses a first tmpfs
that will contains both metadata for volumes and SPIFFE SVID certificates (x509 certificates issued by cert-manager). Then, it answers to NodePublishVolume
CSI RPC requests by making a mount
with bind
+ ro
from this tmpfs
to the target_path
given by the Container Orchestrator (in our case Kube).
The issue is that, on restart or failure, the bind no longer point to the valid tmpfs
directory, we end up with a never renewed certificates from the POV of pods. One could say, it's fine if the instance of csi-driver-spiffe is never rebooted, you never have to care about that. But the issue is a bit more complex than that!
Since this driver uses RequiresRepublish
, kubelet
will issue A LOT of repeated calls to NodePublishVolume
and any failure will lead to an unmount
, but this will not be propagated to the mount_namespace
of the running pods!. Hence, requiring users to restart the pod manually.
This is a real problem when used with SPIFFE SVID since they are typically short-lived and rotated often. A failure in this critical rotation mechanism leads to major workload disruptions.
I am revamping parts of this library in a fork to:
- Split metadata and data: metadata ends up in a local
tmpfs
and data gets its owntmpfs
directly mounted attarget_path
where we will perform atomic updates. No more mount binds so it is simpler to manage. - If the
target_path
is already correctly mounted, avoid unmount or doing any destructive changes since the running pods wouldn't see that. - Improve the
IsMounted
detection to ensure it's not just any mount point but one as expected by this library. - Add some metrics / stats collection.
- Change error reporting in CSI RPC since they have side-effects, please see: Kubelet deletes CSI mount points if a "requiresrepublish" call to NodePublishVolume returns an error kubernetes/kubernetes#121271 (comment)
Please, tell me if any of those changes seems too risky / if we have a good reason to have a mount bind instead of two separate mounts etc..
Also, tell me if you are interested in getting that merged upstream :)
Thanks 🙏