Description
I was working on debugging compatibility with the CephCSI driver and stumbled upon an issue in how plugin names and aliases are handled.
Installed the driver using this cmd:
```bash
docker plugin rm rbd.csi.ceph.com --force && \
docker plugin install <privaterepo>/ceph-csi-swarm/cephcsi-aio:canary \
  --grant-all-permissions \
  NODE_ID=test-docker-1 \
  CEPHCSI_VERBOSITY=10 \
  DEBUG_ENTRYPOINT=true \
  --alias rbd.csi.ceph.com
```
Created a CSI volume and a service using it fine. Basically, everything worked up until volume publishing, where I was met with an error message from Docker stating:
CSI node ID not found for given Swarm node ID.
Looking at the code, that error is only logged if this lookup fails:
```go
csiNodeID := p.swarmToCSI[nodeID]
if csiNodeID == "" { ... }
```
There are two maps in the plugin object, swarmToCSI and csiToSwarm.
These are correctly populated upon daemon startup, which I checked using debug statements. However, at the point in time where these maps are accessed during publishing, they appear to be empty.
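For reference, here is a stripped-down sketch of that bookkeeping, assuming the two maps simply pair Swarm node IDs with CSI node IDs. Apart from the swarmToCSI/csiToSwarm field names, everything here (the struct, recordNodeInfo, the IDs) is illustrative, not the actual swarmkit code:

```go
package main

import "fmt"

// plugin stands in for swarmkit's per-driver object holding the two maps.
type plugin struct {
	swarmToCSI map[string]string // Swarm node ID -> CSI node ID
	csiToSwarm map[string]string // CSI node ID -> Swarm node ID
}

// recordNodeInfo is what populating the maps at daemon startup amounts to.
func (p *plugin) recordNodeInfo(swarmNodeID, csiNodeID string) {
	p.swarmToCSI[swarmNodeID] = csiNodeID
	p.csiToSwarm[csiNodeID] = swarmNodeID
}

func main() {
	p := &plugin{swarmToCSI: map[string]string{}, csiToSwarm: map[string]string{}}
	p.recordNodeInfo("swarm-node-1", "test-docker-1")

	// The publish-time lookup from the snippet above: it succeeds on a
	// populated instance, but returns "" on one whose maps were never filled.
	if csiNodeID := p.swarmToCSI["swarm-node-1"]; csiNodeID == "" {
		fmt.Println("CSI node ID not found for given Swarm node ID")
	} else {
		fmt.Println("resolved CSI node ID:", csiNodeID)
	}
}
```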
I further tested this by logging the pointer of the underlying plugin object, which confirmed there were two "virtual" instances of the same plugin at runtime: the pointer logged at startup differed from the pointer logged at use.
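For anyone reproducing this, the check itself is just pointer formatting; the snippet below is a minimal, hypothetical illustration of it rather than the actual debug statements I added:

```go
package main

import "log"

// plugin stands in for the manager's per-driver object; only its identity matters here.
type plugin struct{ name string }

func main() {
	atStart := &plugin{name: "rbd.csi.ceph.com"}
	atUse := &plugin{name: "rbd.csi.ceph.com"} // a second instance carrying the same driver name

	// %p prints the address, so two different instances are immediately visible.
	log.Printf("at start: %p", atStart)
	log.Printf("at use:   %p", atUse)
	log.Printf("same instance: %v", atStart == atUse) // false
}
```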
Looking at where plugin object instances are created (manager/csi/manager.go) and the usages thereof, it seems like handleNode gets the plugin's name from info.PluginName:

swarmkit/manager/csi/manager.go, line 406 in 3a23580:

```go
p, err := vm.getPlugin(info.PluginName)
```

whereas the volume-handling side looks the plugin up by v.Spec.Driver.Name.
This effectively causes two driver objects to be created from one installed driver: one keyed by the name with the tag from the alias, and one keyed by the name without it. One instance is correctly populated with values at startup, but the other, unpopulated instance is what is actually used later in the code, causing the CSI driver to fail at NodePublishVolume.
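To make the failure mode concrete, here is a self-contained sketch of it, assuming getPlugin lazily creates and caches plugin objects keyed by the exact name string it is given. Which code path sees the tagged name is an assumption here, and none of this is the actual swarmkit implementation:

```go
package main

import "fmt"

// plugin stands in for the per-driver object holding the node ID maps.
type plugin struct {
	name       string
	swarmToCSI map[string]string
}

// manager caches plugin objects keyed by the exact name it is handed.
type manager struct {
	plugins map[string]*plugin
}

// getPlugin creates the object on first use; "rbd.csi.ceph.com" and
// "rbd.csi.ceph.com:latest" are different keys, so they yield different objects.
func (m *manager) getPlugin(name string) *plugin {
	p, ok := m.plugins[name]
	if !ok {
		p = &plugin{name: name, swarmToCSI: map[string]string{}}
		m.plugins[name] = p
	}
	return p
}

func main() {
	m := &manager{plugins: map[string]*plugin{}}

	// Node-info path: resolves the plugin by the installed name (here assumed to carry the default tag).
	fromNodeInfo := m.getPlugin("rbd.csi.ceph.com:latest")
	fromNodeInfo.swarmToCSI["swarm-node-1"] = "test-docker-1"

	// Volume path: resolves it by the name written in the volume spec (no tag).
	fromVolumeSpec := m.getPlugin("rbd.csi.ceph.com")

	fmt.Println("same instance:", fromNodeInfo == fromVolumeSpec)            // false
	fmt.Println("mappings seen at publish:", len(fromVolumeSpec.swarmToCSI)) // 0 -> "CSI node ID not found"
}
```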
To fix this I modified the getPlugin method to normalize/canonicalize plugin names correctly.
https://github.com/ppignet/swarmkit/tree/csi/manager_normalize_plugin_name
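As a rough sketch of the kind of normalization meant here (assuming the only mismatch is the implicit default tag; the actual change lives in the branch above and may differ):

```go
package main

import (
	"fmt"
	"strings"
)

// normalizePluginName maps "name" and "name:latest" to the same key, so both
// the node-info path and the volume-spec path resolve to one plugin object.
func normalizePluginName(name string) string {
	return strings.TrimSuffix(name, ":latest")
}

func main() {
	fmt.Println(normalizePluginName("rbd.csi.ceph.com:latest")) // rbd.csi.ceph.com
	fmt.Println(normalizePluginName("rbd.csi.ceph.com"))        // rbd.csi.ceph.com
}
```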
I confirmed that this fixes the issue by verifying the pointer is now identical at startup and upon usage, and by being able to spin up many volumes and services without errors anymore! 😺
Let me know if this is a non-issue and just me doing something wrong. But I think this addition would make the name handling more fail-proof in any case.