Skip to content

manager/csi/manager.go: getPlugin: Issue with handling plugin names related to image tags #3215

@ppignet

Description

@ppignet

I was working on debugging compatibility with the CephCSI driver and stumbled upon an issue in how plugin names and aliases are handled.

Installed the driver using this cmd:

docker plugin rm rbd.csi.ceph.com --force && \
docker plugin install <privaterepo>/ceph-csi-swarm/cephcsi-aio:canary \
--grant-all-permissions \
NODE_ID=test-docker-1 \
CEPHCSI_VERBOSITY=10 \
DEBUG_ENTRYPOINT=true \
--alias rbd.csi.ceph.com

Created CSI volume and a service using it fine. Basically, everything worked up until volume publishing, where I was met with an error message from Docker stating:
CSI node ID not found for given Swarm node ID.

Looking at the code:

That log only triggers if this lookup fails:

csiNodeID := p.swarmToCSI[nodeID]
if csiNodeID == "" { ... }

There's two maps in the plugin object swarmToCSI and csiToSwarm.
These are correctly populated upon daemon startup, which I checked using debug statements:

Image

At the point in time where these maps are accessed, they seem to be emtpy:

Image

I further tested this by logging the ptr of the underlying plugin object which confirmed there were two "virtual" instances of the same plugin at runtime:
At start:

Image

At use:

Image

Looking at where plugin object instances are created, manager/csi/manager.go, and the usages thereof, it seems like handleNode gets the plugin's name from info.PluginName, (

p, err := vm.getPlugin(info.PluginName)
) whereas all other codepaths get it from v.Spec.Driver.Name.

This effectively causes two driver objects to be created from one installed driver, one with the tag in the alias and one without, where one instance is correctly populated with values at startup, but the other unpopulated instance is what is actually used later in the code, causing the CSI driver to fail at NodePublishVolume.


To fix this I modified the getPlugin method to normalize/canonicalize plugin names correctly.
https://github.com/ppignet/swarmkit/tree/csi/manager_normalize_plugin_name

I confirmed that this fixes the issue by veryfing the ptr is now identical:

upon startup:
Image

upon usage:

Image Image

..., and now being able to spin up many volumes and services without errors anymore! 😺

Lmk if this is a non-issue and just me doing something wrong. But I think this addition would make handling more fail-proof in any case.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions