
Conversation


@sunnylovestiramisu sunnylovestiramisu commented Sep 8, 2025

What type of PR is this?
/kind bug

What this PR does / why we need it:
This is the option 4 discussed in Leaked Volume Fix in external-provisioner Google doc.

The short-term fix does not solve the issue of the csi-provisioner restarting during a long provisioning, but it is still an improvement. The short-term fix also does not conflict with the other long-term fixes if we want to revisit them in the future.

Which issue(s) this PR fixes:

kubernetes-sigs/sig-storage-lib-external-provisioner#152

Special notes for your reviewer:
I am making the changes in vendor/sigs.k8s.io/sig-storage-lib-external-provisioner/v11 to get an initial review; I will move the change to sig-storage-lib-external-provisioner and bump its major version once this review is in good shape.

Tested via the following steps.

  1. Run kubetest --build --up from a k/k repo
  2. Add wait time to the PDCSI driver, similar to this branch
  3. Make the PDCSI driver consume the external-provisioner change in this PR and deploy the PDCSI driver
  4. Find the node of the csi-gce-pd-controller, for example:
k get pod -n gce-pd-csi-driver 
NAME                                    READY   STATUS             RESTARTS        AGE
csi-gce-pd-controller-8dffb78ff-2xzbq   5/5     Running            0               27m
csi-gce-pd-node-54b8l                   2/2     Running            0               27m
csi-gce-pd-node-5fcwm                   2/2     Running            0               27m
csi-gce-pd-node-7g72v                   0/2     CrashLoopBackOff   15 (100s ago)   27m

csi-gce-pd-controller-8dffb78ff-2xzbq is on node e2e-test-songsunny-minion-group-kncr. For the testing pod below, specify a nodeSelector that selects a node other than e2e-test-songsunny-minion-group-kncr.
  5. Apply the following objects:

kubectl apply -f - <<EOF
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: pd-storage
provisioner: pd.csi.storage.gke.io
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
parameters:
  type: pd-balanced
EOF

kubectl apply -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pd-pvc
spec:
  storageClassName: pd-storage
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 67Gi
EOF

kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: test-pod
spec:
  containers:
  - name: nginx
    image: nginx
    volumeMounts:
    - name: pd-volume
      mountPath: /data/pd
  nodeSelector:
    kubernetes.io/hostname: e2e-test-songsunny-minion-group-302x
  volumes:
  - name: pd-volume
    persistentVolumeClaim:
      claimName: pd-pvc
      readOnly: false
EOF
  6. Delete the node e2e-test-songsunny-minion-group-302x, and then the Pod
  7. Make sure the PV object can still be created even after the node is deleted

Does this PR introduce a user-facing change?:

Add an in-memory cache of Node/CSINode info during long provisioning to prevent volume leaks

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/bug Categorizes issue or PR as related to a bug. labels Sep 8, 2025
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Sep 8, 2025
@sunnylovestiramisu (Contributor Author):

/assign @msau42

@@ -1486,31 +1486,19 @@ func (ctrl *ProvisionController) provisionClaimOperation(ctx context.Context, cl
return ProvisioningFinished, errStopProvision
}

var selectedNode *v1.Node
var selectedNodeName string
Contributor Author:

@jsafrane Should we just make a new major version instead of making it an option for other provisioners?

Contributor:

This is a breaking change, so yes, it will require a new major version.
I don't see where you make it an option; this PR looks quite backward-incompatible. And I am not against that.

Collaborator:

The alternative, if we think non-CSI provisioners still want the old behavior, is to add the new changes as a feature gate. I'm not aware of any non-CSI provisioners that support topology, though, and they would still run into the topology bug if we keep the old behavior.

So I am also in favor of just making a new major.

@@ -417,7 +417,7 @@ func main() {
if supportsMigrationFromInTreePluginName != "" {
provisionerOptions = append(provisionerOptions, controller.AdditionalProvisionerNames([]string{supportsMigrationFromInTreePluginName}))
}

pvcNodeStore := ctrl.NewInMemoryStore()
Collaborator:

If the driver doesn't support topology, then I think the cache is not needed.

Contributor Author:

So we need to add a topology check around all the cache code.

// a TopologyInfo object by its name.
// The name is pvcNamespace/pvcName
type TopologyProvider interface {
GetByName(name string) (*TopologyInfo, error)
Collaborator:

To make it very clear what the key is, what if we make it v1.PersistentVolumeClaim* instead of a string?

Contributor Author:

What happens if PVC is deleted? Also currently we only pass in pvcName and pvcNamespace in the GenerateAccessibilityRequirements. Can changing to a pointer to PVC cause problems?

// First, check if the key exists to provide a helpful error.
_, found := s.data[name]
if !found {
return fmt.Errorf("cannot delete: object with name '%s' not found", name)
Collaborator:

Do we want to error if it doesn't exist? I don't think the caller will do anything other than ignore the error. If you do want to return an error, I would make an explicit error type for NotFoundError, so that the caller can distinguish it from other errors.

Contributor:

+1 for returning no error.

Contributor Author:

Do we want to just log a warning and not return an error?

return nil, err
}
if len(cacheInfo.TopologyKeys) == 0 {
return nil, fmt.Errorf("no topology key found in the CSINode in memory cache %s", nodeName)
Collaborator:

Do we need a special error message just for no keys/labels from the cache? We already have a warning message that we're going to look up the cache.

Contributor Author (@sunnylovestiramisu, Sep 9, 2025):

We do not need the special error message, but we should return an error. Maybe errors.New("")?

if selectedNode != nil {
selectedCSINode, err = getSelectedCSINode(csiNodeLister, selectedNode)
var topologyKeys []string
if len(selectedNodeName) > 0 {
Collaborator (@msau42, Sep 9, 2025):

This logic is very hard to follow with the cache lookups and updates being interspersed with condition checks. I think it would be easier to read if we wrapped the informer/cache lookup logic in one call, and kept the core GenerateAccessibility function clean with just business logic. Something like:

func topologyLookup(selectedNodeName) {

csiNode, err1 := getSelectedCSINode(csiNodeLister, selectedNodeName)
node,err2 := getNode(nodeLister, selectedNodeName)

if csiNode == nil || node  == nil {
  topologyInfo = cache.getTopologyInfo(pvc)
} else {
  topologyInfo = { getTopologyKeys(csiNode), getNodeLabels(node, topologyKeys)}
  cache.addToplogyInfo(topologyInfo)
}

return topologyInfo
}

Collaborator:

We can also consider looking up cache first, then if it doesn't exist in cache, then lookup in informers. I think switching that logic will solve the immediate binding scenario I outlined in #1413 (comment)

Contributor:

I would prefer checking the informers first, the node might have gotten new labels as the CSI driver just got installed there.

}
selectedNodeName = nodeName
} else {
logger.Info("No selected node annotation")
Collaborator:

This will spam the logs for drivers that don't support topology or that use Immediate binding.

Contributor Author:

Okay, so we just do not log anything.

return term, node.Labels, false
}
} else {
// nodeLister is nil, read from the cache directly
Collaborator:

Is this possible? I think nodeLister is required.

Contributor Author:

The original logic in the lib is:

if ctrl.nodeLister != nil {
	selectedNode, err = ctrl.nodeLister.Get(nodeName)
} else {
	selectedNode, err = ctrl.client.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{}) // TODO (verult) cache Nodes
}

So I was thinking nodeLister can be nil? I am not sure when it will occur, though.

Collaborator:

In the csi-provisioner it is mandatory as long as the plugin supports topology: https://github.com/kubernetes-csi/external-provisioner/blob/master/cmd/csi-provisioner/csi-provisioner.go#L339

Contributor Author:

Okay, then we can remove the nodeLister nil check.

klog.Warningf("no node labels for node %s found in memory cache", nodeName)
return nil, nil, true
}
for _, key := range topologyKeys {
Collaborator:

This logic looks repeated 3 times in this function. It would be better to share it across all 3 branches.

Contributor Author:

Removed one for the nodeLister-is-nil code. However, of the remaining two, one reads nodeLabels from the cache and the other reads them from the node. I think we can extract them into a processTerm func.


msau42 commented Sep 9, 2025

For your e2e test case steps, can you also test deleting the PVC object along with the Pod and Node to trigger the leak? I guess the leak is triggered by 2 scenarios:

  1. PVC is deleted along with Pod and Node
  2. PVC is not deleted, but Pod and Node are deleted. AND the scheduler retries in a different zone.

Can you verify you can repro the issue with both those scenarios, and that the fix works in both those scenarios?

I'm trying to think if AllowedTopologies or Immediate binding mode would be impacted. AllowedTopologies shouldn't be impacted because we grab the topology directly from it. I think Immediate binding could be impacted in a scenario like:

  1. Start provisioning in zone A.
  2. During provisioning, scale down all the Nodes in zone A.
  3. Future retries will either error if no zones are found, or it will retry with a zone B, leaking the volume being provisioned in zone A.


msau42 commented Sep 9, 2025

Also in the testing steps, with the fix, verify that the PV object is created and then deleted when the disk gets deleted.


// TopologyProvider is an interface that defines the behavior for looking up
// a TopologyInfo object by its name.
// The name is pvcNamespace/pvcName
Contributor:

IMHO the key should be PVC UID.

Consider this scenario:

  1. User creates a PVC
  2. The provisioning takes a long time, i.e. many timeouts + retries.
  3. User deletes the PVC and creates a new one, with the same name, but different StorageClass.
  4. The original PVC is still being provisioned. Although it has been deleted from the API server, the provisioner waits for a final error.
  5. The new PVC should be provisioned too.
    -> Two PVCs have the same name and are being provisioned, but they have a different StorageClass and thus different topology cache.

PVC UID will be unique for these PVCs.

Contributor Author:

And PVC UID guarantees uniqueness as well.
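A minimal sketch of the UID-keyed cache suggested in this thread, so that a delete-and-recreate of a PVC with the same name/namespace can never collide. The UID type here mirrors k8s.io/apimachinery/pkg/types.UID; everything else is a placeholder, not the PR's actual code:

```go
package main

import "fmt"

// UID mirrors k8s.io/apimachinery/pkg/types.UID (a string alias).
type UID string

// TopologyInfo is a stand-in for the cached node topology data.
type TopologyInfo struct {
	TopologyKeys []string
	NodeLabels   map[string]string
}

// InMemoryStore keyed by PVC UID instead of pvcNamespace/pvcName, so two
// PVCs that share a name but have different StorageClasses (and therefore
// different topology) get distinct cache entries.
type InMemoryStore struct {
	data map[UID]*TopologyInfo
}

func NewInMemoryStore() *InMemoryStore {
	return &InMemoryStore{data: map[UID]*TopologyInfo{}}
}

func (s *InMemoryStore) Add(uid UID, info *TopologyInfo) { s.data[uid] = info }

func (s *InMemoryStore) GetByUID(uid UID) (*TopologyInfo, bool) {
	info, ok := s.data[uid]
	return info, ok
}

func main() {
	s := NewInMemoryStore()
	// Two PVCs both named default/pd-pvc, created one after the other:
	// same name, different UIDs, so the entries never collide.
	s.Add("uid-old", &TopologyInfo{TopologyKeys: []string{"zone"}})
	s.Add("uid-new", &TopologyInfo{TopologyKeys: []string{"region"}})
	old, _ := s.GetByUID("uid-old")
	fmt.Println(old.TopologyKeys[0]) // zone
}
```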

mayReschedule,
state,
err)
if IsFinalError(err) {
Contributor:

If I understand it correctly, as long as the PVC exists and the provisioner lib calls Provision(), the external-provisioner re-loads the topology from the API server, adds it to the cache and deletes it from the cache on every final error.

It's actually a smart idea that simplifies the cache cleanup. It's just not very obvious from the cache name and its comments; I somehow expected it to cache the topology forever and delete it after the PVC is deleted. A comment in the right place (the struct definition?) would help.

@k8s-ci-robot:

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: sunnylovestiramisu
Once this PR has been reviewed and has the lgtm label, please ask for approval from msau42. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot:

@sunnylovestiramisu: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-kubernetes-csi-external-provisioner-unit 98536bb link true /test pull-kubernetes-csi-external-provisioner-unit

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@@ -50,22 +51,24 @@ func (s *InMemoryStore) Add(name string, info *TopologyInfo) {

// Delete implements the TopologyProvider interface.
// It uses the built-in delete() function to remove the item from the map.
// The entry is deleted when provision succeeds or returns a final error.
Collaborator:

This is true regardless of the implementation. I would probably put it as a comment on the interface.

@@ -692,7 +692,7 @@ func (p *csiProvisioner) prepareProvision(ctx context.Context, claim *v1.Persist
requirements, err := GenerateAccessibilityRequirements(
p.client,
p.driverName,
claim.Namespace,
string(claim.UID),
Collaborator:

Do we need to convert the key to a string? Can we use the direct type as the key?

@@ -105,6 +106,78 @@ func SupportsTopology(pluginCapabilities rpc.PluginCapabilitySet) bool {
utilfeature.DefaultFeatureGate.Enabled(features.Topology)
}

func topologyLookup(
Collaborator:

This is not quite what I meant about the wrapper function.

I think it would be simpler to understand the code if things like obj lookup (whether from informer or cache) is abstracted from business logic (how to behave when there are no keys, or strictTopology mode is set, etc).

So I think a better structure for GenerateAccessibilityRequirements would be something like:

func GenerateAccessibilityRequirements(...) (...) {

  // This returns the information from informer or cache, the caller doesn't care
   topologyInfo = topologyLookup(selectedNode)

  // Nothing in the logic after this should care if the information came from the cache or informer

   if len(topologyKeys) == 0 {
     // business logic to handle it
   }

  if strictTopology {
    // business logic to handle it
  }

  // everything else
}
