Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Retry obtaining KUBECONFIG file lock #3782

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

stmcginnis
Copy link
Contributor

When there are multiple concurrent create or delete operations, there can be different processing trying to lock the KUBECONFIG file at the same time. This leads to small windows where getting the lock files.

We don't fail the operation if something like "kind delete cluster" isn't able to remove its settings from KUBECONFIG. This is ideal and what most users would expect, but it also means there may be some leftover crud left behind that they are not aware of.

This adds some basic retries to obtaining the file lock. It potentially adds a very small delay (in a much longer process), but makes it more likely that the process is able to get the lock and complete the KUBECONFIG update.

Related: #3781

When there are multiple concurrent create or delete operations, there
can be different processing trying to lock the KUBECONFIG file at the
same time. This leads to small windows where getting the lock files.

We don't fail the operation if something like "kind delete cluster"
isn't able to remove its settings from KUBECONFIG. This is ideal and
what most users would expect, but it also means there may be some
leftover crud left behind that they are not aware of.

This adds some basic retries to obtaining the file lock. It potentially
adds a very small delay (in a much longer process), but makes it more
likely that the process is able to get the lock and complete the
KUBECONFIG update.

Signed-off-by: Sean McGinnis <[email protected]>
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Nov 7, 2024
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: stmcginnis
Once this PR has been reviewed and has the lgtm label, please assign bentheelder for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot requested a review from aojea November 7, 2024 19:39
@k8s-ci-robot k8s-ci-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Nov 7, 2024
if err != nil {
return err

// Retry obtaining the file lock a few times to accommodate concurrent operations.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure this is safe, if there is a lock we're racing with another process anyhow? (which may logically conflict) and we're deviating in behavior from client-go

I think we probably need to retry the entire kubeconfig edit call instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this should be the safe route. Once we obtain the lock, there shouldn't be any concerns about other client-go (or other libraries) that also use file locks to update the KUBECONFIG. All of the update operations happen between that window of obtaining the file lock and removing it.

When we update the contents, we do both read and write within this locked section.

Same with removing an entry. The file lock is obtained first, and only then are updates made.

The only difference from client-go is it will immediately fail if another process already has that lock. We will be a little more robust and just retry, where typical operations should be complete within milliseconds.

So I believe the only risk if there is another process that is not using the same file locking mechanism to control access to KUBECONFIG. As long is it operates under the same principle of getting the lock before reading and updating, then the only difference is we might be able to wait until the other process is done before completely failing the operation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/S Denotes a PR that changes 10-29 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants