
enable mutation detection in e2e #6541

Open: wants to merge 1 commit into base: master

Conversation

XiShanYongYe-Chang
Member

What type of PR is this?

/kind cleanup

What this PR does / why we need it:

Since we have disabled deepcopy during list operations in #5813 , we should enable the MutationDetector in our e2e tests to prevent any future code from mutating the cache.

Which issue(s) this PR fixes:

Part of #6516

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

NONE

@karmada-bot karmada-bot added the kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. label Jul 16, 2025
@karmada-bot
Collaborator

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from xishanyongye-chang. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@karmada-bot karmada-bot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label Jul 16, 2025

@gemini-code-assist gemini-code-assist bot left a comment


Summary of Changes

Hello @XiShanYongYe-Chang, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request focuses on enhancing the stability and integrity of the Karmada control plane by enabling the MutationDetector within the karmada-controller-manager for e2e tests. This is a crucial step to prevent unintended modifications to cached objects, especially following a previous change that disabled deepcopy operations during list operations, thereby safeguarding against future cache mutation issues.

Highlights

  • Enable Cache Mutation Detection: I've updated the karmada-controller-manager deployment manifest (artifacts/deploy/karmada-controller-manager.yaml) to include a new environment variable, KUBE_CACHE_MUTATION_DETECTOR. This variable will control whether the cache mutation detector is active.
  • Automate Detector Activation in E2E: I've modified the hack/deploy-karmada.sh script to set the KUBE_CACHE_MUTATION_DETECTOR environment variable to true by default. This script now also dynamically injects this value into the karmada-controller-manager.yaml during deployment, ensuring the mutation detector is enabled for e2e tests.
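
As a rough illustration of the two highlights above, the default-then-render step might look like the sketch below. The `{{KUBE_CACHE_MUTATION_DETECTOR}}` placeholder comes from this PR; the `render_manifest` helper and its arguments are hypothetical, and the sketch writes to a separate output file rather than editing the manifest in place, which is not necessarily what the PR's script does.

```shell
# Default to "true" unless the caller has already exported a value.
KUBE_CACHE_MUTATION_DETECTOR="${KUBE_CACHE_MUTATION_DETECTOR:-true}"

# render_manifest SRC DST (hypothetical helper): write a rendered copy of SRC
# to DST, replacing the placeholder and leaving SRC untouched.
render_manifest() {
  sed -e "s/{{KUBE_CACHE_MUTATION_DETECTOR}}/${KUBE_CACHE_MUTATION_DETECTOR}/g" "$1" > "$2"
}

# Example usage (paths are placeholders, not run here):
#   render_manifest artifacts/deploy/karmada-controller-manager.yaml /tmp/karmada-controller-manager.yaml
```

Exporting `KUBE_CACHE_MUTATION_DETECTOR=false` before running such a script would keep the detector off for local deployments.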


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request enables the client-go cache mutation detector for the karmada-controller-manager in E2E tests. The implementation correctly uses the KUBE_CACHE_MUTATION_DETECTOR environment variable. A suggestion has been made to improve the deployment script by avoiding in-place modification of source files.

@@ -449,6 +453,9 @@ sed -i'' -e "s/{{ca_crt}}/${karmada_ca}/g" "${TEMP_PATH_BOOTSTRAP}"/bootstrap-to
sed -i'' -e "s|{{apiserver_address}}|${karmada_apiserver_address}|g" "${TEMP_PATH_BOOTSTRAP}"/bootstrap-token-configuration-tmp.yaml
kubectl --context="karmada-apiserver" apply -f "${TEMP_PATH_BOOTSTRAP}"/bootstrap-token-configuration-tmp.yaml

# render the KUBE_CACHE_MUTATION_DETECTOR in the karmada-controller-manager
sed -i'' -e "s/{{KUBE_CACHE_MUTATION_DETECTOR}}/${KUBE_CACHE_MUTATION_DETECTOR}/g" "${REPO_ROOT}/artifacts/deploy/karmada-controller-manager.yaml"


medium

Modifying the file in-place can lead to a dirty git working tree and potential issues if the script is interrupted. Copy the YAML file to a temporary location, modify the copy, and then apply it to ensure the source files remain untouched.

Suggested change
- sed -i'' -e "s/{{KUBE_CACHE_MUTATION_DETECTOR}}/${KUBE_CACHE_MUTATION_DETECTOR}/g" "${REPO_ROOT}/artifacts/deploy/karmada-controller-manager.yaml"
+ CONTROLLER_MANAGER_YAML_TMP="${TEMP_PATH_BOOTSTRAP}/karmada-controller-manager.yaml"
+ cp "${REPO_ROOT}/artifacts/deploy/karmada-controller-manager.yaml" "${CONTROLLER_MANAGER_YAML_TMP}"
+ sed -i'' -e "s/{{KUBE_CACHE_MUTATION_DETECTOR}}/${KUBE_CACHE_MUTATION_DETECTOR}/g" "${CONTROLLER_MANAGER_YAML_TMP}"
+ kubectl --context="${HOST_CLUSTER_NAME}" apply -f "${CONTROLLER_MANAGER_YAML_TMP}"

@codecov-commenter

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 45.43%. Comparing base (9966a3f) to head (607e7f7).
Report is 7 commits behind head on master.


Additional details and impacted files
@@           Coverage Diff           @@
##           master    #6541   +/-   ##
=======================================
  Coverage   45.43%   45.43%           
=======================================
  Files         687      687           
  Lines       56318    56334   +16     
=======================================
+ Hits        25587    25598   +11     
- Misses      29132    29138    +6     
+ Partials     1599     1598    -1     
Flag Coverage Δ
unittests 45.43% <ø> (+<0.01%) ⬆️


@RainbowMango
Member

Any documentation for the Mutation Detector feature?

@XiShanYongYe-Chang
Member Author

> Any documentation for the Mutation Detector feature?

I have not yet found the relevant functional introduction document.

@XiShanYongYe-Chang
Member Author

While running this test, I found that the controller-manager would panic. Here is an example of the error log:

CACHE *unstructured.Unstructured[68] ALTERED!
(*unstructured.Unstructured)({                                                                                                         (*unstructured.Unstructured)({
  Object: (map[string]interface {}) (len=4) {                                                                                            Object: (map[string]interface {}) (len=4) {
    (string) (len=8) "metadata": (map[string]interface {}) (len=6) {                                                                       (string) (len=8) "metadata": (map[string]interface {}) (len=4) {
      (string) (len=3) "uid": (string) (len=36) "2bfcde52-81a6-4a23-9915-57c8a8d355ed",                                                      (string) (len=15) "resourceVersion": (string) (len=4) "2855",
      (string) (len=15) "resourceVersion": (string) (len=4) "2855",                                                                          (string) (len=17) "creationTimestamp": (string) (len=20) "2025-07-16T01:45:03Z",
      (string) (len=17) "creationTimestamp": (string) (len=20) "2025-07-16T01:45:03Z",                                                       (string) (len=4) "name": (string) (len=32) "system:test-clusterrole-x525v-01",
      (string) (len=6) "labels": (map[string]interface {}) (len=1) {                                                                         (string) (len=3) "uid": (string) (len=36) "2bfcde52-81a6-4a23-9915-57c8a8d355ed"
        (string) (len=48) "clusterpropagationpolicy.karmada.io/permanent-id": (string) (len=36) "30033a80-2606-4bf5-9569-994581cc1973"     },
      },                                                                                                                                   (string) (len=5) "rules": (interface {}) <nil>,
      (string) (len=11) "annotations": (map[string]interface {}) (len=1) {                                                                 (string) (len=4) "kind": (string) (len=11) "ClusterRole",
        (string) (len=40) "clusterpropagationpolicy.karmada.io/name": (string) (len=17) "clusterrole-x525v"                                (string) (len=10) "apiVersion": (string) (len=28) "rbac.authorization.k8s.io/v1"
      },                                                                                                                                 }
      (string) (len=4) "name": (string) (len=32) "system:test-clusterrole-x525v-01"                                                    })
    },
    (string) (len=5) "rules": (interface {}) <nil>,
    (string) (len=4) "kind": (string) (len=11) "ClusterRole",
    (string) (len=10) "apiVersion": (string) (len=28) "rbac.authorization.k8s.io/v1"
  }
})


panic: cache *unstructured.Unstructured modified

goroutine 1638 [running]:
k8s.io/client-go/tools/cache.(*defaultCacheMutationDetector).CompareObjects(0xc000ee5ae0)
        /root/go/src/github.com/karmada-io/karmada/vendor/k8s.io/client-go/tools/cache/mutation_detector.go:165 +0x58e
k8s.io/client-go/tools/cache.(*defaultCacheMutationDetector).Run(0xc000ee5ae0, 0xc0012902a0)
        /root/go/src/github.com/karmada-io/karmada/vendor/k8s.io/client-go/tools/cache/mutation_detector.go:109 +0x149
k8s.io/client-go/tools/cache.(*sharedIndexInformer).RunWithContext.(*Group).StartWithChannel.func3()
        /root/go/src/github.com/karmada-io/karmada/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:55 +0x1b
k8s.io/apimachinery/pkg/util/wait.(*Group).Start.func1()
        /root/go/src/github.com/karmada-io/karmada/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:72 +0x4c
created by k8s.io/apimachinery/pkg/util/wait.(*Group).Start in goroutine 1364
        /root/go/src/github.com/karmada-io/karmada/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:70 +0x73

@zhzhuang-zju
Contributor

> > Any documentation for the Mutation Detector feature?
>
> I have not yet found the relevant functional introduction document.

Logs from the karmada-controller-manager component can indicate whether any cache modifications have occurred

@XiShanYongYe-Chang
Member Author

> Logs from the karmada-controller-manager component can indicate whether any cache modifications have occurred

Yes, they can. Unfortunately, the panic log does not directly indicate which specific line of code caused the issue.
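
One way to at least confirm that a restart came from the mutation detector is to search the logs of the previous container instance for the two marker lines shown in the error log earlier in this thread. A minimal sketch, assuming the pod name is known: `find_mutation_reports` is a hypothetical helper, and the patterns are taken from that sample log only, so other client-go versions may word these lines differently.

```shell
# find_mutation_reports (hypothetical helper): read controller-manager logs on
# stdin and print only the lines that the sample log shows for a detected
# cache mutation. Exits non-zero when no such line is found.
find_mutation_reports() {
  grep -E 'CACHE .* ALTERED!|panic: cache .* modified'
}

# Example usage (not run here; <pod-name> is a placeholder):
#   kubectl --namespace karmada-system logs --previous <pod-name> | find_mutation_reports
```

This narrows the problem to the mutated object type, but it still does not point at the code that performed the mutation.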

@zhzhuang-zju
Contributor

@XiShanYongYe-Chang While resolving #6513, I discovered that the detector performs an operation that modifies the cache. I will submit a PR to address this issue and then check whether it passes this mutation detection after the fix.

@zhzhuang-zju
Contributor

@XiShanYongYe-Chang In #6544, I mitigated the behavior that mutates the informer cache, and the logs suggest that this issue has been resolved. However, I have a concern:
Setting the environment variable KUBE_CACHE_MUTATION_DETECTOR appears to make the pod panic when a cache mutation is detected. However, since many of our e2e tests retry on failure, a test can still pass after the component restarts, so the mutation goes undetected. For example, in this GitHub Actions run, only one e2e test failed while the other two passed, even though the logs clearly show that the controller-manager restarted.

@XiShanYongYe-Chang
Member Author

> The effect of the environment variable KUBE_CACHE_MUTATION_DETECTOR seems to be triggering a pod panic when a cache mutation is detected. However, since many of our e2e tests include failure retries, it's possible for the test to pass after the component restarts, thereby bypassing the mutation detection. For example, in this GitHub Actions run, only one e2e test failed while the other two passed, even though the controller-manager clearly restarted based on the logs.

Your question is excellent, and it's something I've been pondering as well.

I have two thoughts:

  1. The fact that E2E runs without issues doesn't necessarily mean that the karmada-controller-manager component hasn't restarted;
  2. A restart of the karmada-controller-manager component doesn't necessarily indicate that it was caused by a panic.

For the first point, perhaps we could introduce a check for component restarts, but as for the second point, I don't have any ideas at the moment.
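
The restart check mentioned for the first point could be a post-run step that sums container restart counts. A minimal sketch, assuming the controller-manager pods carry the label `app=karmada-controller-manager` in the `karmada-system` namespace (the label is an assumption, not confirmed in this thread); `sum_restart_counts` is a hypothetical helper.

```shell
# sum_restart_counts (hypothetical helper): add up whitespace-separated
# restart counts read from stdin and print the total (0 for empty input).
sum_restart_counts() {
  awk '{ for (i = 1; i <= NF; i++) total += $i } END { print total + 0 }'
}

# Example usage after the e2e run (not run here; the label selector is assumed):
#   restarts="$(kubectl --namespace karmada-system get pods \
#     --selector app=karmada-controller-manager \
#     -o jsonpath='{.items[*].status.containerStatuses[*].restartCount}' | sum_restart_counts)"
#   [ "${restarts}" -eq 0 ] || echo "karmada-controller-manager restarted ${restarts} time(s)"
```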

@zhzhuang-zju
Contributor

> For the first point, perhaps we could introduce a check for component restarts, but as for the second point, I don't have any ideas at the moment.

When a pod restarts, some indicators can be used to identify the reason for the restart, for example the Last State of the container.

When a pod restarts due to a panic, it has the following characteristics:

  • The Restart Count of the container is greater than 0
  • The Reason of the Last State is "Error"
  • The Exit Code of the Last State is 2

Although these characteristics are not unique to a Go panic, they distinguish it from an OOM kill and other types of restarts, reducing the risk of false positives.

$ kubectl describe pods --namespace karmada-system karmada-controller-manager-7b74766c6f-qlw72
Containers:
  karmada-controller-manager:
    Last State:     Terminated
      Reason:       Error
      Exit Code:    2
      Started:      Thu, 17 Jul 2025 16:36:54 +0800
      Finished:     Thu, 17 Jul 2025 16:38:15 +0800
    Ready:          True
    Restart Count:  1
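
The three characteristics listed above can be turned into a simple check. A minimal sketch: `looks_like_panic_restart` and `check_pod` are hypothetical helper names, the jsonpath assumes the container of interest is the first one in the pod, and, as noted above, a match is a hint rather than proof of a Go panic.

```shell
# looks_like_panic_restart RESTART_COUNT REASON EXIT_CODE (hypothetical helper):
# succeeds only when all three characteristics described above are present.
looks_like_panic_restart() {
  [ "${1:-0}" -gt 0 ] && [ "${2:-}" = "Error" ] && [ "${3:-}" = "2" ]
}

# check_pod POD (hypothetical helper): read the three fields from the pod status
# and classify them. Assumes the first container is karmada-controller-manager.
check_pod() {
  status="$(kubectl --namespace karmada-system get pod "$1" -o jsonpath='{.status.containerStatuses[0].restartCount} {.status.containerStatuses[0].lastState.terminated.reason} {.status.containerStatuses[0].lastState.terminated.exitCode}')"
  # Intentionally unquoted: split the space-separated fields into $1 $2 $3.
  set -- ${status}
  looks_like_panic_restart "${1:-0}" "${2:-}" "${3:-}"
}

# Example usage (not run here):
#   check_pod karmada-controller-manager-7b74766c6f-qlw72 && echo "restart looks like a panic"
```

Keeping the classification separate from the `kubectl` call makes the rule easy to adjust if further restart reasons need to be told apart.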
