`docs/en/administration/going-production.md` (+47 −1)
@@ -13,7 +13,7 @@ Best practices and recommended settings when going production.
* The `--writeback` option is strongly advised against, as it can easily cause data loss when not properly managed, especially inside containers. See ["Write Cache in Client (Community Edition)"](/docs/community/guide/cache#client-write-cache) and ["Write Cache in Client (Cloud Service)"](/docs/cloud/guide/cache#client-write-cache);
* When cluster resources are limited, use the techniques in [Resource Optimization](../guide/resource-optimization.md#mount-pod-resources);
* It's recommended to set a non-preempting PriorityClass for Mount Pod, see [documentation](../guide/resource-optimization.md#set-non-preempting-priorityclass-for-mount-pod) for details.
- * It's recommended to set PodDisruptionBudget for Mount Pod, see [documentation](../guide/resource-optimization.md#set-poddisruptionbudget-for-mount-pod) for details.
+ * Best practices for reducing node capacity, see [documentation](#scale-down-node).
## Sidecar recommendations {#sidecar}
@@ -455,3 +455,49 @@ spec:
```yaml
fsGroup: 2000
fsGroupChangePolicy: "OnRootMismatch"
```
## Scale Down {#scale-down-node}
The cluster manager may need to drain a node for maintenance or upgrades, or rely on [cluster autoscaling tools](https://kubernetes.io/docs/concepts/cluster-administration/node-autoscaling) to scale the cluster automatically.
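For reference, a manual drain is typically initiated as below; the node name is a placeholder:

```shell
# Cordon the node and evict its Pods (DaemonSet Pods are skipped)
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
```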
When a node is drained, Kubernetes evicts all Pods on it, including Mount Pods. However, if a Mount Pod is evicted prematurely, the remaining application Pods will hit errors when accessing the JuiceFS PV. Moreover, since the Mount Pod is still referenced by application Pods, CSI Node will re-create it, leading to a restart loop in which all JuiceFS file system requests fail.
To avoid this, read the sections below.
### Use PodDisruptionBudget {#pdb}
Set a [PodDisruptionBudget](https://kubernetes.io/docs/tasks/run-application/configure-pdb) for the Mount Pod. The PDB ensures that Mount Pods are protected during a node drain until all application Pods referencing them are evicted, so application Pods keep normal access to the JuiceFS PV throughout the drain. As an example:
```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: jfs-pdb
  namespace: kube-system  # The namespace where JuiceFS CSI is installed
spec:
  minAvailable: "100%"  # Protect Mount Pods during a node drain
  selector:
    matchLabels:
      app.kubernetes.io/name: juicefs-mount
```
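A quick way to confirm the protection is in place; the manifest file name is assumed, and `ALLOWED DISRUPTIONS` should read `0` while Mount Pods are running:

```shell
# Apply the PDB and check its status
kubectl apply -f jfs-pdb.yaml
kubectl -n kube-system get pdb jfs-pdb
```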
:::note Compatibility
Different service providers make their own modifications to Kubernetes, some of which break PDB. If this is the case, refer to the next section and use a validating webhook to protect the Mount Pod.
:::
### Use validating webhook {#validating-webhook}
In certain Kubernetes environments, PDB does not work as expected (e.g. [Karpenter](https://github.com/aws/karpenter-provider-aws/issues/7853)): once a PDB is created, scale-down no longer works properly.
To prevent this, use our validating webhook instead. When the CSI Driver detects that a Mount Pod being evicted is still in use, it simply rejects the eviction; the autoscaling tool enters a retry loop until the Mount Pod is successfully deleted by CSI Node. To enable this feature, refer to the following Helm configuration:
:::note
This feature requires at least JuiceFS CSI Driver v0.27.1.
:::
```yaml
validatingWebhook:
  enabled: true
```
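For instance, assuming the CSI Driver was installed from the official Helm chart under the release name `juicefs-csi-driver` (both the release name and chart reference below are assumptions, adjust to your installation), the webhook could be enabled in place:

```shell
# Enable the validating webhook on an existing Helm release
helm upgrade juicefs-csi-driver juicefs/juicefs-csi-driver \
  -n kube-system \
  --reuse-values \
  --set validatingWebhook.enabled=true
```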
When using the [Cluster Autoscaler](https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler), if a node cannot be scaled down due to Mount Pods, it might be because the Cluster Autoscaler cannot evict [not-replicated Pods](https://github.com/kubernetes/autoscaler/issues/351), preventing normal scale-down. In this case, try adding the `cluster-autoscaler.kubernetes.io/safe-to-evict: "true"` annotation to the Mount Pods while utilizing the aforementioned webhook, as sketched below.
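As a sketch, the annotation can be applied to currently running Mount Pods with `kubectl annotate`, reusing the label selector from the PDB example above; note that newly created Mount Pods would need the annotation applied again:

```shell
# Mark existing Mount Pods as safe to evict for the Cluster Autoscaler;
# the validating webhook still rejects evictions while a Mount Pod is in use
kubectl -n kube-system annotate pod \
  -l app.kubernetes.io/name=juicefs-mount \
  cluster-autoscaler.kubernetes.io/safe-to-evict="true"
```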
`docs/en/guide/resource-optimization.md` (−19)
@@ -243,25 +243,6 @@ However, when the Mount Pod is created, if the node resources are insufficient,
```shell
kubectl -n kube-system set env -c juicefs-plugin statefulset/juicefs-csi-controller JUICEFS_MOUNT_PRIORITY_NAME=juicefs-mount-priority-nonpreempting JUICEFS_MOUNT_PREEMPTION_POLICY=Never
```
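The environment variables above reference a PriorityClass by name, which must already exist in the cluster. A minimal sketch of such a non-preempting PriorityClass (the `value` is an assumption, tune it to your cluster):

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: juicefs-mount-priority-nonpreempting
value: 1000000000        # High priority, so Mount Pods are scheduled early
preemptionPolicy: Never  # Never preempt (evict) other Pods for resources
globalDefault: false
description: "Non-preempting PriorityClass for JuiceFS Mount Pods"
```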
## Set PodDisruptionBudget for Mount Pod {#set-poddisruptionbudget-for-mount-pod}

The cluster manager may need to drain a node for maintenance or upgrades. When a node is drained, Kubernetes evicts all Pods on it, including Mount Pods. However, a Mount Pod's eviction leaves application Pods unable to use the JuiceFS PV; moreover, the Mount Pod is re-created when CSI Node detects that it is still used by application Pods, leading to a delete-recreate loop.

To avoid this, set a [PodDisruptionBudget](https://kubernetes.io/docs/tasks/run-application/configure-pdb) for the Mount Pod. The PodDisruptionBudget ensures that the Mount Pod is not evicted during the drain until the related application Pods are evicted, after which CSI Node deletes it. This preserves application Pods' access to the JuiceFS PV during the drain, avoids the delete-recreate loop, and does not affect the drain operation. Here is an example:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: jfs-pdb
  namespace: kube-system  # The namespace where JuiceFS CSI is located
spec:
  minAvailable: "100%"  # Avoid all Mount Pods being evicted during a node drain
  selector:
    matchLabels:
      app.kubernetes.io/name: juicefs-mount
```
## Share Mount Pod for the same StorageClass {#share-mount-pod-for-the-same-storageclass}
By default, a Mount Pod is shared only when multiple application Pods use the same PV. You can take this a step further and share one Mount Pod (on the same node, of course) across all PVs created from the same StorageClass. Under this policy, different application Pods bind different paths under the host mount point, so one Mount Pod serves multiple application Pods, as sketched below.
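In the JuiceFS CSI Driver this policy is toggled via an environment variable on the CSI controller, mirroring the `set env` command earlier; the variable name `STORAGE_CLASS_SHARE_MOUNT` should be verified against your CSI Driver version:

```shell
# Let all PVs from the same StorageClass share one Mount Pod per node
kubectl -n kube-system set env -c juicefs-plugin \
  statefulset/juicefs-csi-controller STORAGE_CLASS_SHARE_MOUNT=true
```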
`docs/zh_cn/administration/going-production.md`

* Consider setting a non-preempting PriorityClass for the Mount Pod, so that when resources are tight the Mount Pod does not evict application containers. See [documentation](../guide/resource-optimization.md#set-non-preempting-priorityclass-for-mount-pod);
- * Consider setting a PodDisruptionBudget for the Mount Pod to avoid Mount Pods being evicted during a node drain. See [documentation](../guide/resource-optimization.md#set-poddisruptionbudget-for-mount-pod).
+ * Best practices for reducing node capacity, see [documentation](#scale-down-node).
## Scale Down {#scale-down-node}

The cluster manager may need to drain a node for maintenance or upgrades, or rely on [cluster autoscaling tools](https://kubernetes.io/docs/concepts/cluster-administration/node-autoscaling) to scale the cluster automatically.

When a node is drained, Kubernetes evicts all Pods on it, including Mount Pods. If a Mount Pod is evicted before the application Pods, those Pods can no longer access the JuiceFS PV; and when CSI Node detects that the Mount Pod exited unexpectedly while application Pods still use it, it pulls the Mount Pod up again, trapping it in a delete-recreate loop, blocking node scale-down, and causing errors whenever application Pods access the JuiceFS PV.
To avoid these problems during scale-down, read the sections below.
### Set PodDisruptionBudget {#pdb}
You can set a [PodDisruptionBudget](https://kubernetes.io/docs/tasks/run-application/configure-pdb) for the Mount Pod. The PDB guarantees that during a node drain the Mount Pod is not evicted until its corresponding application Pods are evicted, at which point CSI Node deletes it. This preserves application Pods' access to the JuiceFS PV during the drain, avoids the Mount Pod delete-recreate loop, and does not interfere with the drain itself. For example:
```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: jfs-pdb
  namespace: kube-system  # The namespace where JuiceFS CSI is installed
spec:
  minAvailable: "100%"  # Avoid all Mount Pods being evicted during a node drain
  selector:
    matchLabels:
      app.kubernetes.io/name: juicefs-mount
```
:::note Compatibility
Different service providers adapt and modify Kubernetes in their own ways, so PDB may not work as expected. If that happens, refer to the next section and use a validating webhook to ensure Mount Pods are not evicted prematurely during a node drain.
:::

### Use validating webhook {#validating-webhook}

In certain Kubernetes environments, PDB does not work as expected (e.g. [Karpenter](https://github.com/aws/karpenter-provider-aws/issues/7853)): once a PDB is created, scale-down no longer works properly.

In this situation, do not use PDB; instead, enable the validating webhook for the CSI Driver. When the CSI Driver detects that a Mount Pod being evicted is still used by application Pods, it rejects the eviction request. The autoscaling tool keeps retrying until the Mount Pod's reference count drops to zero and it is released normally. An example of enabling it via Helm:
:::note
This feature requires JuiceFS CSI Driver v0.27.1 or later.
:::
```yaml
validatingWebhook:
  enabled: true
```
When using the [Cluster Autoscaler](https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler), if a node with Mount Pods cannot be scaled down, it may be because the Cluster Autoscaler cannot evict [not-replicated Pods](https://github.com/kubernetes/autoscaler/issues/351), preventing normal scale-down. In this case, try adding the `cluster-autoscaler.kubernetes.io/safe-to-evict: "true"` annotation to the Mount Pods, combined with the webhook above, to achieve a normal scale-down.
`docs/zh_cn/guide/resource-optimization.md` (−19)
@@ -243,25 +243,6 @@ When creating a Mount Pod, CSI Node by default sets its PriorityClass to `syst
```shell
kubectl -n kube-system set env -c juicefs-plugin statefulset/juicefs-csi-controller JUICEFS_MOUNT_PRIORITY_NAME=juicefs-mount-priority-nonpreempting JUICEFS_MOUNT_PREEMPTION_POLICY=Never
```
## Set PodDisruptionBudget for Mount Pod {#set-poddisruptionbudget-for-mount-pod}

The cluster manager sometimes drains a node for maintenance or upgrades. When a node is drained, Kubernetes evicts all Pods on it, including Mount Pods. However, a Mount Pod's eviction may leave application Pods unable to access the JuiceFS PV, and when CSI Node detects that the evicted Mount Pod is still used by application Pods, it pulls the Pod up again, trapping the Mount Pod in a delete-recreate loop.

To avoid this, you can set a [PodDisruptionBudget](https://kubernetes.io/docs/tasks/run-application/configure-pdb) for the Mount Pod. The PDB guarantees that during a node drain the Mount Pod is not evicted until its corresponding application Pods are evicted, at which point CSI Node deletes it. This preserves application Pods' access to the JuiceFS PV during the drain, avoids the Mount Pod delete-recreate loop, and does not interfere with the drain itself. For example:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: jfs-pdb
  namespace: kube-system  # The namespace where JuiceFS CSI is installed
spec:
  minAvailable: "100%"  # Avoid all Mount Pods being evicted during a node drain
  selector:
    matchLabels:
      app.kubernetes.io/name: juicefs-mount
```
## Share Mount Pod for the same StorageClass {#share-mount-pod-for-the-same-storageclass}
By default, a Mount Pod is shared only when multiple application Pods use the same PV. To cut overhead further, you can share Mount Pods more aggressively, letting all PVs created from the same StorageClass reuse one Mount Pod (sharing, naturally, only happens within the same node). Different application Pods bind different paths under the shared mount point, so one Mount Pod serves multiple application containers.
0 commit comments