Description
Describe the bug
When running Permify on Kubernetes with multiple replicas and distributed cache enabled, Permission.Check calls hang for ~4s and then fail with DeadlineExceeded or Canceled. With distributed cache disabled, everything works fine. Logs show repeated request forwarding messages that look like a forwarding loop until the call times out.
To Reproduce
Steps to reproduce the behavior:
1. Deploy Permify (tested with ghcr.io/permify/permify:v1.4.5 and latest) on Kubernetes with 2 replicas.
2. Backend database: Postgres.
3. Configure distributed cache with:
PERMIFY_DISTRIBUTED_ENABLED: "true"
PERMIFY_DISTRIBUTED_ADDRESS: "kubernetes:///<service-name>.<namespace>.svc.cluster.local"
PERMIFY_DISTRIBUTED_PORT: "5000"
4. Whenever I perform an action that requires Permify, the log shows the entry Call Permission.Check.
5. The call hangs for ~4s and fails with DeadlineExceeded.
log example:
time=2025-10-15T12:57:40.125Z level=ERROR msg="rpc error: code = Canceled desc = context canceled"
time=2025-10-15T12:57:40.126Z level=DEBUG msg="A context-related error occurred" error="context canceled"
time=2025-10-15T12:57:40.126Z level=ERROR msg=ERROR_CODE_CANCELLED
time=2025-10-15T12:57:40.126Z level=ERROR msg="finished call" protocol=grpc grpc.component=server grpc.service=base.v1.Permission grpc.method=Check grpc.method_type=unary peer.address=<pod-ip>:<port> grpc.start_time=2025-10-15T12:57:40Z grpc.request.deadline=2025-10-15T12:57:44Z grpc.code=Internal grpc.error="rpc error: code = Internal desc = ERROR_CODE_CANCELLED" grpc.time_ms=110.273
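For reference, a minimal sketch of how the distributed-cache variables from step 3 are wired into the Deployment. This is illustrative only: the Deployment name, labels, and namespace are placeholders, not the exact manifest used.

```yaml
# Illustrative excerpt, not the exact manifest: shows the 2-replica
# Deployment with the PERMIFY_DISTRIBUTED_* variables from step 3.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: permify
spec:
  replicas: 2
  selector:
    matchLabels:
      app: permify
  template:
    metadata:
      labels:
        app: permify
    spec:
      containers:
        - name: permify
          image: ghcr.io/permify/permify:v1.4.5
          env:
            - name: PERMIFY_DISTRIBUTED_ENABLED
              value: "true"
            - name: PERMIFY_DISTRIBUTED_ADDRESS
              value: "kubernetes:///<service-name>.<namespace>.svc.cluster.local"
            - name: PERMIFY_DISTRIBUTED_PORT
              value: "5000"
```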
Example Application
private repo
Expected behavior
Distributed cache should sync correctly across pods, and Permission.Check should return without hanging or timing out.
Additional context
- Running with a single replica (distributed cache disabled) works fine with no errors. The problem only appears with multiple replicas plus distributed cache.
- Health checks: /healthz returns {"status":"SERVING"} for each pod.
- Networking: Verified with netstat and nc that ports 3476, 3478, and 5000 are listening and open between pods (cross-pod connectivity confirmed).
- RBAC: The service account has get/list/watch access for Services, Endpoints, and EndpointSlices.
- Service: Using a ClusterIP service exposing ports 3476, 3478, 5000. Selector is correct and routes to all pods.
- Config variations tested: PERMIFY_DISTRIBUTED_ADDRESS with and without the port; tried both kubernetes:///..svc.cluster.local and kubernetes:///.:5000.
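For completeness, an illustrative sketch of the Service and RBAC objects described above. Names, port names, and the namespace are placeholders; the actual manifests in the cluster may differ.

```yaml
# Illustrative only: ClusterIP Service exposing ports 3476/3478/5000,
# plus a Role granting the get/list/watch access mentioned above for
# Services, Endpoints, and EndpointSlices. Placeholders, not the real manifests.
apiVersion: v1
kind: Service
metadata:
  name: permify
  namespace: <namespace>
spec:
  type: ClusterIP
  selector:
    app: permify
  ports:
    - name: http
      port: 3476
    - name: grpc
      port: 3478
    - name: distributed
      port: 5000
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: permify-endpoint-reader
  namespace: <namespace>
rules:
  - apiGroups: [""]
    resources: ["services", "endpoints"]
    verbs: ["get", "list", "watch"]
  - apiGroups: ["discovery.k8s.io"]
    resources: ["endpointslices"]
    verbs: ["get", "list", "watch"]
```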
Environment (please complete the following information, because it helps us investigate better):
- OS: Kubernetes (GKE - 1.33.4)
- Version: 1.4.5
- Database: Postgres