
[BUG] Distributed cache mode causes Permission.Check to hang and timeout in Kubernetes #2559

Describe the bug
When running Permify on Kubernetes with multiple replicas and the distributed cache enabled, Permission.Check calls hang for ~4 s and then fail with DeadlineExceeded or Canceled. With the distributed cache disabled, everything works fine. The logs show repeated request-forwarding messages that look like a forwarding loop until the call times out.

To Reproduce
Steps to reproduce the behavior:
1. Deploy Permify (tested with ghcr.io/permify/permify:v1.4.5 and latest) on Kubernetes with 2 replicas.
2. Backend database: Postgres.
3. Configure distributed cache with:
```
PERMIFY_DISTRIBUTED_ENABLED: "true"
PERMIFY_DISTRIBUTED_ADDRESS: "kubernetes:///<service-name>.<namespace>.svc.cluster.local"
PERMIFY_DISTRIBUTED_PORT: "5000"
```
4. Perform any action that requires Permify; each one produces a Call Permission.Check log entry.
5. The call hangs for ~4 s and then fails with DeadlineExceeded (a minimal client sketch that reproduces the call follows these steps).
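
For reference, a minimal Go sketch of the call that hangs, assuming the permify-go gRPC client; the endpoint, tenant, entity, and subject values are hypothetical placeholders:

```go
package main

import (
	"context"
	"log"
	"time"

	v1 "github.com/Permify/permify-go/generated/base/v1"
	permify "github.com/Permify/permify-go/grpc"
	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

func main() {
	// Hypothetical in-cluster endpoint; 3478 is Permify's gRPC port.
	client, err := permify.NewClient(
		permify.Config{Endpoint: "permify.default.svc.cluster.local:3478"},
		grpc.WithTransportCredentials(insecure.NewCredentials()),
	)
	if err != nil {
		log.Fatalf("connect: %v", err)
	}

	// ~4 s deadline, matching grpc.request.deadline in the log below.
	ctx, cancel := context.WithTimeout(context.Background(), 4*time.Second)
	defer cancel()

	// Hypothetical entity/subject; any schema reproduces the hang.
	resp, err := client.Permission.Check(ctx, &v1.PermissionCheckRequest{
		TenantId:   "t1",
		Metadata:   &v1.PermissionCheckRequestMetadata{Depth: 20},
		Entity:     &v1.Entity{Type: "document", Id: "1"},
		Permission: "view",
		Subject:    &v1.Subject{Type: "user", Id: "1"},
	})
	if err != nil {
		// With the distributed cache enabled, this is where the ~4 s hang
		// ends in DeadlineExceeded or Canceled.
		log.Fatalf("check failed: %v", err)
	}
	log.Printf("can: %s", resp.GetCan())
}
```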

Log example:

```
time=2025-10-15T12:57:40.125Z level=ERROR msg="rpc error: code = Canceled desc = context canceled"
time=2025-10-15T12:57:40.126Z level=DEBUG msg="A context-related error occurred" error="context canceled"
time=2025-10-15T12:57:40.126Z level=ERROR msg=ERROR_CODE_CANCELLED
time=2025-10-15T12:57:40.126Z level=ERROR msg="finished call" protocol=grpc grpc.component=server grpc.service=base.v1.Permission grpc.method=Check grpc.method_type=unary peer.address=<pod-ip>:<port> grpc.start_time=2025-10-15T12:57:40Z grpc.request.deadline=2025-10-15T12:57:44Z grpc.code=Internal grpc.error="rpc error: code = Internal desc = ERROR_CODE_CANCELLED" grpc.time_ms=110.273
```

Example Application
private repo

Expected behavior
Distributed cache should sync correctly across pods, and Permission.Check should return without hanging or timing out.

Additional context

  • Running with a single replica (distributed cache disabled) works fine, with no errors. The problem appears only with multiple replicas plus the distributed cache.
  • Health checks: /healthz returns {"status":"SERVING"} for each pod.
  • Networking: Verified with netstat and nc that ports 3476, 3478, and 5000 are listening and open between pods (cross-pod connectivity confirmed).
  • RBAC: The service account has get/list/watch access for Services, Endpoints, and EndpointSlices.
  • Service: Using a ClusterIP service exposing ports 3476, 3478, 5000. Selector is correct and routes to all pods.
  • Config variations tested: PERMIFY_DISTRIBUTED_ADDRESS with and without an explicit port.
    Tried both kubernetes:///<service-name>.<namespace>.svc.cluster.local and kubernetes:///<service-name>.<namespace>:5000 (see the resolver sketch after this list).
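
For context on why the RBAC and address-format points above matter: kubernetes:/// is not a built-in gRPC resolver scheme; it is typically provided by a client-side resolver such as sercand/kuberesolver, which watches the Service's Endpoints through the API server (hence the get/list/watch permissions). Whether Permify uses that exact package is an assumption; this is a sketch of how such an address resolves, with hypothetical service and namespace names:

```go
package main

import (
	"context"
	"log"
	"time"

	"github.com/sercand/kuberesolver/v5"
	"google.golang.org/grpc"
	_ "google.golang.org/grpc/balancer/roundrobin" // registers round_robin
	"google.golang.org/grpc/credentials/insecure"
)

func main() {
	// Registers the "kubernetes" scheme; endpoint discovery goes through
	// the API server, which is why the RBAC rules above are required.
	kuberesolver.RegisterInCluster()

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// Target mirrors PERMIFY_DISTRIBUTED_ADDRESS; names are hypothetical.
	conn, err := grpc.DialContext(ctx,
		"kubernetes:///permify.default.svc.cluster.local:5000",
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithDefaultServiceConfig(`{"loadBalancingPolicy":"round_robin"}`),
		grpc.WithBlock(), // fail here if resolution or connectivity is broken
	)
	if err != nil {
		log.Fatalf("dial: %v", err)
	}
	defer conn.Close()
	log.Printf("connected, state: %s", conn.GetState())
}
```

If this dials successfully from inside a pod, name resolution and RBAC are likely fine, which would point the forwarding loop at a later stage.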

Environment (please complete the following information, because it helps us investigate better):

  • OS: Kubernetes (GKE - 1.33.4)
  • Version: 1.4.5
  • Database: Postgres
