
[BUG] Distributed cache mode causes Permission.Check to hang and timeout in Kubernetes #2559

Describe the bug
When running Permify on Kubernetes with multiple replicas and the distributed cache enabled, Permission.Check calls hang for ~4 s and then fail with DeadlineExceeded or Canceled. With the distributed cache disabled, everything works fine. The logs show repeated request-forwarding messages that look like a forwarding loop until the call times out.

To Reproduce
Steps to reproduce the behavior:
1. Deploy Permify (tested with ghcr.io/permify/permify:v1.4.5 and latest) on Kubernetes with 2 replicas.
2. Backend database: Postgres.
3. Configure distributed cache with:
```
PERMIFY_DISTRIBUTED_ENABLED: "true"
PERMIFY_DISTRIBUTED_ADDRESS: "kubernetes:///<service-name>.<namespace>.svc.cluster.local"
PERMIFY_DISTRIBUTED_PORT: "5000"
```
4. Perform any action that requires Permify; each one produces a Call Permission.Check log entry.
5. The call hangs for ~4 s and then fails with DeadlineExceeded (a minimal client sketch that reproduces the call follows these steps).
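
For reference, a minimal Go sketch of the call that hangs, assuming the permify-go gRPC client; the endpoint, tenant, entity, and subject values are hypothetical placeholders:

```go
package main

import (
	"context"
	"log"
	"time"

	v1 "github.com/Permify/permify-go/generated/base/v1"
	permify "github.com/Permify/permify-go/grpc"
	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

func main() {
	// Hypothetical in-cluster endpoint; 3478 is Permify's gRPC port.
	client, err := permify.NewClient(
		permify.Config{Endpoint: "permify.default.svc.cluster.local:3478"},
		grpc.WithTransportCredentials(insecure.NewCredentials()),
	)
	if err != nil {
		log.Fatalf("connect: %v", err)
	}

	// ~4 s deadline, matching grpc.request.deadline in the log below.
	ctx, cancel := context.WithTimeout(context.Background(), 4*time.Second)
	defer cancel()

	// Hypothetical entity/subject; any schema reproduces the hang.
	resp, err := client.Permission.Check(ctx, &v1.PermissionCheckRequest{
		TenantId:   "t1",
		Metadata:   &v1.PermissionCheckRequestMetadata{Depth: 20},
		Entity:     &v1.Entity{Type: "document", Id: "1"},
		Permission: "view",
		Subject:    &v1.Subject{Type: "user", Id: "1"},
	})
	if err != nil {
		// With the distributed cache enabled, this is where the ~4 s hang
		// ends in DeadlineExceeded or Canceled.
		log.Fatalf("check failed: %v", err)
	}
	log.Printf("can: %s", resp.GetCan())
}
```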

Log example:

```
time=2025-10-15T12:57:40.125Z level=ERROR msg="rpc error: code = Canceled desc = context canceled"
time=2025-10-15T12:57:40.126Z level=DEBUG msg="A context-related error occurred" error="context canceled"
time=2025-10-15T12:57:40.126Z level=ERROR msg=ERROR_CODE_CANCELLED
time=2025-10-15T12:57:40.126Z level=ERROR msg="finished call" protocol=grpc grpc.component=server grpc.service=base.v1.Permission grpc.method=Check grpc.method_type=unary peer.address=<pod-ip>:<port> grpc.start_time=2025-10-15T12:57:40Z grpc.request.deadline=2025-10-15T12:57:44Z grpc.code=Internal grpc.error="rpc error: code = Internal desc = ERROR_CODE_CANCELLED" grpc.time_ms=110.273
```

Example Application
private repo

Expected behavior
Distributed cache should sync correctly across pods, and Permission.Check should return without hanging or timing out.

Additional context

  • Running with a single replica (distributed cache disabled) works fine, with no errors. The problem appears only with multiple replicas plus the distributed cache.
  • Health checks: /healthz returns {"status":"SERVING"} for each pod.
  • Networking: Verified with netstat and nc that ports 3476, 3478, and 5000 are listening and open between pods (cross-pod connectivity confirmed).
  • RBAC: The service account has get/list/watch access for Services, Endpoints, and EndpointSlices.
  • Service: Using a ClusterIP service exposing ports 3476, 3478, 5000. Selector is correct and routes to all pods.
  • Config variations tested: PERMIFY_DISTRIBUTED_ADDRESS with and without an explicit port.
    Tried both kubernetes:///<service-name>.<namespace>.svc.cluster.local and kubernetes:///<service-name>.<namespace>:5000 (see the resolver sketch after this list).
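
For context on why the RBAC and address-format points above matter: kubernetes:/// is not a built-in gRPC resolver scheme; it is typically provided by a client-side resolver such as sercand/kuberesolver, which watches the Service's Endpoints through the API server (hence the get/list/watch permissions). Whether Permify uses that exact package is an assumption; this is a sketch of how such an address resolves, with hypothetical service and namespace names:

```go
package main

import (
	"context"
	"log"
	"time"

	"github.com/sercand/kuberesolver/v5"
	"google.golang.org/grpc"
	_ "google.golang.org/grpc/balancer/roundrobin" // registers round_robin
	"google.golang.org/grpc/credentials/insecure"
)

func main() {
	// Registers the "kubernetes" scheme; endpoint discovery goes through
	// the API server, which is why the RBAC rules above are required.
	kuberesolver.RegisterInCluster()

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// Target mirrors PERMIFY_DISTRIBUTED_ADDRESS; names are hypothetical.
	conn, err := grpc.DialContext(ctx,
		"kubernetes:///permify.default.svc.cluster.local:5000",
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithDefaultServiceConfig(`{"loadBalancingPolicy":"round_robin"}`),
		grpc.WithBlock(), // fail here if resolution or connectivity is broken
	)
	if err != nil {
		log.Fatalf("dial: %v", err)
	}
	defer conn.Close()
	log.Printf("connected, state: %s", conn.GetState())
}
```

If this dials successfully from inside a pod, name resolution and RBAC are likely fine, which would point the forwarding loop at a later stage.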

Environment (please complete the following information, because it helps us investigate better):

  • OS: Kubernetes (GKE - 1.33.4)
  • Version: 1.4.5
  • Database: Postgres
