Overview of the Issue
Consul clients resolve DNS names given in retry_join only once at agent startup and then cache the resulting IPs indefinitely. When Consul server pods (in Kubernetes) are restarted and receive new IP addresses (e.g. during node upgrades, StatefulSet restarts, evictions, autoscaling), clients — including newly restarted client pods — keep attempting to join the cluster using stale, previously resolved server IPs. They never trigger a fresh DNS lookup on subsequent retry attempts.
This results in clients being permanently unable to join until they are restarted at a point when DNS also returns the updated IPs (or until manual intervention). In Kubernetes environments where pod/server IP churn is normal, this makes Consul fragile during routine operations.
Root cause: The retry_join logic resolves DNS hostnames once and does not re-resolve them on repeated join retries or on connection failures. DNS TTLs are not honored, and there is no periodic refresh of hostname-based seeds.
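The following Go sketch is purely illustrative (it is not Consul's code; every identifier in it is invented for this report). It contrasts the observed resolve-once behavior with a retry loop that performs a fresh lookup on every attempt, which is what the suggestions later in this report amount to:

package retryjoin

import (
    "context"
    "log"
    "net"
    "time"
)

// resolveSeeds does a fresh DNS lookup for every configured hostname and
// returns whatever currently resolves.
func resolveSeeds(ctx context.Context, hosts []string) []string {
    var addrs []string
    for _, h := range hosts {
        ips, err := net.DefaultResolver.LookupIPAddr(ctx, h)
        if err != nil {
            continue // skip names that do not resolve right now
        }
        for _, ip := range ips {
            addrs = append(addrs, ip.IP.String())
        }
    }
    return addrs
}

// retryJoin re-resolves the hostnames on every attempt, so new server pod IPs
// are picked up as soon as DNS serves them. The behavior observed here instead
// resembles calling resolveSeeds once before the loop and reusing the result.
func retryJoin(ctx context.Context, hosts []string, join func([]string) error) error {
    ticker := time.NewTicker(30 * time.Second)
    defer ticker.Stop()
    for {
        if addrs := resolveSeeds(ctx, hosts); len(addrs) > 0 {
            if err := join(addrs); err == nil {
                return nil
            }
        }
        log.Println("join failed, retrying with fresh DNS results")
        select {
        case <-ctx.Done():
            return ctx.Err()
        case <-ticker.C:
        }
    }
}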
Reproduction Steps
Environment: Consul deployed via official Helm chart (consul-k8s / chart version 1.5.5) on GKE (Kubernetes v1.28.x).
- Deploy Consul with servers using explicit pod DNS names in retry_join:

global:
  name: consul
  datacenter: dc1
  domain: consul.example.com
  enabled: true
  logLevel: debug
server:
  enabled: true
  replicas: 3
  bootstrapExpect: 3
  extraConfig: |
    {
      "retry_join": [
        "consul-server-0.consul.example.com.",
        "consul-server-1.consul.example.com.",
        "consul-server-2.consul.example.com."
      ]
    }
client:
  enabled: true
- Wait for the cluster to become healthy. Initial DNS resolutions (example):
- consul-server-0 → 10.1.2.10
- consul-server-1 → 10.1.2.11
- consul-server-2 → 10.1.2.12
- Restart the server StatefulSet (simulating node upgrade / maintenance):
kubectl rollout restart statefulset/consul-server -n consul
- New server pod IPs appear (example):
- consul-server-0 → 10.1.5.20
- consul-server-1 → 10.1.5.21
- consul-server-2 → 10.1.5.22
- Restart client pods:
kubectl delete pods -l app=consul,component=client -n consul
- From a freshly restarted client pod, verify DNS now correctly returns the new IPs:
kubectl exec -it consul-client-xyz -n consul -- nslookup consul-server-0.consul.example.com. # Returns: 10.1.5.20
- Observe the Consul client logs (commands for doing so follow this list): the client continues attempting to join using the stale IPs (10.1.2.x) and never re-resolves the hostnames on retry.
- Client remains stuck (e.g. init container never completes) indefinitely.
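The log-observation step above can be done with standard commands; the pod name consul-client-xyz and the addresses in the comments are the example values used throughout this report:
# What the stuck client resolved at startup (stale 10.1.2.x addresses)
kubectl logs consul-client-xyz -n consul | grep 'Resolved consul-server'
# What cluster DNS currently returns (new 10.1.5.x addresses)
kubectl exec consul-client-xyz -n consul -- nslookup consul-server-0.consul.example.com.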
Actual Behavior
- Client agent resolves retry_join hostnames once (at its startup) to the old server pod IPs.
- On repeated join retry attempts, it reuses the cached IPs; no subsequent DNS queries occur.
- Connection attempts fail permanently: dial tcp 10.1.2.x:8301: connect: connection refused.
- System does not recover automatically even though DNS is already serving updated A records.
Expected Behavior
- On join retry (especially after connection refusal), hostname seeds should be re-resolved.
- DNS TTLs (or a configurable re-resolution interval) should be honored.
- Clients should eventually discover the new server pod IPs without manual restarts.
- Behavior should be robust under normal Kubernetes pod/network churn.
Impact
This blocks:
- GKE (or other cloud) node auto-upgrades
- Rolling restarts of Consul servers
- Autoscaling events
- Pod evictions / rescheduling
- Disaster recovery procedures
Severity: High for production Kubernetes environments. It can cause widespread Consul client unavailability until manual intervention.
Workarounds Attempted
- Using static IPs instead of hostnames (defeats flexibility; brittle).
- Manually restarting all client pods after server restarts (operationally expensive, race-prone).
- Relying on Kubernetes Services: not used here because pod-specific DNS names were required; even with Services, lack of periodic re-resolution could still be problematic if caching persists internally.
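One further avenue, not tested in this environment and therefore only an assumption: the debug log fragment later in this report lists k8s among the supported retry-join discovery methods, so a provider-based seed that queries the Kubernetes API on each attempt might avoid hostname caching entirely. The namespace and label selector below are guesses based on this deployment, not verified values:
retry_join = [
  "provider=k8s namespace=consul label_selector=\"app=consul,component=server\""
]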
Suggested Solutions
- Add periodic DNS re-resolution for retry_join hostnames (e.g. every N seconds, configurable).
- Re-resolve hostnames immediately after a failed connection to all cached addresses.
- Honor DNS record TTLs where available.
- Provide a configuration knob (e.g. retry_join_dns_refresh_interval); a sketch of what this could look like follows this list.
- (Longer term) Support a pluggable seed provider with dynamic refresh capability for Kubernetes.
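To make the knob suggestion concrete, this is roughly what it could look like in the client agent HCL. The option is hypothetical: it does not exist in Consul today, and the name and value format are only the proposal above.
retry_join = [
  "consul-server-0.consul.example.com.",
  "consul-server-1.consul.example.com.",
  "consul-server-2.consul.example.com."
]
# Hypothetical option, not an existing Consul setting: re-resolve the
# retry_join hostnames at most this often while joining.
retry_join_dns_refresh_interval = "30s"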
Consul info for both Client and Server
(Representative structure below; actual redacted outputs can be supplied if required. Sensitive tokens removed.)
Client info
# Output from `consul info` (redacted example)
agent:
build_date = 2024-xx-xx
consul_version = 1.17.x
datacenter = dc1
node_id = <redacted>
node_name = consul-client-xyz
dns:
recursors = []
runtime:
arch = amd64
os = linux
...
# Client agent HCL config (derived from Helm + defaults; redacted)
datacenter = "dc1"
data_dir = "/consul/data"
retry_join = [
"consul-server-0.consul.example.com.",
"consul-server-1.consul.example.com.",
"consul-server-2.consul.example.com."
]
log_level = "DEBUG"
verify_outgoing = true
verify_server_hostname = true
ca_file = "/consul/tls/ca.pem"
auto_encrypt {
tls = true
}
ports {
grpc = 8502
}
Server info
# Output from `consul info` (redacted example)
agent:
server = true
bootstrap_expect = 3
consul_version = 1.17.x
datacenter = dc1
node_name = consul-server-0
peers = 2
raft:
protocol_version = 3
last_log_index = ...
...
# Server agent HCL config (as rendered)
server = true
bootstrap_expect = 3
datacenter = "dc1"
data_dir = "/consul/data"
retry_join = [
"consul-server-0.consul.example.com.",
"consul-server-1.consul.example.com.",
"consul-server-2.consul.example.com."
]
log_level = "DEBUG"
verify_outgoing = true
verify_server_hostname = true
ca_file = "/consul/tls/ca.pem"
auto_encrypt {
allow_tls = true
}
limits {
http_max_conns_per_client = 1000
rpc_max_conns_per_client = 1000
}
Operating system and Environment details
- Platform: GKE (Google Kubernetes Engine)
- Kubernetes Version: v1.28.x
- Consul Helm chart (consul-k8s): 1.5.5
- Consul image version: (from chart defaults; presumed 1.17.x lineage — can supply exact digest if required)
- CNI: Default GKE networking
- Architecture: amd64
- TLS + ACLs partially enabled (ACL system tokens managed externally; bootstrap token provided)
Log Fragments
Freshly restarted client (after server pod IP changes):
[DEBUG] agent: Starting Consul agent (fresh restart)
[INFO] agent: Consul agent running!
[DEBUG] agent: Retry join is supported for the following discovery methods: cluster_addr, aliyun, aws, azure, digitalocean, gce, k8s, linode, mdns, os, scaleway, triton, vsphere
[INFO] agent: Joining cluster...
[DEBUG] agent: (LAN) joining: [consul-server-0.consul.example.com.:8301 consul-server-1.consul.example.com.:8301 consul-server-2.consul.example.com.:8301]
[DEBUG] agent: Resolved consul-server-0.consul.example.com.:8301 to 10.1.2.10:8301
[DEBUG] agent: Resolved consul-server-1.consul.example.com.:8301 to 10.1.2.11:8301
[DEBUG] agent: Resolved consul-server-2.consul.example.com.:8301 to 10.1.2.12:8301
[ERROR] agent: failed to join: error="dial tcp 10.1.2.10:8301: connect: connection refused" address=10.1.2.10:8301
[ERROR] agent: failed to join: error="dial tcp 10.1.2.11:8301: connect: connection refused" address=10.1.2.11:8301
[ERROR] agent: failed to join: error="dial tcp 10.1.2.12:8301: connect: connection refused" address=10.1.2.12:8301
[WARN] agent: Join failed: error="3 errors occurred:
* dial tcp 10.1.2.10:8301: connection refused
* dial tcp 10.1.2.11:8301: connection refused
* dial tcp 10.1.2.12:8301: connection refused"
[DEBUG] agent: (LAN) joining: [consul-server-0.consul.example.com.:8301 consul-server-1.consul.example.com.:8301 consul-server-2.consul.example.com.:8301]
[ERROR] agent: failed to join: error="dial tcp 10.1.2.10:8301: connect: connection refused" address=10.1.2.10:8301
...
# (Loop repeats; no new DNS resolution lines appear)
Manual DNS lookups (same pod; confirms updated IPs exist and are resolvable):
$ nslookup consul-server-0.consul.example.com.
Address 1: 10.1.5.20
$ nslookup consul-server-1.consul.example.com.
Address 1: 10.1.5.21
$ nslookup consul-server-2.consul.example.com.
Address 1: 10.1.5.22
Additional Context
The absence of dynamic re-resolution makes retry_join fragile for Kubernetes StatefulSet-based servers. In typical operational patterns (pod churn, rolling restarts, auto-upgrades), this leads to cascading client unavailability. Other distributed systems (e.g. etcd, Nomad with retry join via providers) perform periodic re-resolution or have provider-based discovery that refreshes endpoints. Consul’s static caching here appears to be an outlier and operational risk.
Request
Please:
- Confirm whether current behavior is intentional or a bug.
- Advise if any hidden configuration exists to force periodic DNS re-resolution.
- Consider implementing one of the suggested solutions (even a minimal periodic refresh) to make Consul more resilient in Kubernetes.
Happy to provide additional sanitized logs, full rendered configs, or test a development build with enhanced retry logic.
Thank you.