Overview of the Issue
Consul clients resolve DNS names given in retry_join only once at agent startup and then cache the resulting IPs indefinitely. When Consul server pods (in Kubernetes) are restarted and receive new IP addresses (e.g. during node upgrades, StatefulSet restarts, evictions, autoscaling), clients — including newly restarted client pods — keep attempting to join the cluster using stale, previously resolved server IPs. They never trigger a fresh DNS lookup on subsequent retry attempts.
This results in clients being permanently unable to join until they are restarted at a point when DNS also returns the updated IPs (or until manual intervention). In Kubernetes environments where pod/server IP churn is normal, this makes Consul fragile during routine operations.
Root cause: The retry_join logic resolves DNS hostnames once and does not re-resolve them on repeated join retries or on connection failures. DNS TTLs are not honored, and there is no periodic refresh of hostname-based seeds.
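The following Go sketch is purely illustrative (it is not Consul's code; every identifier in it is invented for this report). It contrasts the observed resolve-once behavior with a retry loop that performs a fresh lookup on every attempt, which is what the suggestions later in this report amount to:

package retryjoin

import (
    "context"
    "log"
    "net"
    "time"
)

// resolveSeeds does a fresh DNS lookup for every configured hostname and
// returns whatever currently resolves.
func resolveSeeds(ctx context.Context, hosts []string) []string {
    var addrs []string
    for _, h := range hosts {
        ips, err := net.DefaultResolver.LookupIPAddr(ctx, h)
        if err != nil {
            continue // skip names that do not resolve right now
        }
        for _, ip := range ips {
            addrs = append(addrs, ip.IP.String())
        }
    }
    return addrs
}

// retryJoin re-resolves the hostnames on every attempt, so new server pod IPs
// are picked up as soon as DNS serves them. The behavior observed here instead
// resembles calling resolveSeeds once before the loop and reusing the result.
func retryJoin(ctx context.Context, hosts []string, join func([]string) error) error {
    ticker := time.NewTicker(30 * time.Second)
    defer ticker.Stop()
    for {
        if addrs := resolveSeeds(ctx, hosts); len(addrs) > 0 {
            if err := join(addrs); err == nil {
                return nil
            }
        }
        log.Println("join failed, retrying with fresh DNS results")
        select {
        case <-ctx.Done():
            return ctx.Err()
        case <-ticker.C:
        }
    }
}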
Reproduction Steps
Environment: Consul deployed via official Helm chart (consul-k8s / chart version 1.5.5) on GKE (Kubernetes v1.28.x).
- Deploy Consul with servers using explicit pod DNS names in retry_join:

global:
  name: consul
  datacenter: dc1
  domain: consul.example.com
  enabled: true
  logLevel: debug
server:
  enabled: true
  replicas: 3
  bootstrapExpect: 3
  extraConfig: |
    {
      "retry_join": [
        "consul-server-0.consul.example.com.",
        "consul-server-1.consul.example.com.",
        "consul-server-2.consul.example.com."
      ]
    }
client:
  enabled: true
- Wait for the cluster to become healthy. Initial DNS resolutions (example):
- consul-server-0 → 10.1.2.10
- consul-server-1 → 10.1.2.11
- consul-server-2 → 10.1.2.12
- Restart the server StatefulSet (simulating node upgrade / maintenance):
kubectl rollout restart statefulset/consul-server -n consul
- New server pod IPs appear (example):
- consul-server-0 → 10.1.5.20
- consul-server-1 → 10.1.5.21
- consul-server-2 → 10.1.5.22
- Restart client pods:
kubectl delete pods -l app=consul,component=client -n consul
- From a freshly restarted client pod, verify DNS now correctly returns the new IPs:
kubectl exec -it consul-client-xyz -n consul -- nslookup consul-server-0.consul.example.com. # Returns: 10.1.5.20
- Observe the Consul client logs (commands for doing so follow this list): the client continues attempting to join using the stale IPs (10.1.2.x) and never re-resolves the hostnames on retry.
- Client remains stuck (e.g. init container never completes) indefinitely.
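The log-observation step above can be done with standard commands; the pod name consul-client-xyz and the addresses in the comments are the example values used throughout this report:
# What the stuck client resolved at startup (stale 10.1.2.x addresses)
kubectl logs consul-client-xyz -n consul | grep 'Resolved consul-server'
# What cluster DNS currently returns (new 10.1.5.x addresses)
kubectl exec consul-client-xyz -n consul -- nslookup consul-server-0.consul.example.com.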
Actual Behavior
- Client agent resolves retry_join hostnames once (at its startup) to the old server pod IPs.
- On repeated join retry attempts, it reuses the cached IPs; no subsequent DNS queries occur.
- Connection attempts fail permanently: dial tcp 10.1.2.x:8301: connect: connection refused.
- System does not recover automatically even though DNS is already serving updated A records.
Expected Behavior
- On join retry (especially after connection refusal), hostname seeds should be re-resolved.
- DNS TTLs (or a configurable re-resolution interval) should be honored.
- Clients should eventually discover the new server pod IPs without manual restarts.
- Behavior should be robust under normal Kubernetes pod/network churn.
Impact
This blocks:
- GKE (or other cloud) node auto-upgrades
- Rolling restarts of Consul servers
- Autoscaling events
- Pod evictions / rescheduling
- Disaster recovery procedures
Severity: High for production Kubernetes environments. It can cause widespread Consul client unavailability until manual intervention.
Workarounds Attempted
- Using static IPs instead of hostnames (defeats flexibility; brittle).
- Manually restarting all client pods after server restarts (operationally expensive, race-prone).
- Relying on Kubernetes Services: not used here because pod-specific DNS names were required; even with Services, lack of periodic re-resolution could still be problematic if caching persists internally.
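One further avenue, not tested in this environment and therefore only an assumption: the debug log fragment later in this report lists k8s among the supported retry-join discovery methods, so a provider-based seed that queries the Kubernetes API on each attempt might avoid hostname caching entirely. The namespace and label selector below are guesses based on this deployment, not verified values:
retry_join = [
  "provider=k8s namespace=consul label_selector=\"app=consul,component=server\""
]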
Suggested Solutions
- Add periodic DNS re-resolution for retry_join hostnames (e.g. every N seconds, configurable).
- Re-resolve hostnames immediately after a failed connection to all cached addresses.
- Honor DNS record TTLs where available.
- Provide a configuration knob (e.g. retry_join_dns_refresh_interval); a sketch of what this could look like follows this list.
- (Longer term) Support a pluggable seed provider with dynamic refresh capability for Kubernetes.
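To make the knob suggestion concrete, this is roughly what it could look like in the client agent HCL. The option is hypothetical: it does not exist in Consul today, and the name and value format are only the proposal above.
retry_join = [
  "consul-server-0.consul.example.com.",
  "consul-server-1.consul.example.com.",
  "consul-server-2.consul.example.com."
]
# Hypothetical option, not an existing Consul setting: re-resolve the
# retry_join hostnames at most this often while joining.
retry_join_dns_refresh_interval = "30s"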
Consul info for both Client and Server
(Representative structure below; actual redacted outputs can be supplied if required. Sensitive tokens removed.)
Client info
# Output from `consul info` (redacted example)
agent:
build_date = 2024-xx-xx
consul_version = 1.17.x
datacenter = dc1
node_id = <redacted>
node_name = consul-client-xyz
dns:
recursors = []
runtime:
arch = amd64
os = linux
...
# Client agent HCL config (derived from Helm + defaults; redacted)
datacenter = "dc1"
data_dir = "/consul/data"
retry_join = [
"consul-server-0.consul.example.com.",
"consul-server-1.consul.example.com.",
"consul-server-2.consul.example.com."
]
log_level = "DEBUG"
verify_outgoing = true
verify_server_hostname = true
ca_file = "/consul/tls/ca.pem"
auto_encrypt {
tls = true
}
ports {
grpc = 8502
}
Server info
# Output from `consul info` (redacted example)
agent:
server = true
bootstrap_expect = 3
consul_version = 1.17.x
datacenter = dc1
node_name = consul-server-0
peers = 2
raft:
protocol_version = 3
last_log_index = ...
...
# Server agent HCL config (as rendered)
server = true
bootstrap_expect = 3
datacenter = "dc1"
data_dir = "/consul/data"
retry_join = [
"consul-server-0.consul.example.com.",
"consul-server-1.consul.example.com.",
"consul-server-2.consul.example.com."
]
log_level = "DEBUG"
verify_outgoing = true
verify_server_hostname = true
ca_file = "/consul/tls/ca.pem"
auto_encrypt {
allow_tls = true
}
limits {
http_max_conns_per_client = 1000
rpc_max_conns_per_client = 1000
}
Operating system and Environment details
- Platform: GKE (Google Kubernetes Engine)
- Kubernetes Version: v1.28.x
- Consul Helm chart (consul-k8s): 1.5.5
- Consul image version: (from chart defaults; presumed 1.17.x lineage — can supply exact digest if required)
- CNI: Default GKE networking
- Architecture: amd64
- TLS + ACLs partially enabled (ACL system tokens managed externally; bootstrap token provided)
Log Fragments
Freshly restarted client (after server pod IP changes):
[DEBUG] agent: Starting Consul agent (fresh restart)
[INFO] agent: Consul agent running!
[DEBUG] agent: Retry join is supported for the following discovery methods: cluster_addr, aliyun, aws, azure, digitalocean, gce, k8s, linode, mdns, os, scaleway, triton, vsphere
[INFO] agent: Joining cluster...
[DEBUG] agent: (LAN) joining: [consul-server-0.consul.example.com.:8301 consul-server-1.consul.example.com.:8301 consul-server-2.consul.example.com.:8301]
[DEBUG] agent: Resolved consul-server-0.consul.example.com.:8301 to 10.1.2.10:8301
[DEBUG] agent: Resolved consul-server-1.consul.example.com.:8301 to 10.1.2.11:8301
[DEBUG] agent: Resolved consul-server-2.consul.example.com.:8301 to 10.1.2.12:8301
[ERROR] agent: failed to join: error="dial tcp 10.1.2.10:8301: connect: connection refused" address=10.1.2.10:8301
[ERROR] agent: failed to join: error="dial tcp 10.1.2.11:8301: connect: connection refused" address=10.1.2.11:8301
[ERROR] agent: failed to join: error="dial tcp 10.1.2.12:8301: connect: connection refused" address=10.1.2.12:8301
[WARN] agent: Join failed: error="3 errors occurred:
* dial tcp 10.1.2.10:8301: connection refused
* dial tcp 10.1.2.11:8301: connection refused
* dial tcp 10.1.2.12:8301: connection refused"
[DEBUG] agent: (LAN) joining: [consul-server-0.consul.example.com.:8301 consul-server-1.consul.example.com.:8301 consul-server-2.consul.example.com.:8301]
[ERROR] agent: failed to join: error="dial tcp 10.1.2.10:8301: connect: connection refused" address=10.1.2.10:8301
...
# (Loop repeats; no new DNS resolution lines appear)
Manual DNS lookups (same pod; confirms updated IPs exist and are resolvable):
$ nslookup consul-server-0.consul.example.com.
Address 1: 10.1.5.20
$ nslookup consul-server-1.consul.example.com.
Address 1: 10.1.5.21
$ nslookup consul-server-2.consul.example.com.
Address 1: 10.1.5.22
Additional Context
The absence of dynamic re-resolution makes retry_join fragile for Kubernetes StatefulSet-based servers. In typical operational patterns (pod churn, rolling restarts, auto-upgrades), this leads to cascading client unavailability. Other distributed systems (e.g. etcd, Nomad with retry join via providers) perform periodic re-resolution or have provider-based discovery that refreshes endpoints. Consul’s static caching here appears to be an outlier and operational risk.
Request
Please:
- Confirm whether current behavior is intentional or a bug.
- Advise if any hidden configuration exists to force periodic DNS re-resolution.
- Consider implementing one of the suggested solutions (even a minimal periodic refresh) to make Consul more resilient in Kubernetes.
Happy to provide additional sanitized logs, full rendered configs, or test a development build with enhanced retry logic.
Thank you.