Overview of the Issue
Consul peering is failing with LastHeartbeat: null and stale LastReceived/LastSend timestamps, leading to stale DNS records and service catalog information, even though the peering state is ACTIVE.
I have been migrating some datacenters from WAN federation to peering. Currently I have 13 datacenters peered with each other, with all services exported between them (trying to keep the same behavior as federation).
Things were working fine until I moved the datacenters with the largest number of services to peering. After the migration I started seeing services that were not discoverable via DNS or the HTTP API. Checking the peering status, I could see that the peering between those two datacenters was failing with LastHeartbeat: null and stale LastReceived/LastSend. The two datacenters with the most services are:
ch1-gce ~500 services
dc2-gce ~350 services
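For reference, this is roughly how I check the stream status and DNS from the dialer side; the peer name, token, and service name are placeholders from my environment:

# Read the peering status on the dialer side
$ consul peering read -name dc2-gce

# Same information via the HTTP API; the response includes the stream status
# with LastHeartbeat and LastReceive/LastSend
$ curl -s -H "X-Consul-Token: $CONSUL_HTTP_TOKEN" \
    http://127.0.0.1:8500/v1/peering/dc2-gce | jq .

# Spot-check an imported service over DNS on the custom DNS port (18600),
# using the documented <service>.service.<peer>.peer.<domain> form
$ dig @127.0.0.1 -p 18600 some-service.service.dc2-gce.peer.consul +short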
The peering fails mostly between the two datacenters with the most services, and sporadically from the other datacenters to those two.
The workaround I found to keep things updated is to force a leader election on the dialer side of the peering, which causes the catalog to be updated on both sides and refreshes LastReceived/LastSend; however, LastHeartbeat stays null, and the catalog is only updated again when another leader election is forced (see the sketch below).
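Concretely, the workaround looks roughly like this on the dialer side; the namespace and pod names are illustrative placeholders, not my exact production values:

# Find the current leader among the dialer-side servers
$ kubectl exec -n consul consul-server-0 -- consul operator raft list-peers

# Force a leader election by deleting the leader pod (the StatefulSet recreates it);
# recent Consul versions also offer `consul operator raft transfer-leader` for a cleaner handover
$ kubectl delete pod -n consul consul-server-1

# Once the new leader is elected, LastReceived/LastSend move forward again,
# but LastHeartbeat remains null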
I have checked and couldn't find anything resource related, such as CPU, memory, file descriptor, disk IO, or network issues.
Consul-specific metrics look fine as well: consul.raft.thread.main.saturation and consul.raft.thread.fsm.saturation are under 50%, as the docs recommend.
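For completeness, I checked those saturation numbers straight from the agent metrics endpoint, roughly like this (the token header may not be needed with default_policy allow):

# Scrape agent metrics in Prometheus format and look at raft thread saturation
$ curl -s -H "X-Consul-Token: $CONSUL_HTTP_TOKEN" \
    "http://127.0.0.1:8500/v1/agent/metrics?format=prometheus" \
  | grep -E "raft_thread_(main|fsm)_saturation"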
All the servers run on Kubernetes. I'm not peering through the service mesh; the peering dials the servers directly.
Reproduction Steps
I couldn't reproduce the issue locally: when I created the servers, some dummy clients and services, and the peering, everything worked fine. The fact that I have more datacenters peered with the failing ones also makes me think that the problem is not configuration or environment related.
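For reference, the local attempt followed the standard peering flow plus an export-everything entry, roughly like this; the names, token, and the wildcard export are illustrative placeholders rather than my exact production setup:

# On the acceptor side, generate a peering token
$ consul peering generate-token -name dc2-gce

# On the dialer side, establish the peering with that token
$ consul peering establish -name ch1-gce -peering-token "<token>"

# exported-services.json - export services to the peer (wildcard assumed here;
# explicit per-service entries behave the same for this test)
{
  "kind": "exported-services",
  "name": "default",
  "services": [
    {
      "name": "*",
      "consumers": [
        { "peer": "ch1-gce" }
      ]
    }
  ]
}

$ consul config write exported-services.json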
Consul info for both Client and Server
Server info | ch1-gce
/ $ consul info
agent:
    check_monitors = 0
    check_ttls = 0
    checks = 0
    services = 0
build:
    prerelease =
    revision = 59b8b905
    version = 1.21.4
    version_metadata =
consul:
    acl = enabled
    bootstrap = false
    known_datacenters = 1
    leader = false
    leader_addr = 10.97.134.90:18300
    server = true
raft:
    applied_index = 21997202
    commit_index = 21997202
    fsm_pending = 0
    last_contact = 82.26864ms
    last_log_index = 21997202
    last_log_term = 60
    last_snapshot_index = 21992289
    last_snapshot_term = 60
    latest_configuration = [{Suffrage:Voter ID:8932cb7c-a5ef-a93b-fa51-8d339dad344e Address:10.97.143.191:18300} {Suffrage:Voter ID:58985f1e-eabe-91ba-4c18-f0d8f416ac9a Address:10.97.134.90:18300} {Suffrage:Voter ID:16f93cd6-365f-f598-cc6f-865e925d0b63 Address:10.97.131.174:18300}]
    latest_configuration_index = 0
    num_peers = 2
    protocol_version = 3
    protocol_version_max = 3
    protocol_version_min = 0
    snapshot_version_max = 1
    snapshot_version_min = 0
    state = Follower
    term = 60
runtime:
    arch = amd64
    cpu_count = 32
    goroutines = 346735
    max_procs = 32
    os = linux
    version = go1.23.12
serf_lan:
    coordinate_resets = 0
    encrypted = false
    event_queue = 0
    event_time = 49
    failed = 0
    health_score = 0
    intent_queue = 0
    left = 33
    member_time = 111047
    members = 55
    query_queue = 0
    query_time = 1
serf_wan:
    coordinate_resets = 0
    encrypted = false
    event_queue = 0
    event_time = 1
    failed = 0
    health_score = 0
    intent_queue = 0
    left = 0
    member_time = 972
    members = 3
    query_queue = 0
    query_time = 1
acl.json: |-
  {
    "acl": {
      "enabled": true,
      "default_policy": "allow",
      "enable_token_persistence": true
    }
  }
central-config.json: |-
  {
    "enable_central_service_config": true
  }
debug.json: |-
  {
    "enable_debug": true
  }
server.json: |
  {
    "server": true,
    "bootstrap_expect": 3,
    "bind_addr": "0.0.0.0",
    "client_addr": "0.0.0.0",
    "domain": "consul.",
    "translate_wan_addrs": true,
    "ports": {
      "dns": 18600,
      "http": 8500,
      "grpc": 18400,
      "server": 18300,
      "serf_lan": 18301,
      "serf_wan": 18302
    },
    "http_config": {
      "response_headers": {
        "Access-Control-Allow-Origin": "*",
        "Access-Control-Allow-Methods": "*",
        "Access-Control-Allow-Headers": "*"
      }
    },
    "dns_config": {
      "enable_truncate": false,
      "soa": {
        "min_ttl": 1
      },
      "service_ttl": {
        "*": "1s"
      },
      "node_ttl": "1m"
    },
    "performance": {
      "leave_drain_time": "5s",
      "rpc_hold_timeout": "7s",
      "raft_multiplier": 1
    },
    "connect": {
      "enabled": true
    },
    "config_entries": {
      "bootstrap": [
        {
          "kind": "proxy-defaults",
          "name": "global",
          "config": {
            "protocol": "http"
          }
        }
      ]
    },
    "limits": {
      "http_max_conns_per_client": 1000
    },
    "leave_on_terminate": true,
    "autopilot": {
      "min_quorum": 2,
      "disable_upgrade_migration": true
    }
  }
telemetry-config.json: |-
  {
    "telemetry": {
      "prometheus_retention_time": "1m",
      "disable_hostname": true,
      "enable_host_metrics": false
    }
  }
ui-config.json: |-
  {
    "ui_config": {
      "metrics_provider": "prometheus",
      "metrics_proxy": {
        "base_url": "http://thanos.query.consul:10902"
      },
      "enabled": true
    }
  }
Operating system and Environment details
- OS: Container-Optimized OS from Google
- Consul version: 1.21.4
- Deployment method: consul-k8s 1.7.2 Helm chart
Log Fragments
The only thing I could notice in the logs is a lot of messages like:
2025-10-03T15:58:56.679Z [TRACE] agent.server.grpc-api.peerstream.stream.subscriptions: skipping send of duplicate public event: dialer=false peer_id=eb015e97-d43d-d00a-0974-b4e2fd713c83 peer_name=lv1-tbm correlationID=exported-service:aits-react-agents-sidecar-proxy
I collected the logs from both datacenters, ch1-gce and dc2-gce, with these "skipping send of duplicate public event" messages filtered out, since they make up the vast majority of the log lines:
$ cat consul-ch1-gce.log | grep "skipping send of duplicate public event" | wc -l
408221
$ cat consul-ch1-gce.log | wc -l
413017
$ cat consul-dc2-gce.log | wc -l
747534
$ cat consul-dc2-gce.log | grep "skipping send of duplicate public event" | wc -l
719556