Overview of the Issue
Consul peering is failing with LastHeartbeat: null and stale LastReceived/LastSend timestamps, leading to stale DNS records and service catalog information, even though the peering state is ACTIVE.
I have been migrating some datacenters from WAN federation to peering. Currently I have 13 datacenters peered with each other, with all services exported between them (trying to keep the same behavior as federation).
Things were working fine until I moved the datacenters with the largest number of services to peering. After the migration I started seeing services that were not discoverable via DNS or the HTTP API. Checking the peering status, I could see that the peering between those two datacenters was failing with LastHeartbeat: null and stale LastReceived/LastSend. The two datacenters with the most services are:
ch1-gce ~500 services
dc2-gce ~350 services
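For reference, this is roughly how I check the stream status and DNS from the dialer side; the peer name, token, and service name are placeholders from my environment:

# Read the peering status on the dialer side
$ consul peering read -name dc2-gce

# Same information via the HTTP API; the response includes the stream status
# with LastHeartbeat and LastReceive/LastSend
$ curl -s -H "X-Consul-Token: $CONSUL_HTTP_TOKEN" \
    http://127.0.0.1:8500/v1/peering/dc2-gce | jq .

# Spot-check an imported service over DNS on the custom DNS port (18600),
# using the documented <service>.service.<peer>.peer.<domain> form
$ dig @127.0.0.1 -p 18600 some-service.service.dc2-gce.peer.consul +short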
The peering fails mostly between the two datacenters with the most services, and sporadically from the other datacenters to those two.
The workaround I found to keep things updated is to force a leader election on the dialer side of the peering, which causes the catalog to be updated on both sides and refreshes LastReceived/LastSend; however, LastHeartbeat stays null, and the catalog is only updated again when another leader election is forced (see the sketch below).
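Concretely, the workaround looks roughly like this on the dialer side; the namespace and pod names are illustrative placeholders, not my exact production values:

# Find the current leader among the dialer-side servers
$ kubectl exec -n consul consul-server-0 -- consul operator raft list-peers

# Force a leader election by deleting the leader pod (the StatefulSet recreates it);
# recent Consul versions also offer `consul operator raft transfer-leader` for a cleaner handover
$ kubectl delete pod -n consul consul-server-1

# Once the new leader is elected, LastReceived/LastSend move forward again,
# but LastHeartbeat remains null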
I have checked and couldn't find anything resource related, such as CPU, memory, file descriptor, disk IO, or network issues.
Consul-specific metrics look fine as well: consul.raft.thread.main.saturation and consul.raft.thread.fsm.saturation are under 50%, as the docs recommend.
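For completeness, I checked those saturation numbers straight from the agent metrics endpoint, roughly like this (the token header may not be needed with default_policy allow):

# Scrape agent metrics in Prometheus format and look at raft thread saturation
$ curl -s -H "X-Consul-Token: $CONSUL_HTTP_TOKEN" \
    "http://127.0.0.1:8500/v1/agent/metrics?format=prometheus" \
  | grep -E "raft_thread_(main|fsm)_saturation"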
All the servers run on Kubernetes. I'm not peering through the service mesh; the peering dials the servers directly.
Reproduction Steps
I couldn't reproduce the issue locally: when I created the servers, some dummy clients and services, and the peering, everything worked fine. The fact that I have more datacenters peered with the failing ones also makes me think that the problem is not configuration or environment related.
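For reference, the local attempt followed the standard peering flow plus an export-everything entry, roughly like this; the names, token, and the wildcard export are illustrative placeholders rather than my exact production setup:

# On the acceptor side, generate a peering token
$ consul peering generate-token -name dc2-gce

# On the dialer side, establish the peering with that token
$ consul peering establish -name ch1-gce -peering-token "<token>"

# exported-services.json - export services to the peer (wildcard assumed here;
# explicit per-service entries behave the same for this test)
{
  "kind": "exported-services",
  "name": "default",
  "services": [
    {
      "name": "*",
      "consumers": [
        { "peer": "ch1-gce" }
      ]
    }
  ]
}

$ consul config write exported-services.json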
Consul info for both Client and Server
Server info | ch1-gce
/ $ consul info
agent:
    check_monitors = 0
    check_ttls = 0
    checks = 0
    services = 0
build:
    prerelease =
    revision = 59b8b905
    version = 1.21.4
    version_metadata =
consul:
    acl = enabled
    bootstrap = false
    known_datacenters = 1
    leader = false
    leader_addr = 10.97.134.90:18300
    server = true
raft:
    applied_index = 21997202
    commit_index = 21997202
    fsm_pending = 0
    last_contact = 82.26864ms
    last_log_index = 21997202
    last_log_term = 60
    last_snapshot_index = 21992289
    last_snapshot_term = 60
    latest_configuration = [{Suffrage:Voter ID:8932cb7c-a5ef-a93b-fa51-8d339dad344e Address:10.97.143.191:18300} {Suffrage:Voter ID:58985f1e-eabe-91ba-4c18-f0d8f416ac9a Address:10.97.134.90:18300} {Suffrage:Voter ID:16f93cd6-365f-f598-cc6f-865e925d0b63 Address:10.97.131.174:18300}]
    latest_configuration_index = 0
    num_peers = 2
    protocol_version = 3
    protocol_version_max = 3
    protocol_version_min = 0
    snapshot_version_max = 1
    snapshot_version_min = 0
    state = Follower
    term = 60
runtime:
    arch = amd64
    cpu_count = 32
    goroutines = 346735
    max_procs = 32
    os = linux
    version = go1.23.12
serf_lan:
    coordinate_resets = 0
    encrypted = false
    event_queue = 0
    event_time = 49
    failed = 0
    health_score = 0
    intent_queue = 0
    left = 33
    member_time = 111047
    members = 55
    query_queue = 0
    query_time = 1
serf_wan:
    coordinate_resets = 0
    encrypted = false
    event_queue = 0
    event_time = 1
    failed = 0
    health_score = 0
    intent_queue = 0
    left = 0
    member_time = 972
    members = 3
    query_queue = 0
    query_time = 1
acl.json: |-
  {
    "acl": {
      "enabled": true,
      "default_policy": "allow",
      "enable_token_persistence": true
    }
  }
central-config.json: |-
  {
    "enable_central_service_config": true
  }
debug.json: |-
  {
    "enable_debug": true
  }
server.json: |
  {
    "server": true,
    "bootstrap_expect": 3,
    "bind_addr": "0.0.0.0",
    "client_addr": "0.0.0.0",
    "domain": "consul.",
    "translate_wan_addrs": true,
    "ports": {
      "dns": 18600,
      "http": 8500,
      "grpc": 18400,
      "server": 18300,
      "serf_lan": 18301,
      "serf_wan": 18302
    },
    "http_config": {
      "response_headers": {
        "Access-Control-Allow-Origin": "*",
        "Access-Control-Allow-Methods": "*",
        "Access-Control-Allow-Headers": "*"
      }
    },
    "dns_config": {
      "enable_truncate": false,
      "soa": {
        "min_ttl": 1
      },
      "service_ttl": {
        "*": "1s"
      },
      "node_ttl": "1m"
    },
    "performance": {
      "leave_drain_time": "5s",
      "rpc_hold_timeout": "7s",
      "raft_multiplier": 1
    },
    "connect": {
      "enabled": true
    },
    "config_entries": {
      "bootstrap": [
        {
          "kind": "proxy-defaults",
          "name": "global",
          "config": {
            "protocol": "http"
          }
        }
      ]
    },
    "limits": {
      "http_max_conns_per_client": 1000
    },
    "leave_on_terminate": true,
    "autopilot": {
      "min_quorum": 2,
      "disable_upgrade_migration": true
    }
  }
telemetry-config.json: |-
  {
    "telemetry": {
      "prometheus_retention_time": "1m",
      "disable_hostname": true,
      "enable_host_metrics": false
    }
  }
ui-config.json: |-
  {
    "ui_config": {
      "metrics_provider": "prometheus",
      "metrics_proxy": {
        "base_url": "http://thanos.query.consul:10902"
      },
      "enabled": true
    }
  }
Operating system and Environment details
- OS: Container-Optimized OS from Google
- Consul version: 1.21.4
- Deployment method: consul-k8s 1.7.2 Helm chart
Log Fragments
The only thing I could notice in the logs is a lot of messages like:
2025-10-03T15:58:56.679Z [TRACE] agent.server.grpc-api.peerstream.stream.subscriptions: skipping send of duplicate public event: dialer=false peer_id=eb015e97-d43d-d00a-0974-b4e2fd713c83 peer_name=lv1-tbm correlationID=exported-service:aits-react-agents-sidecar-proxy
I collected the logs from both datacenters, ch1-gce and dc2-gce, with these "skipping send of duplicate public event" messages filtered out, since they make up the vast majority of the log lines:
$ cat consul-ch1-gce.log | grep "skipping send of duplicate public event" | wc -l
408221
$ cat consul-ch1-gce.log | wc -l
413017
$ cat consul-dc2-gce.log | wc -l
747534
$ cat consul-dc2-gce.log | grep "skipping send of duplicate public event" | wc -l
719556