Consul peering state ACTIVE and LastHeartbeat null #22868

@diogenxs

Description

Overview of the Issue

Consul peering is failing with LastHeartbeat: null and stale LastReceived/LastSend, leading to stale DNS records and service catalog information, even though the peering state is ACTIVE.


I have been migrating some datacenters from WAN federation to peering. Currently I have 13 datacenters peered with each other, with all services exported between them (trying to keep the same behavior as federation).
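
For context, every datacenter exports all of its services to every peer with an exported-services config entry along these lines (an illustrative sketch, not the exact entry; the peer names are just two of the datacenters mentioned in this report):

  {
      "Kind": "exported-services",
      "Name": "default",
      "Services": [
          {
              "Name": "*",
              "Consumers": [
                  { "Peer": "dc2-gce" },
                  { "Peer": "lv1-tbm" }
              ]
          }
      ]
  }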

Things were working fine until I moved the two biggest datacenters (in number of services) to peering. After the migration I started seeing issues with services not being discoverable via DNS or the HTTP API; checking the peering status, I could see that the peering between those two datacenters was failing with LastHeartbeat: null and stale LastReceived/LastSend.
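
For reference, this is how I'm reading the stream status (peer name is an example), via the peering read endpoint:

  $ curl -s 'localhost:8500/v1/peering/dc2-gce' | jq .StreamStatus

which is where the LastHeartbeat, LastReceive and LastSend values come from. The consul peering read -name=dc2-gce CLI shows the same information.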

ch1-gce ~500 services
dc2-gce ~350 services

The peering fails mostly between the two datacenters with the most services, and sporadically from the other datacenters to those two.
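
The DNS lookups that go stale are imported-service queries of this shape (the service name is hypothetical; the peer name and DNS port match the server config below):

  $ dig @localhost -p 18600 some-service.service.dc2-gce.peer.consul +short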

The workaround I found to keep things updated is to force a leader election on the dialer side of the peering, which makes the catalog update on both sides and refreshes LastReceived/LastSend; LastHeartbeat stays null, however, and the catalog only updates again when I force another leader election.
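
Since the servers run on Kubernetes, forcing the election looks roughly like this (pod name and namespace are examples):

  # find the current leader on the dialer side
  $ consul operator raft list-peers
  # delete its pod so a new election happens
  $ kubectl delete pod consul-server-1 -n consul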

I have checked and couldn't find anything resource-related: CPU, memory, file descriptors, disk IO, and network all look fine.

Consul-specific metrics look fine as well: consul.raft.thread.main.saturation and consul.raft.thread.fsm.saturation are under 50%, the threshold the docs indicate.
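
I'm reading those from the agent metrics endpoint (Prometheus retention is enabled in the telemetry config below):

  $ curl -s 'localhost:8500/v1/agent/metrics?format=prometheus' | grep raft_thread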

All the servers run on Kubernetes. I'm not peering through the service mesh; the dialer connects directly to the servers.

Reproduction Steps

I couldn't reproduce the issue locally: when I created the servers, some dummy clients and services, and the peering, everything worked fine. The fact that I have other datacenters peered with the failing ones without problems also makes me think the issue is not configuration- or environment-related.
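
For completeness, the peerings are established the standard way, dialing the servers directly (names are examples):

  # on the acceptor side (ch1-gce)
  $ consul peering generate-token -name=dc2-gce

  # on the dialer side (dc2-gce), with the token from above
  $ consul peering establish -name=ch1-gce -peering-token=<token>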

Consul info for both Client and Server

Server info | ch1-gce
/ $ consul info
agent:
        check_monitors = 0
        check_ttls = 0
        checks = 0
        services = 0
build:
        prerelease = 
        revision = 59b8b905
        version = 1.21.4
        version_metadata = 
consul:
        acl = enabled
        bootstrap = false
        known_datacenters = 1
        leader = false
        leader_addr = 10.97.134.90:18300
        server = true
raft:
        applied_index = 21997202
        commit_index = 21997202
        fsm_pending = 0
        last_contact = 82.26864ms
        last_log_index = 21997202
        last_log_term = 60
        last_snapshot_index = 21992289
        last_snapshot_term = 60
        latest_configuration = [{Suffrage:Voter ID:8932cb7c-a5ef-a93b-fa51-8d339dad344e Address:10.97.143.191:18300} {Suffrage:Voter ID:58985f1e-eabe-91ba-4c18-f0d8f416ac9a Address:10.97.134.90:18300} {Suffrage:Voter ID:16f93cd6-365f-f598-cc6f-865e925d0b63 Address:10.97.131.174:18300}]
        latest_configuration_index = 0
        num_peers = 2
        protocol_version = 3
        protocol_version_max = 3
        protocol_version_min = 0
        snapshot_version_max = 1
        snapshot_version_min = 0
        state = Follower
        term = 60
runtime:
        arch = amd64
        cpu_count = 32
        goroutines = 346735
        max_procs = 32
        os = linux
        version = go1.23.12
serf_lan:
        coordinate_resets = 0
        encrypted = false
        event_queue = 0
        event_time = 49
        failed = 0
        health_score = 0
        intent_queue = 0
        left = 33
        member_time = 111047
        members = 55
        query_queue = 0
        query_time = 1
serf_wan:
        coordinate_resets = 0
        encrypted = false
        event_queue = 0
        event_time = 1
        failed = 0
        health_score = 0
        intent_queue = 0
        left = 0
        member_time = 972
        members = 3
        query_queue = 0
        query_time = 1
acl.json: |-
  {
      "acl": {
          "enabled": true,
          "default_policy": "allow",
          "enable_token_persistence": true
      }
  }
central-config.json: |-
  {
    "enable_central_service_config": true
  }
debug.json: |-
  {
    "enable_debug": true
  }
server.json: |
  {
      "server": true,
      "bootstrap_expect": 3,
      "bind_addr": "0.0.0.0",
      "client_addr": "0.0.0.0",
      "domain": "consul.",
      "translate_wan_addrs": true,
      "ports": {
          "dns": 18600,
          "http": 8500,
          "grpc": 18400,
          "server": 18300,
          "serf_lan": 18301,
          "serf_wan": 18302
      },
      "http_config": {
          "response_headers": {
              "Access-Control-Allow-Origin": "*",
              "Access-Control-Allow-Methods": "*",
              "Access-Control-Allow-Headers": "*"
          }
      },
      "dns_config": {
          "enable_truncate": false,
          "soa": {
              "min_ttl": 1
            },
            "service_ttl": {
              "*": "1s"
            },
            "node_ttl": "1m"
      },
      "performance": {
          "leave_drain_time": "5s",
          "rpc_hold_timeout": "7s",
          "raft_multiplier": 1
      },
      "connect": {
          "enabled": true
      },
      "config_entries": {
          "bootstrap": [
              {
                  "kind": "proxy-defaults",
                  "name": "global",
                  "config": {
                      "protocol": "http"
                  }
              }
          ]
      },
      "limits": {
          "http_max_conns_per_client": 1000
      },
      "leave_on_terminate": true,
      "autopilot": {
          "min_quorum": 2,
          "disable_upgrade_migration": true
      }
  }
telemetry-config.json: |-
  {
    "telemetry": {
      "prometheus_retention_time": "1m",
      "disable_hostname": true,
      "enable_host_metrics": false
    }
  }
ui-config.json: |-
  {
    "ui_config": {
      "metrics_provider": "prometheus",
      "metrics_proxy": {
        "base_url": "http://thanos.query.consul:10902"
      },
      "enabled": true
    }
  }"

Operating system and Environment details

  • OS: Container-Optimized OS from Google
  • Consul version: 1.21.4
  • Deployment method: consul-k8s 1.7.2 helm chart

Log Fragments

The only thing I noticed in the logs is a lot of messages like:

2025-10-03T15:58:56.679Z [TRACE] agent.server.grpc-api.peerstream.stream.subscriptions: skipping send of duplicate public event: dialer=false peer_id=eb015e97-d43d-d00a-0974-b4e2fd713c83 peer_name=lv1-tbm correlationID=exported-service:aits-react-agents-sidecar-proxy

The attached logs are from both datacenters, ch1-gce and dc2-gce, with these "skipping send of duplicate public event" messages filtered out; per the counts below, they make up over 95% of the log lines.

$  cat consul-ch1-gce.log | grep "skipping send of duplicate public event" | wc -l 
408221
$  cat consul-ch1-gce.log  | wc -l 
413017

$  cat consul-dc2-gce.log  | wc -l 
747534
$  cat consul-dc2-gce.log | grep "skipping send of duplicate public event" | wc -l 
719556

logs.tar.gz
