
database-peers relation missing tls=enabled on one unit causes HTTP health checks against HTTPS Patroni API #1305

@nsakkos

Description


In a 3-unit postgresql cluster integrated with self-signed-certificates for TLS, two units had tls: enabled in the database-peers relation data while the third unit did not. This manifested as "member awaiting to start" on the problematic unit, and the following errors in the Patroni logs of each unit showed that something was sending plain HTTP to the Patroni API:

...
[SSL: TLSV1_ALERT_UNKNOWN_CA] tlsv1 alert unknown ca
...
[SSL: HTTP_REQUEST] http request

On the affected unit, the Patroni API had TLS configured and was serving the expected certificate (verified manually), but the charm was unaware of the TLS configuration and kept attempting Patroni health checks over plain HTTP on port 8008, leading to the repeated errors above.
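The mismatch can also be demonstrated by hand against the Patroni REST API. A minimal sketch, assuming the API listens on port 8008 as above and <unit_IP> stands for the affected unit's address:

# Plain HTTP against the TLS-enabled endpoint fails, producing the
# "[SSL: HTTP_REQUEST] http request" entries in the Patroni logs:
curl -v http://<unit_IP>:8008/health

# HTTPS succeeds; -k skips verification of the self-signed CA and -v
# prints the certificate Patroni serves:
curl -vk https://<unit_IP>:8008/health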

As recommended by @marceloneppel and @dragomirp, the following workaround resolved the issue:

# Check each unit for a missing "tls: enabled" flag.
juju show-unit postgresql/X --endpoint database-peers

# Grab the relation ID for the next command.
juju show-unit postgresql/X --endpoint database-peers | grep relation-id

# Set the missing flag on the faulty unit's peer relation data.
juju exec --unit postgresql/X 'relation-set -r RELATION-ID-FROM-ABOVE tls="enabled"'

where postgresql/X is the faulty unit. After this, the charm started using HTTPS for Patroni, Patroni stopped logging errors, health checks began working correctly, and the unit appeared healthy in juju status.
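To double-check that the workaround took effect, a sketch along these lines can be used (same unit name and endpoint as above; <unit_IP> is the unit's address):

# The flag should now be present in the relation data:
juju show-unit postgresql/X --endpoint database-peers | grep tls

# And the health endpoint should answer over HTTPS (-k because the
# certificate is self-signed):
juju exec --unit postgresql/X -- curl -sk https://<unit_IP>:8008/health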

Expected behavior

When TLS is enabled for the application (via the certificates relation / self-signed-certificates) and Patroni is configured with TLS, all units should carry a consistent tls: enabled flag in the database-peers relation data.

Actual behavior

Occasionally, units end up with Patroni configured for TLS but no tls: enabled flag in their database-peers relation data.
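Until this is fixed, affected deployments can be audited with the same grep approach as the workaround above; a rough sketch, assuming a 3-unit cluster:

# Show each unit's view of the peer relation data; a unit whose output
# lacks "tls: enabled" for itself is affected. Adjust the unit numbers
# for your deployment:
for i in 0 1 2; do
  echo "=== postgresql/$i ==="
  juju show-unit postgresql/$i --endpoint database-peers | grep 'tls:'
done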

Versions

Juju CLI: 3.6.11

Juju agent: 3.6.11

Charm revision: 14/stable rev 553, base 22.04

self-signed-certificates: latest/edge, rev 419

Log output

Juju debug log: replay.txt (attached)

tcpdump on the affected unit:

GET /cluster HTTP/1.1
Host: <local_IP>:8008
User-Agent: python-requests/2.32.3
Accept-Encoding: gzip, deflate
Accept: */*
Connection: keep-alive
Authorization: Basic 

GET /health HTTP/1.1
Host: 10.146.65.77:8008
User-Agent: python-requests/2.32.3
...
Authorization: Basic 

ss -tnp "dport = :8008" output while the errors were occurring:

SYN-SENT 0      1       10.146.65.77:46374 10.146.65.69:8008 users:(("python",pid=1058948,fd=4))
SYN-SENT 0      1       10.146.65.77:46080 10.146.65.69:8008 users:(("python",pid=1058948,fd=4))
CLOSE-WAIT 361    0       10.146.65.77:49812 10.146.65.72:8008 users:(("python",pid=1058948,fd=4))


$ ps aux | grep 1058948 
root     1058948  0.5  0.9 167436 74708 ?        Sl   14:52   0:00 /var/lib/juju/agents/unit-postgresql-2/charm/venv/bin/python /var/lib/juju/agents/unit-postgresql-2/charm/src/charm.py


