Description
What version were you using?
nats Helm chart version: 1.2.1
NATS server image: nats:2.10.17-alpine
What environment was the server running in?
Azure AKS k8s version 1.28.9
The nats chart is included as a sub-chart of our main release, so the following config gets merged with the nats chart's default values.yaml.
nats section of our values.yaml:
```yaml
nats:
  # ========================== My Org's custom nats config =========================
  # Whether or not to install the nats subchart to the "nats" namespace
  enabled: false
  # FQDN for bridge agents to connect to nats
  # Example: ORG-bridge.company.com
  bridgeAddress:
  # Name of the secret containing the TLS certificate. Defines a helm anchor so
  # that this value can be passed to the nats chart below.
  tlsCertSecretName: &tlsCertSecretName bridge-ingress-cert

  # =========================== Nats Helm Chart Config ============================
  # Expose service as a public IP
  service:
    merge:
      spec:
        type: LoadBalancer
  # Config for nats server
  config:
    # Flexible config block that gets merged into nats.conf
    merge:
      # Resolver.conf config for memory resolver. Set with:
      #   --set nats.config.merge.operator=$NATS_OPERATOR_JWT \
      #   --set nats.config.merge.system_account=$NATS_SYS_ACCOUNT_ID \
      #   --set nats.config.merge.resolver_preload.$NATS_SYS_ACCOUNT_ID=$NATS_SYS_ACCOUNT_JWT
      operator: OPERATOR_JWT
      system_account: SYSTEM_ACCOUNT_ID
      resolver: MEM
      resolver_preload:
        # SYSTEM_ACCOUNT_ID: SYSTEM_ACCOUNT_JWT
    nats:
      tls:
        enabled: true
        secretName: *tlsCertSecretName
        merge:
          timeout: 50
    monitor:
      tls:
        enabled: true
    # Run resolver: https://docs.nats.io/running-a-nats-service/configuration/securing_nats/auth_intro/jwt/resolver
    resolver:
      enabled: true
```
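For reference, my expectation is that the values above render roughly the following into nats.conf. This is a sketch, not the actual file: the cert/key paths are my assumption about where the chart mounts the secret, and the real rendered config can be checked with `kubectl exec ORG-nats-0 -c nats -- cat /etc/nats-config/nats.conf`.

```text
# Sketch of the expected rendered config (cert/key paths are assumptions)
port: 4222
http_port: 8222
operator: OPERATOR_JWT
system_account: SYSTEM_ACCOUNT_ID
resolver: MEM
resolver_preload: {
  # SYSTEM_ACCOUNT_ID: SYSTEM_ACCOUNT_JWT
}
tls {
  cert_file: /etc/nats-certs/nats/tls.crt  # assumed mount path
  key_file: /etc/nats-certs/nats/tls.key   # assumed mount path
  timeout: 50
}
```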
The `*tlsCertSecretName` alias resolves to the name of a manually deployed cert-manager Certificate:
```yaml
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: ORG-bridge-ingress-cert
spec:
  secretName: {{ .Values.nats.tlsCertSecretName }}
  duration: 2160h # 90 days
  renewBefore: 240h # 10 days
  usages:
    - digital signature
    - key encipherment
    # - server auth
    # - client auth # included because routes mutually verify each other
  issuerRef:
    name: {{ .Values.certManager.issuerName }}
    # name: staging-issuer
    kind: ClusterIssuer
  commonName: "{{ .Values.nats.bridgeAddress }}"
  dnsNames:
    - "{{ .Values.nats.bridgeAddress }}"
```
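To rule out the certificate itself, I've been sanity-checking the issued secret like this (diagnostic sketch; `bridge-ingress-cert` is the secretName from the values above, and namespace flags are omitted for brevity):

```shell
# Has cert-manager actually issued the certificate? READY should be True.
kubectl get certificate ORG-bridge-ingress-cert

# Decode the issued cert and check subject, SANs, and validity dates.
kubectl get secret bridge-ingress-cert -o jsonpath='{.data.tls\.crt}' \
  | base64 -d \
  | openssl x509 -noout -subject -dates -ext subjectAltName
```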
Is this defect reproducible?
Reliably occurs with the above config on our AKS cluster. Haven't tested on other K8s varieties.
Given the capability you are leveraging, describe your expectation?
I am attempting to run a simple single node nats instance without clustering (at this stage). The goal is to deploy it with a pre-configured operator, account, and user, and TLS required for connections coming in from outside the K8s cluster.
Further down the line I will need to configure clustering, but TLS and mem-resolver based auth are the requirements for this stage of work.
I'd like to get the pod running healthily without throwing TLS errors.
I was on the fence about whether to label this as a defect; I'm not sure if this is just a misunderstanding on my part of how TLS is supposed to be configured for the Nats helm chart. Happy to change the category if the error turns out to be on my end. I could really use some guidance on how to get this sorted!
Given the expectation, what is the defect you are observing?
With the above config, I get the following in the nats pod logs. It seems as though the nats pod is attempting to dial pods in the kube-system namespace but failing to do so (assuming that directionality from the `->` in the log messages).
```text
$ k logs ORG-nats-0 -f
Defaulted container "nats" out of: nats, reloader
[6] 2024/07/26 18:52:25.460906 [INF] Starting nats-server
[6] 2024/07/26 18:52:25.460957 [INF]   Version:  2.10.17
[6] 2024/07/26 18:52:25.460959 [INF]   Git:      [b91de03]
[6] 2024/07/26 18:52:25.460961 [INF]   Name:     ORG-nats-0
[6] 2024/07/26 18:52:25.460963 [INF]   ID:       NCN2JEE4LEXOZNC5OSYDVX6EW23EPPLLMNWPXOMINE3A564MUC77764W
[6] 2024/07/26 18:52:25.460968 [INF] Using configuration file: /etc/nats-config/nats.conf
[6] 2024/07/26 18:52:25.460970 [INF] Trusted Operators
[6] 2024/07/26 18:52:25.460971 [INF]   System  : ""
[6] 2024/07/26 18:52:25.460973 [INF]   Operator: "ORG"
[6] 2024/07/26 18:52:25.460975 [INF]   Issued  : 2024-07-15 16:39:52 +0000 UTC
[6] 2024/07/26 18:52:25.460989 [INF]   Expires : Never
[6] 2024/07/26 18:52:25.462384 [INF] Starting http monitor on 0.0.0.0:8222
[6] 2024/07/26 18:52:25.462518 [INF] Listening for client connections on 0.0.0.0:4222
[6] 2024/07/26 18:52:25.462528 [INF] TLS required for client connections
[6] 2024/07/26 18:52:25.462678 [INF] Server is ready
[6] 2024/07/26 18:52:49.307129 [ERR] 10.224.0.16:64083 - cid:5 - TLS handshake error: read tcp 10.244.11.6:4222->10.224.0.16:64083: read: connection reset by peer
[6] 2024/07/26 18:52:49.601709 [ERR] 10.244.11.1:22990 - cid:6 - TLS handshake error: read tcp 10.244.11.6:4222->10.244.11.1:22990: read: connection reset by peer
[6] 2024/07/26 18:52:50.421799 [ERR] 10.224.0.12:15840 - cid:7 - TLS handshake error: read tcp 10.244.11.6:4222->10.224.0.12:15840: read: connection reset by peer
[6] 2024/07/26 18:52:50.656785 [ERR] 10.224.0.4:59517 - cid:8 - TLS handshake error: read tcp 10.244.11.6:4222->10.224.0.4:59517: read: connection reset by peer
[6] 2024/07/26 18:52:50.669588 [ERR] 10.224.0.7:35999 - cid:9 - TLS handshake error: read tcp 10.244.11.6:4222->10.224.0.7:35999: read: connection reset by peer
[6] 2024/07/26 18:52:50.674150 [ERR] 10.224.0.14:46192 - cid:10 - TLS handshake error: read tcp 10.244.11.6:4222->10.224.0.14:46192: read: connection reset by peer
[6] 2024/07/26 18:52:50.738035 [ERR] 10.224.0.17:39583 - cid:11 - TLS handshake error: read tcp 10.244.11.6:4222->10.224.0.17:39583: read: connection reset by peer
[6] 2024/07/26 18:52:51.211579 [ERR] 10.224.0.15:32887 - cid:12 - TLS handshake error: read tcp 10.244.11.6:4222->10.224.0.15:32887: read: connection reset by peer
[6] 2024/07/26 18:52:51.408095 [ERR] 10.224.0.13:45266 - cid:13 - TLS handshake error: read tcp 10.244.11.6:4222->10.224.0.13:45266: read: connection reset by peer
[6] 2024/07/26 18:52:55.309357 [ERR] 10.224.0.16:6606 - cid:14 - TLS handshake error: read tcp 10.244.11.6:4222->10.224.0.16:6606: read: connection reset by peer
.....
```
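To see which peers are failing, I tallied the remote address from each error over a saved copy of the log (`nats.log` is a hypothetical filename; if I'm reading Go's `read tcp local->remote` error format right, the address after the arrow is the remote peer):

```shell
# Tally the remote peers from the handshake errors; in Go's
# "read tcp local->remote" errors, the address after the arrow is the peer.
grep 'TLS handshake error' nats.log \
  | sed -E 's/.*->([0-9.]+):[0-9]+.*/\1/' \
  | sort | uniq -c | sort -rn
```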
10.244.14.232 is the pod IP of the nats instance:

```text
ORG-nats-0   2/2   Running   0   15m   10.244.14.232   aks-default-36350781-vmss000068   <none>   <none>
```
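To check whether the server side of the handshake is healthy at all, a manual probe along these lines might help (sketch; `ORG-bridge.company.com` is the example bridgeAddress from the values, and port-forwarding bypasses the LoadBalancer entirely):

```shell
# Forward the client port from the pod to localhost.
kubectl port-forward pod/ORG-nats-0 4222:4222 &

# Try a TLS handshake directly; a healthy server should present the
# bridge certificate and complete the handshake.
openssl s_client -connect localhost:4222 -servername ORG-bridge.company.com </dev/null
```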
All of the IPs it is attempting to dial look like kube-system pods, some of which share the same internal cluster IP:
```text
cloud-node-manager-nsbpx   1/1   Running   0   26d     10.224.0.12     aks-amd64-25378925-vmss000004     <none>   <none>
cloud-node-manager-qbhbs   1/1   Running   0   130m    10.224.0.10     aks-default-36350781-vmss0000a8   <none>   <none>
coredns-6745896b65-fv8df   1/1   Running   0   7d22h   10.244.14.170   aks-default-36350781-vmss000068   <none>   <none>
coredns-6745896b65-mtnmx   1/1   Running   0   17d     10.244.22.17    aks-amd64-25378925-vmss000000     <none>   <none>
coredns-6745896b65-tn749   1/1   Running   0   17d     10.244.0.13     aks-amd64-25378925-vmss000010     <none>   <none>
....
kube-proxy-2scnf           1/1   Running   0   23d     10.224.0.14     aks-amd64-25378925-vmss000010     <none>   <none>
kube-proxy-b9m4d           1/1   Running   0   26d     10.224.0.7      aks-amd64-25378925-vmss000000     <none>   <none>
kube-proxy-fmncs           1/1   Running   0   26d     10.224.0.12     aks-amd64-25378925-vmss000004     <none>   <none>
kube-proxy-gk9dj           1/1   Running   0   10d     10.224.0.16     aks-amd64-25378925-vmss000011     <none>   <none>
kube-proxy-h9vcs           1/1   Running   0   27h     10.224.0.17     aks-default-36350781-vmss0000a4   <none>   <none>
....
```
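One thing worth checking (a diagnostic sketch, using one of the source IPs from the errors): kube-proxy and cloud-node-manager run with hostNetwork, so the 10.224.0.x addresses they report are node IPs rather than pod IPs, which would mean the connections could be coming from anything on those hosts:

```shell
# Is 10.224.0.16 a node IP rather than a pod IP?
kubectl get nodes -o wide | grep 10.224.0.16

# Which pods (in any namespace) report that IP?
kubectl get pods -A -o wide | grep 10.224.0.16
```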
This is where my understanding starts to break down: I'm not sure why nats would be dialing pods in the kube-system namespace. I would really appreciate any guidance you can provide! It almost feels like setting nats.tls.enabled broke something about nats talking to other in-cluster pods.
I looked at examples like https://gist.github.com/wallyqs/d9c9131a5bd5e247b2e4a6d4aac898af, but they seemed to be aimed at nats clustering. As far as I understand, I shouldn't need a self-signed routes config for a single-pod deploy?