
TLS Handshake error between Nats pod and kube-system pods #921

@Arkessler

Description

What version were you using?

helm version: 1.2.1
Nats: nats:2.10.17-alpine

What environment was the server running in?

Azure AKS k8s version 1.28.9

Nats chart included as a sub-chart of the main release, so the following config gets merged with the nats chart's default values.yaml.

Nats section of the values.yaml

nats:
  # ========================== My Org's custom nats config =========================
  # Whether or not to install the nats subchart to the "nats" namespace
  enabled: false

  # FQDN for bridge agents to connect to nats
  # Example: ORG-bridge.company.com
  bridgeAddress: 
  
  # Name of the secret containing the TLS certificate. Defines a helm anchor so
  # that this value can be passed to the nats chart below.
  tlsCertSecretName: &tlsCertSecretName bridge-ingress-cert

  # =========================== Nats Helm Chart Config ============================
  # Expose service as a public IP
  service:
    merge:
      spec:
        type: LoadBalancer

  # Config for nats server
  config:
    # Flexible config block that gets merged into nats.conf
    merge:
      # Resolver.conf config for memory resolver. Set with:
      # --set nats.config.merge.operator=$NATS_OPERATOR_JWT \
      # --set nats.config.merge.system_account=$NATS_SYS_ACCOUNT_ID \
      # --set nats.config.merge.resolver_preload.$NATS_SYS_ACCOUNT_ID=$NATS_SYS_ACCOUNT_JWT
      operator: OPERATOR_JWT
      system_account: SYSTEM_ACCOUNT_ID
      resolver: MEM
      resolver_preload:
        # SYSTEM_ACCOUNT_ID: SYSTEM_ACCOUNT_JWT

    nats:
      tls:
        enabled: true
        secretName: *tlsCertSecretName
        merge:
          timeout: 50

    monitor:
      tls:
        enabled: true

    # Run resolver: https://docs.nats.io/running-a-nats-service/configuration/securing_nats/auth_intro/jwt/resolver
    resolver:
      enabled: true

The *tlsCertSecretName is a YAML alias for the name of a manually deployed Certificate:

apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: ORG-bridge-ingress-cert
spec:
  secretName: {{ .Values.nats.tlsCertSecretName }}
  duration: 2160h # 90 days
  renewBefore: 240h # 10 days
  usages:
  - digital signature
  - key encipherment
  # - server auth
  # - client auth # included because routes mutually verify each other
  issuerRef:
    name: {{ .Values.certManager.issuerName }}
    # name: staging-issuer
    kind: ClusterIssuer
  commonName: "{{ .Values.nats.bridgeAddress }}"
  dnsNames:
  - "{{ .Values.nats.bridgeAddress }}"
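One way to sanity-check that the issued secret actually contains the expected certificate (assuming cert-manager populated the secret named by `tlsCertSecretName` with its standard `tls.crt` key) is:

```shell
# Inspect the certificate cert-manager stored in the secret
# (secret name taken from tlsCertSecretName above).
kubectl get secret bridge-ingress-cert -o jsonpath='{.data.tls\.crt}' \
  | base64 -d \
  | openssl x509 -noout -subject -dates
```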

Is this defect reproducible?

Reliably occurs with the above config on our AKS cluster. Haven't tested on other K8s varieties.

Given the capability you are leveraging, describe your expectation?

I am attempting to run a simple single node nats instance without clustering (at this stage). The goal is to deploy it with a pre-configured operator, account, and user, and TLS required for connections coming in from outside the K8s cluster.

Further down the line I will need to configure clustering, but TLS and mem-resolver based auth are the requirements for this stage of work.

I'd like to get the pod running healthily without throwing TLS errors.

I was on the fence about whether to label this as a defect; I'm not sure if this is just a misunderstanding on my part of how TLS is supposed to be configured for the Nats helm chart. Happy to change the category if it looks like the error's on my end. I could really use some guidance on how to get this sorted!

Given the expectation, what is the defect you are observing?

With the above config, I get the following in the nats pod logs. It seems as though the nats pod is attempting to dial pods in the kube-system namespace but failing to do so (assuming the directionality implied by the `->` in the log messages).

k logs ORG-nats-0 -f
Defaulted container "nats" out of: nats, reloader
[6] 2024/07/26 18:52:25.460906 [INF] Starting nats-server
[6] 2024/07/26 18:52:25.460957 [INF]   Version:  2.10.17
[6] 2024/07/26 18:52:25.460959 [INF]   Git:      [b91de03]
[6] 2024/07/26 18:52:25.460961 [INF]   Name:     ORG-nats-0
[6] 2024/07/26 18:52:25.460963 [INF]   ID:       NCN2JEE4LEXOZNC5OSYDVX6EW23EPPLLMNWPXOMINE3A564MUC77764W
[6] 2024/07/26 18:52:25.460968 [INF] Using configuration file: /etc/nats-config/nats.conf
[6] 2024/07/26 18:52:25.460970 [INF] Trusted Operators
[6] 2024/07/26 18:52:25.460971 [INF]   System  : ""
[6] 2024/07/26 18:52:25.460973 [INF]   Operator: "ORG"
[6] 2024/07/26 18:52:25.460975 [INF]   Issued  : 2024-07-15 16:39:52 +0000 UTC
[6] 2024/07/26 18:52:25.460989 [INF]   Expires : Never
[6] 2024/07/26 18:52:25.462384 [INF] Starting http monitor on 0.0.0.0:8222
[6] 2024/07/26 18:52:25.462518 [INF] Listening for client connections on 0.0.0.0:4222
[6] 2024/07/26 18:52:25.462528 [INF] TLS required for client connections
[6] 2024/07/26 18:52:25.462678 [INF] Server is ready
[6] 2024/07/26 18:52:49.307129 [ERR] 10.224.0.16:64083 - cid:5 - TLS handshake error: read tcp 10.244.11.6:4222->10.224.0.16:64083: read: connection reset by peer
[6] 2024/07/26 18:52:49.601709 [ERR] 10.244.11.1:22990 - cid:6 - TLS handshake error: read tcp 10.244.11.6:4222->10.244.11.1:22990: read: connection reset by peer
[6] 2024/07/26 18:52:50.421799 [ERR] 10.224.0.12:15840 - cid:7 - TLS handshake error: read tcp 10.244.11.6:4222->10.224.0.12:15840: read: connection reset by peer
[6] 2024/07/26 18:52:50.656785 [ERR] 10.224.0.4:59517 - cid:8 - TLS handshake error: read tcp 10.244.11.6:4222->10.224.0.4:59517: read: connection reset by peer
[6] 2024/07/26 18:52:50.669588 [ERR] 10.224.0.7:35999 - cid:9 - TLS handshake error: read tcp 10.244.11.6:4222->10.224.0.7:35999: read: connection reset by peer
[6] 2024/07/26 18:52:50.674150 [ERR] 10.224.0.14:46192 - cid:10 - TLS handshake error: read tcp 10.244.11.6:4222->10.224.0.14:46192: read: connection reset by peer
[6] 2024/07/26 18:52:50.738035 [ERR] 10.224.0.17:39583 - cid:11 - TLS handshake error: read tcp 10.244.11.6:4222->10.224.0.17:39583: read: connection reset by peer
[6] 2024/07/26 18:52:51.211579 [ERR] 10.224.0.15:32887 - cid:12 - TLS handshake error: read tcp 10.244.11.6:4222->10.224.0.15:32887: read: connection reset by peer
[6] 2024/07/26 18:52:51.408095 [ERR] 10.224.0.13:45266 - cid:13 - TLS handshake error: read tcp 10.244.11.6:4222->10.224.0.13:45266: read: connection reset by peer
[6] 2024/07/26 18:52:55.309357 [ERR] 10.224.0.16:6606 - cid:14 - TLS handshake error: read tcp 10.244.11.6:4222->10.224.0.16:6606: read: connection reset by peer
.....

10.244.14.232 is the pod IP of the nats instance:

ORG-nats-0                      2/2     Running   0          15m     10.244.14.232   aks-default-36350781-vmss000068   <none>           <none>

All of the IPs it is attempting to dial look like kube-system pods, some of which share the same internal cluster IP:

cloud-node-manager-nsbpx              1/1     Running   0          26d     10.224.0.12     aks-amd64-25378925-vmss000004     <none>           <none>
cloud-node-manager-qbhbs              1/1     Running   0          130m    10.224.0.10     aks-default-36350781-vmss0000a8   <none>           <none>
coredns-6745896b65-fv8df              1/1     Running   0          7d22h   10.244.14.170   aks-default-36350781-vmss000068   <none>           <none>
coredns-6745896b65-mtnmx              1/1     Running   0          17d     10.244.22.17    aks-amd64-25378925-vmss000000     <none>           <none>
coredns-6745896b65-tn749              1/1     Running   0          17d     10.244.0.13     aks-amd64-25378925-vmss000010     <none>           <none>
....
kube-proxy-2scnf                      1/1     Running   0          23d     10.224.0.14     aks-amd64-25378925-vmss000010     <none>           <none>
kube-proxy-b9m4d                      1/1     Running   0          26d     10.224.0.7      aks-amd64-25378925-vmss000000     <none>           <none>
kube-proxy-fmncs                      1/1     Running   0          26d     10.224.0.12     aks-amd64-25378925-vmss000004     <none>           <none>
kube-proxy-gk9dj                      1/1     Running   0          10d     10.224.0.16     aks-amd64-25378925-vmss000011     <none>           <none>
kube-proxy-h9vcs                      1/1     Running   0          27h     10.224.0.17     aks-default-36350781-vmss0000a4   <none>           <none>
....
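To compare the two lists, the remote peer IPs can be pulled out of the handshake-error lines with a small one-liner (sample lines below are copied from the log output above):

```shell
# Extract the unique remote peer IPs from the TLS handshake error lines.
# Sample log lines copied from the pod logs above.
cat <<'EOF' > /tmp/nats.log
[6] 2024/07/26 18:52:49.307129 [ERR] 10.224.0.16:64083 - cid:5 - TLS handshake error: read tcp 10.244.11.6:4222->10.224.0.16:64083: read: connection reset by peer
[6] 2024/07/26 18:52:49.601709 [ERR] 10.244.11.1:22990 - cid:6 - TLS handshake error: read tcp 10.244.11.6:4222->10.244.11.1:22990: read: connection reset by peer
[6] 2024/07/26 18:52:50.421799 [ERR] 10.224.0.12:15840 - cid:7 - TLS handshake error: read tcp 10.244.11.6:4222->10.224.0.12:15840: read: connection reset by peer
EOF

# The peer is the address on the right-hand side of the "->".
grep 'TLS handshake error' /tmp/nats.log \
  | sed -n 's/.*->\([0-9.]*\):[0-9]*:.*/\1/p' \
  | sort -u
```

With the full log, this yields the same set of IPs that show up in the kube-system pod listing.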

This is where my understanding starts to break down: I'm not sure why nats would be dialing pods in the kube-system namespace. I would really appreciate any guidance you can provide! It almost feels like setting `nats.tls.enabled` broke something with nats talking to other in-cluster pods.

I looked at examples like https://gist.github.com/wallyqs/d9c9131a5bd5e247b2e4a6d4aac898af, but those seemed to be aimed at nats clustering. As far as I understand, I shouldn't need a self-signed routes config for a single-pod deploy?
