SQL Server Multi-Subnet Cluster connection failures

Summary
=======

We observe connection failures to `SQL Server Multi-Subnet Clusters` when the sql client application is running inside an ambient mesh and connecting to SQL Servers outside of the mesh.

I _think_ the cause is ztunnel's TCP proxy "optimistically" completing the TCP handshake between the downstream client (i.e. dotnet sql client) and ztunnel _prior_ to completing & validating the handshake between ztunnel and the upstream (i.e. sql server multi subnet cluster).

Details
=====

My understanding of SQL Server Multi-Subnet Clusters is that the instances actively coordinate amongst each other so that only a single instance is listening for connections at any given time. SQL Server clients discover the cluster's servers via a DNS record, which returns the set of IP addresses bound to the cluster (example response in a screenshot below).

After the client obtains the list of cluster IPs via DNS, it then, in parallel, attempts to open a connection to each of these IPs. In a normal connection flow, once the client successfully completes a TCP handshake to the single, active IP, it then uses the connection for the sql client session and discards the other connection attempts.
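
To make the normal flow above concrete, here's a minimal sketch of the "dial every IP in parallel, keep the first that connects" pattern. It is only an illustration in Python's `asyncio`, not the dotnet sql client's actual implementation; `connect_first`, its `addrs` parameter, and the 2 second timeout are hypothetical names and values chosen for the example.

```python
import asyncio


async def connect_first(addrs, timeout=2.0):
    """Dial every (host, port) concurrently and return the first success.

    Hypothetical helper for illustration only; returns (addr, reader, writer).
    """

    async def dial(addr):
        host, port = addr
        reader, writer = await asyncio.open_connection(host, port)
        return addr, reader, writer

    tasks = [asyncio.create_task(dial(a)) for a in addrs]
    try:
        # as_completed yields attempts in the order they finish.
        for next_done in asyncio.as_completed(tasks, timeout=timeout):
            try:
                # The first attempt whose TCP handshake completes wins.
                return await next_done
            except (OSError, asyncio.TimeoutError):
                # Refused, unreachable, or timed out: try the next finisher.
                continue
        raise ConnectionError("no address accepted the connection")
    finally:
        # Discard the attempts that did not win (a no-op for finished tasks).
        for task in tasks:
            task.cancel()
```

The point to notice is that "success" here only means a completed TCP handshake. Without a proxy in the path, only the active server can complete one, so the winner is always the right IP.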

Now, when mixing in the ambient mesh, ztunnel's pass-through TCP proxy "optimistically" & "immediately" (10s of microseconds) completes the TCP handshake for _all_ connection attempts to _all_ cluster IPs, regardless of whether the upstream connection is possible.

Because the parallel attempts race each other, the downstream side of the proxy (between the sql client & ztunnel) often completes its TCP handshake for the IP of the _inactive_ server first. At this point, the sql client believes it has successfully established a connection to the inactive server, and drops any other connection to the active server. Subsequent writes to this otherwise invalid connection fail, as the upstream side of the connection cannot be established.
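
The effect can be reproduced locally with a toy accept-then-dial listener. This is a stand-in for the behavior described above, not ztunnel's code, and it assumes nothing about ztunnel's internals; `optimistic_proxy` and `probe` are hypothetical names, and relaying is omitted since only the failure path matters here.

```python
import asyncio


async def optimistic_proxy(upstream_addr, listen_host="127.0.0.1"):
    """Start a listener that accepts downstream first and dials upstream after."""

    async def handle(reader, writer):
        # By the time this callback runs, the downstream TCP handshake has
        # already completed, so the client has seen a "successful" connect.
        try:
            _, up_writer = await asyncio.open_connection(*upstream_addr)
        except OSError:
            # Upstream has no listener: all we can do is tear down downstream.
            writer.close()
            return
        # Relaying bytes is omitted in this sketch.
        up_writer.close()
        writer.close()

    return await asyncio.start_server(handle, listen_host, 0)


async def probe(addr):
    """Connect through the proxy, then try to use the connection."""
    reader, writer = await asyncio.open_connection(*addr)
    connected = True  # reached only because the handshake completed
    writer.write(b"hello")
    try:
        await writer.drain()
        data = await reader.read(1)  # b"" means the peer closed on us
    except ConnectionError:
        data = b""  # the peer reset the connection
    writer.close()
    return connected, data
```

Pointing `optimistic_proxy` at an address with no listener and running `probe` against it shows the same shape as the failure captures below: the connect succeeds, and the first attempt to use the connection finds it closed or reset.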


Random Thoughts on Possible Solutions
-----------------------------------------

There are probably better ideas, but to capture various options and references, here are some thoughts on how this might be resolved:

This could be solved if support is added for `excludeOutboundPorts` (ref: https://github.com/istio/istio/issues/49829). We'd exclude the sql server port and force all other traffic through the mesh.

Another possible solution is to verify the upstream connection handshake prior to completing the downstream handshake... though I'd assume this would have other impacts and downsides.

Given that our workloads connect both to sql server and to other services inside the mesh, we'd prefer not to opt these workloads out of the mesh entirely via the `istio.io/dataplane-mode: none` option, but it's worth a call out.

Packet Captures
============

DNS
----

DNS response for the multi subnet cluster.

IP 10.x.x.72 is _inactive_
IP 10.x.x.7 is _active_

![Image](https://github.com/user-attachments/assets/9deec9b5-256d-4c63-ad65-5ede1d01ffc3)

Success Scenario
------------------

Here's a typical connection sequence when the mesh is inactive, generated by a test app that opens a connection every ~two seconds.

Successful connection to active server on 10.x.x.7:

![Image](https://github.com/user-attachments/assets/ab4a3689-e6de-43d5-b698-d48fef635d08)

Failed connections to inactive server on 10.x.x.72:

![Image](https://github.com/user-attachments/assets/23e5c64c-c644-4399-92ab-9f06dc0f8c92)

Failure Scenario
----------------

Here's a connection attempt to the _inactive_ address, 10.x.x.72, for the downstream side of the connection between the sql client and ztunnel.
This shows the TCP handshake completing between the client and ztunnel, the subsequent data packet, and finally connection termination due to no valid listener on the upstream side.

![Image](https://github.com/user-attachments/assets/069f7fe4-31dc-43e8-b1cc-972550609534)

And corresponding upstream packets between ztunnel and the inactive sql server on 10.x.x.72, which all fail:

![Image](https://github.com/user-attachments/assets/20cbc3b5-8351-49b9-b6f6-00a97ca20a5f)
